VibeVoice
An open-source family of frontier voice AI models from Microsoft, spanning long-form multi-speaker TTS, real-time streaming TTS, and long-form ASR with speaker diarization.
At a Glance
Fully free and open-source under the MIT License. All model weights and code are publicly available.
Listed Apr 2026
About VibeVoice
VibeVoice is a family of open-source frontier voice AI models developed by Microsoft Research, covering both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). It uses continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz and a next-token diffusion framework combining a Large Language Model with a diffusion head for high-fidelity audio generation. The project is released under the MIT License and is intended for research and development purposes.
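The 7.5 Hz frame rate is what makes hour-scale audio tractable for an LLM backbone: the speech-token count grows slowly enough that a full recording fits in one context window. A quick back-of-the-envelope calculation (ours, not from the project docs):

```python
FRAME_RATE_HZ = 7.5  # VibeVoice tokenizer frame rate

def audio_tokens(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech tokens produced for a clip of the given length."""
    return int(minutes * 60 * frame_rate_hz)

print(audio_tokens(60))  # 60-minute ASR input  -> 27000 tokens
print(audio_tokens(90))  # 90-minute TTS output -> 40500 tokens
```

For comparison, a codec running at a more typical 50 Hz would turn the same 60-minute recording into 180,000 tokens, an order of magnitude more sequence to process.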
Key models and features include:
- VibeVoice-ASR — A unified speech-to-text model that handles up to 60-minute long-form audio in a single pass, producing structured transcriptions with speaker identity (Who), timestamps (When), and content (What). Supports 50+ languages and custom hotwords.
- VibeVoice-TTS — A long-form multi-speaker TTS model capable of synthesizing up to 90 minutes of speech with up to 4 distinct speakers. Supports English, Chinese, and other languages with expressive, natural-sounding output.
- VibeVoice-Realtime-0.5B — A lightweight 0.5B parameter real-time streaming TTS model with ~300ms first-audible latency, supporting streaming text input and robust long-form generation (~10 minutes).
- Hugging Face Integration — All model weights are available on Hugging Face Hub; VibeVoice-ASR is natively supported via the Hugging Face Transformers library.
- vLLM Inference Support — VibeVoice-ASR supports vLLM for accelerated inference.
- Finetuning Support — Finetuning code for VibeVoice-ASR is publicly available in the repository.
- Google Colab Demos — Interactive Colab notebooks are provided for quick experimentation with streaming TTS and realtime models.
- Next-Token Diffusion Architecture — Core innovation using acoustic and semantic tokenizers at 7.5 Hz for efficient long-sequence processing while preserving audio fidelity.
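The "Who / When / What" structure of VibeVoice-ASR's output lends itself to simple post-processing. The project docs don't show the exact output schema here, so the line format below (`[Speaker N] (MM:SS-MM:SS) text`) is a hypothetical illustration of how one might parse such a diarized transcript:

```python
import re
from typing import Dict, List

# Hypothetical line format for illustration only; the real
# VibeVoice-ASR output schema may differ.
LINE_RE = re.compile(
    r"\[(?P<who>[^\]]+)\]\s*"          # speaker label, e.g. [Speaker 1]
    r"\((?P<start>\d+:\d{2})-(?P<end>\d+:\d{2})\)\s*"  # time span
    r"(?P<what>.*)"                     # spoken content
)

def parse_transcript(text: str) -> List[Dict[str, str]]:
    """Split a diarized transcript into {who, start, end, what} records."""
    segments = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            segments.append(m.groupdict())
    return segments

demo = (
    "[Speaker 1] (00:03-00:08) Welcome to the meeting.\n"
    "[Speaker 2] (00:08-00:12) Thanks for having me."
)
print(parse_transcript(demo))
```

Keeping speaker, span, and content as separate fields makes it easy to feed the result into downstream steps such as per-speaker word counts or subtitle generation.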
Pricing
Open Source (MIT)
- VibeVoice-ASR model weights
- VibeVoice-Realtime-0.5B model weights
- ASR finetuning code
- Colab demo notebooks
- Hugging Face Transformers integration
Capabilities
Key Features
- Long-form ASR up to 60 minutes in a single pass
- Speaker diarization with timestamps
- Custom hotword support
- 50+ language multilingual ASR
- Long-form multi-speaker TTS up to 90 minutes
- Up to 4 distinct speakers in a single TTS pass
- Real-time streaming TTS with ~300ms latency
- Next-token diffusion architecture
- vLLM inference support
- Hugging Face Transformers integration
- Finetuning code available
- Google Colab demos
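For the real-time model, the figure that matters is time-to-first-audio rather than total synthesis time. Below is a minimal, self-contained sketch of measuring that first-audible latency; the chunk generator is a stand-in for illustration, not the actual VibeVoice streaming API:

```python
import time
from typing import Iterator

def fake_tts_stream(n_chunks: int = 5, chunk_delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a streaming TTS generator; yields audio chunks with a delay."""
    for _ in range(n_chunks):
        time.sleep(chunk_delay_s)
        yield b"\x00" * 1024  # placeholder PCM chunk

def first_audible_latency(stream: Iterator[bytes]) -> float:
    """Seconds until the first audio chunk arrives (the '~300ms' figure)."""
    t0 = time.monotonic()
    next(stream)  # block until the first chunk is produced
    return time.monotonic() - t0

latency = first_audible_latency(fake_tts_stream())
print(f"first-audible latency: {latency * 1000:.0f} ms")
```

With a real streaming model, the same measurement pattern applies: start the clock when text is submitted and stop it when the first audio chunk can be handed to the playback device.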
