VibeVoice
An open-source family of frontier voice AI models from Microsoft, spanning long-form multi-speaker TTS, real-time streaming TTS, and long-form ASR with speaker diarization.
At a Glance
Fully free and open-source under the MIT License. All model weights and code are publicly available.
Listed Apr 2026
About VibeVoice
VibeVoice is a family of open-source frontier voice AI models developed by Microsoft Research, covering both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). It uses continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz and a next-token diffusion framework combining a Large Language Model with a diffusion head for high-fidelity audio generation. The project is released under the MIT License and is intended for research and development purposes.
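The 7.5 Hz frame rate is what makes hour-scale audio tractable for an LLM backbone: the speech-token count grows slowly enough that a full recording fits in one context window. A quick back-of-the-envelope calculation (ours, not from the project docs):

```python
FRAME_RATE_HZ = 7.5  # VibeVoice tokenizer frame rate

def audio_tokens(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech tokens produced for a clip of the given length."""
    return int(minutes * 60 * frame_rate_hz)

print(audio_tokens(60))  # 60-minute ASR input  -> 27000 tokens
print(audio_tokens(90))  # 90-minute TTS output -> 40500 tokens
```

For comparison, a codec running at a more typical 50 Hz would turn the same 60-minute recording into 180,000 tokens, an order of magnitude more sequence to process.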
Key models and features include:
- VibeVoice-ASR — A unified speech-to-text model that handles up to 60-minute long-form audio in a single pass, producing structured transcriptions with speaker identity (Who), timestamps (When), and content (What). Supports 50+ languages and custom hotwords.
- VibeVoice-TTS — A long-form multi-speaker TTS model capable of synthesizing up to 90 minutes of speech with up to 4 distinct speakers. Supports English, Chinese, and other languages with expressive, natural-sounding output.
- VibeVoice-Realtime-0.5B — A lightweight 0.5B parameter real-time streaming TTS model with ~300ms first-audible latency, supporting streaming text input and robust long-form generation (~10 minutes).
- Hugging Face Integration — All model weights are available on Hugging Face Hub; VibeVoice-ASR is natively supported via the Hugging Face Transformers library.
- vLLM Inference Support — VibeVoice-ASR supports vLLM for accelerated inference.
- Finetuning Support — Finetuning code for VibeVoice-ASR is publicly available in the repository.
- Google Colab Demos — Interactive Colab notebooks are provided for quick experimentation with streaming TTS and realtime models.
- Next-Token Diffusion Architecture — Core innovation using acoustic and semantic tokenizers at 7.5 Hz for efficient long-sequence processing while preserving audio fidelity.
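The "Who / When / What" structure of VibeVoice-ASR's output lends itself to simple post-processing. The project docs don't show the exact output schema here, so the line format below (`[Speaker N] (MM:SS-MM:SS) text`) is a hypothetical illustration of how one might parse such a diarized transcript:

```python
import re
from typing import Dict, List

# Hypothetical line format for illustration only; the real
# VibeVoice-ASR output schema may differ.
LINE_RE = re.compile(
    r"\[(?P<who>[^\]]+)\]\s*"          # speaker label, e.g. [Speaker 1]
    r"\((?P<start>\d+:\d{2})-(?P<end>\d+:\d{2})\)\s*"  # time span
    r"(?P<what>.*)"                     # spoken content
)

def parse_transcript(text: str) -> List[Dict[str, str]]:
    """Split a diarized transcript into {who, start, end, what} records."""
    segments = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            segments.append(m.groupdict())
    return segments

demo = (
    "[Speaker 1] (00:03-00:08) Welcome to the meeting.\n"
    "[Speaker 2] (00:08-00:12) Thanks for having me."
)
print(parse_transcript(demo))
```

Keeping speaker, span, and content as separate fields makes it easy to feed the result into downstream steps such as per-speaker word counts or subtitle generation.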
Pricing
Open Source (MIT)
- VibeVoice-ASR model weights
- VibeVoice-Realtime-0.5B model weights
- ASR finetuning code
- Colab demo notebooks
- Hugging Face Transformers integration
Capabilities
Key Features
- Long-form ASR up to 60 minutes in a single pass
- Speaker diarization with timestamps
- Custom hotword support
- 50+ language multilingual ASR
- Long-form multi-speaker TTS up to 90 minutes
- Up to 4 distinct speakers in a single TTS pass
- Real-time streaming TTS with ~300ms latency
- Next-token diffusion architecture
- vLLM inference support
- Hugging Face Transformers integration
- Finetuning code available
- Google Colab demos
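For the real-time model, the figure that matters is time-to-first-audio rather than total synthesis time. Below is a minimal, self-contained sketch of measuring that first-audible latency; the chunk generator is a stand-in for illustration, not the actual VibeVoice streaming API:

```python
import time
from typing import Iterator

def fake_tts_stream(n_chunks: int = 5, chunk_delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a streaming TTS generator; yields audio chunks with a delay."""
    for _ in range(n_chunks):
        time.sleep(chunk_delay_s)
        yield b"\x00" * 1024  # placeholder PCM chunk

def first_audible_latency(stream: Iterator[bytes]) -> float:
    """Seconds until the first audio chunk arrives (the '~300ms' figure)."""
    t0 = time.monotonic()
    next(stream)  # block until the first chunk is produced
    return time.monotonic() - t0

latency = first_audible_latency(fake_tts_stream())
print(f"first-audible latency: {latency * 1000:.0f} ms")
```

With a real streaming model, the same measurement pattern applies: start the clock when text is submitted and stop it when the first audio chunk can be handed to the playback device.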
