# VibeVoice

> An open-source family of frontier voice AI models from Microsoft, including long-form TTS, multi-speaker speech synthesis, real-time streaming TTS, and long-form ASR with speaker diarization.

VibeVoice is a **family of open-source frontier voice AI models** developed by Microsoft Research, covering both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). Its models use continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz, paired with a next-token diffusion framework that combines a Large Language Model with a diffusion head for high-fidelity audio generation. The project is released under the MIT License and is intended for research and development purposes.
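
To get a feel for why the 7.5 Hz frame rate matters for long-sequence processing, here is a back-of-the-envelope sketch. The 75 Hz comparison rate is an illustrative assumption for a typical higher-rate speech codec, not a figure from this project:

```python
# Sequence length a speech tokenizer produces for a given audio duration.
def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokens for `duration_s` seconds at `frame_rate_hz` frames/sec."""
    return int(duration_s * frame_rate_hz)

# 60 minutes of audio at VibeVoice's 7.5 Hz tokenizer:
vibevoice_frames = num_frames(60 * 60, 7.5)    # 27,000 tokens
# The same audio at a hypothetical 75 Hz codec, for comparison:
higher_rate_frames = num_frames(60 * 60, 75.0)  # 270,000 tokens

print(vibevoice_frames, higher_rate_frames)  # 27000 270000
```

At 7.5 Hz, a full hour of audio fits in 27,000 tokens, an order of magnitude fewer than a higher-rate codec would produce, which is what makes single-pass 60-minute ASR and 90-minute TTS contexts tractable.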

Key models and features include:

- **VibeVoice-ASR** — *A unified speech-to-text model that handles up to 60-minute long-form audio in a single pass, producing structured transcriptions with speaker identity (Who), timestamps (When), and content (What). Supports 50+ languages and custom hotwords.*
- **VibeVoice-TTS** — *A long-form multi-speaker TTS model capable of synthesizing up to 90 minutes of speech with up to 4 distinct speakers. Supports English, Chinese, and other languages with expressive, natural-sounding output.*
- **VibeVoice-Realtime-0.5B** — *A lightweight 0.5B-parameter real-time streaming TTS model with ~300 ms latency to first audible speech, supporting streaming text input and robust long-form generation (~10 minutes).*
- **Hugging Face Integration** — *All model weights are available on Hugging Face Hub; VibeVoice-ASR is natively supported via the Hugging Face Transformers library.*
- **vLLM Inference Support** — *VibeVoice-ASR supports vLLM for accelerated inference.*
- **Finetuning Support** — *Finetuning code for VibeVoice-ASR is publicly available in the repository.*
- **Google Colab Demos** — *Interactive Colab notebooks are provided for quick experimentation with streaming TTS and realtime models.*
- **Next-Token Diffusion Architecture** — *Core innovation using acoustic and semantic tokenizers at 7.5 Hz for efficient long-sequence processing while preserving audio fidelity.*
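
The Who/When/What structure that VibeVoice-ASR produces can be pictured with a small sketch. The segment fields and rendering below are illustrative assumptions for this document, not the model's actual output schema:

```python
# Hypothetical diarized-transcript segments: speaker identity (Who),
# start/end timestamps in seconds (When), and transcribed text (What).
segments = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.2, "text": "Welcome to the show."},
    {"speaker": "Speaker 2", "start": 4.2, "end": 9.8, "text": "Thanks for having me."},
]

def format_segment(seg: dict) -> str:
    """Render one segment as '[start-end] speaker: text'."""
    return f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"

for seg in segments:
    print(format_segment(seg))
# [0.0-4.2] Speaker 1: Welcome to the show.
# [4.2-9.8] Speaker 2: Thanks for having me.
```

A structure like this is what "speaker diarization with timestamps" implies downstream: each stretch of audio is attributed to a speaker and anchored in time, so transcripts can be searched, aligned, or edited per speaker turn.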

## Features
- Long-form ASR up to 60 minutes in a single pass
- Speaker diarization with timestamps
- Custom hotword support
- 50+ language multilingual ASR
- Long-form multi-speaker TTS up to 90 minutes
- Up to 4 distinct speakers in a single TTS pass
- Real-time streaming TTS with ~300 ms latency
- Next-token diffusion architecture
- vLLM inference support
- Hugging Face Transformers integration
- Finetuning code available
- Google Colab demos

## Integrations
Hugging Face Transformers, vLLM, Google Colab, Gradio

## Platforms
WINDOWS, LINUX, API, DEVELOPER_SDK, CLI

## Pricing
Open Source

## Links
- Website: https://microsoft.github.io/VibeVoice/
- Documentation: https://github.com/microsoft/VibeVoice/tree/main/docs
- Repository: https://github.com/microsoft/VibeVoice
- EveryDev.ai: https://www.everydev.ai/tools/vibevoice
