Dia is an open-source text-to-speech model from Nari Labs that generates realistic dialogue audio directly from text transcripts, with support for multiple speakers, emotional delivery, and non-verbal sounds.
At a Glance
Pricing
Fully open-source model available for free download and local use.
Listed Mar 2026
About Dia
Dia is an open-source 1.6B parameter text-to-speech model developed by Nari Labs, designed to generate highly realistic dialogue directly from transcripts. It supports multi-speaker audio generation, non-verbal cues like laughter and coughing, and fine-grained emotion and tone control. Dia can also perform voice cloning using an audio reference, making it a powerful tool for content creators, researchers, and developers building conversational AI applications.
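For a concrete sense of how transcript-driven generation works, here is a minimal inference sketch based on the usage shown in the project's README; the package layout, speaker-tag syntax, and output sample rate are taken from that example and may differ between releases, so check the repository for your version.

```python
# Minimal Dia inference sketch (based on the project README; names may vary by version).
import soundfile as sf

from dia.model import Dia

# Load the 1.6B open-weights checkpoint from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Speaker tags ([S1], [S2]) mark dialogue turns; parenthesized cues such as
# (laughs) request non-verbal sounds in the generated audio.
script = (
    "[S1] Welcome back to the show. (laughs) We have a lot to get through today. "
    "[S2] Thanks for having me. Let's dive right in."
)

audio = model.generate(script)

# The README example writes the output at 44.1 kHz.
sf.write("dialogue.wav", audio, 44100)
```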
- Multi-speaker dialogue generation: Generate realistic conversations between multiple speakers directly from a text transcript using speaker tags.
- Non-verbal audio support: Include sounds like laughter, coughing, and sighs in generated audio by adding special tokens in the transcript.
- Emotion and tone control: Guide the emotional delivery of speech through natural language descriptions embedded in the transcript.
- Voice cloning: Provide an audio reference clip to clone a specific voice and use it in generated dialogue (see the sketch after this list).
- Open-source model weights: Download and run the 1.6B parameter model locally via Hugging Face or the GitHub repository.
- Gradio demo: Try Dia instantly through the hosted Hugging Face Spaces demo without any local setup.
- Python API: Integrate Dia into your own applications using the provided Python package and inference scripts.
- Local inference: Run the model on your own hardware for full control over privacy and customization.
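As referenced in the voice-cloning item above, the following rough sketch shows how an audio reference might be supplied alongside the transcript through the Python API. It reuses the `Dia` class from the earlier example; the `audio_prompt` argument name and the convention of prefixing the reference clip's transcript are assumptions drawn from the repository's voice-cloning example and may not match every release.

```python
# Voice-cloning sketch. Assumptions: the `audio_prompt` argument name and the
# transcript-prefix convention; verify against the repository's examples.
import soundfile as sf

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip, followed by the new lines to generate
# in the cloned voice.
reference_transcript = "[S1] This is a short sample of the voice I want to clone."
new_lines = " [S1] And this is brand new dialogue spoken in that same voice."

audio = model.generate(
    reference_transcript + new_lines,
    audio_prompt="reference_clip.mp3",  # assumed parameter name for the reference audio
)

sf.write("cloned_dialogue.wav", audio, 44100)
```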
Pricing
Open Source
Fully open-source model available for free download and local use.
- 1.6B parameter TTS model
- Multi-speaker dialogue generation
- Voice cloning
- Non-verbal audio cues
- Emotion control
Capabilities
Key Features
- Multi-speaker dialogue generation
- Non-verbal audio cues (laughter, coughing, sighs)
- Emotion and tone control via transcript
- Voice cloning from audio reference
- 1.6B parameter open-source model
- Hugging Face Spaces demo
- Python API
- Local inference support
