# Miso TTS 8B

> An 8-billion parameter open-source text-to-speech model designed for high-quality, highly emotive conversational speech generation with voice cloning support.

Miso TTS 8B is an open-source, 8-billion parameter text-to-dialogue model built by Miso Labs (Kamino Learning, Inc.) for high-quality conversational speech synthesis. The model is available on GitHub and Hugging Face, and can be run locally on CUDA-capable hardware. A live demo is hosted on the Miso Labs landing page at misolabs.ai.

## What It Is

Miso TTS 8B is a text-to-speech model in the RVQ (Residual Vector Quantization) Transformer category, inspired by the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, making it suitable for conversational speech generation rather than simple single-utterance synthesis. The model currently supports English only.
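To make the RVQ idea concrete, here is a minimal toy sketch of residual vector quantization in plain Python. It is not Mimi's tokenizer or anything from the Miso TTS codebase: the codebook sizes, the random codebooks, and the helper names (`make_codebooks`, `rvq_encode`, `rvq_decode`) are all illustrative assumptions. Each stage quantizes whatever the previous stages left unexplained, so a frame ends up represented by one code index per codebook.

```python
import random

def make_codebooks(num_stages, size, dim, seed=0):
    """Build toy codebooks (hypothetical values); later stages use smaller scales."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        # A zero entry guarantees a stage can never increase the residual.
        entries = [[0.0] * dim]
        entries += [[rng.uniform(-scale, scale) for _ in range(dim)]
                    for _ in range(size - 1)]
        books.append(entries)
    return books

def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: sq_error(book[i], residual))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct a vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for book, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, book[idx])]
    return out

# Toy sizes for illustration only (the model itself uses 32 audio codebooks).
codebooks = make_codebooks(num_stages=3, size=8, dim=4)
frame = [0.3, -0.2, 0.5, 0.1]
codes = rvq_encode(codebooks, frame)
err_first = sq_error(frame, rvq_decode(codebooks[:1], codes[:1]))
err_full = sq_error(frame, rvq_decode(codebooks, codes))
print(codes, err_first, err_full)
```

Because every toy codebook contains a zero vector, adding stages can only keep or reduce the reconstruction error, which is the property that makes later codebooks act as refinements of earlier ones.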

## Architecture

The model uses two transformer components working in tandem:

- **Backbone transformer (Llama 8B):** Consumes interleaved text and audio-frame embeddings, conditioning generation on conversation history.
- **Audio decoder transformer (Llama 300M):** Autoregressively predicts higher-order audio codebooks within each frame.

Key model specs include a text vocabulary of 128,256 tokens, an audio vocabulary of 2,051 tokens, 32 audio codebooks, the Mimi audio tokenizer, and a maximum sequence length of 2,048. Default inference uses `torch.bfloat16` precision.
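The specs above can be collected into a small configuration object for reference. This is a hedged sketch: the class name `MisoTTSSpecs` and its field names are hypothetical and not taken from the repository. The split of one codebook for the backbone and the rest for the decoder is an assumption based on the Sesame CSM design and on the description of the decoder handling the higher-order codebooks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MisoTTSSpecs:
    """Hypothetical container for the published specs; not the repo's own class."""
    backbone: str = "Llama 8B"
    decoder: str = "Llama 300M"
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048
    dtype: str = "bfloat16"
    audio_tokenizer: str = "Mimi"

    @property
    def decoder_codebooks(self) -> int:
        # Assumption (CSM-style): the backbone predicts the first codebook,
        # and the audio decoder predicts the remaining ones in each frame.
        return self.audio_num_codebooks - 1

specs = MisoTTSSpecs()
print(specs.audio_num_codebooks, specs.decoder_codebooks)
```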

## Voice Cloning and Prompted Generation

Miso TTS 8B supports optional prompted generation, allowing the model to condition on prior audio for voice cloning. Users supply a `Segment` object containing a speaker ID, transcript, and audio waveform as context. Without a prompt, the model generates speech from text alone. Generated audio is watermarked by default using the SilentCipher watermarking model from Sony.
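The shape of that context can be sketched as a small data structure. The field names below (`speaker`, `text`, `audio`) and the helpers are assumptions modeled on the description above, not the repository's actual `Segment` definition, which likely stores the waveform as a `torch.Tensor` rather than a plain list. The 24 kHz sample rate is the rate Mimi operates at and is used here only to compute a duration.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Segment:
    """Illustrative stand-in for a context segment (hypothetical field names)."""
    speaker: int             # speaker ID
    text: str                # transcript of the audio
    audio: Sequence[float]   # mono waveform samples

    def duration_seconds(self, sample_rate: int = 24_000) -> float:
        # Assumes the waveform is sampled at `sample_rate` Hz.
        return len(self.audio) / sample_rate

def generation_mode(context: List[Segment]) -> str:
    """Prompted (voice-cloning) generation needs at least one context segment."""
    return "prompted" if context else "text-only"

# A two-second silent placeholder clip stands in for real reference audio.
reference = Segment(speaker=0, text="Hello there.", audio=[0.0] * 48_000)
print(generation_mode([reference]), reference.duration_seconds())
print(generation_mode([]))
```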

## Setup Path

The repository supports two installation paths:

- **uv (recommended):** Clone the repo, run `uv sync --python 3.10`, activate the virtual environment, and run `uv run python run_misotts.py`.
- **pip:** Create a Python 3.10 venv, install with `pip install -e .`, and run `python run_misotts.py`.

Model weights are hosted publicly on Hugging Face at `MisoLabs/MisoTTS` and are downloaded automatically on first run via the Hugging Face Hub cache.
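For machines where the first run should not trigger a large download, the weights can be fetched into the Hub cache ahead of time. The sketch below uses `snapshot_download` from the `huggingface_hub` package; the helper name `fetch_weights` is hypothetical, and whether the run script picks up a pre-populated cache exactly this way is an assumption rather than something the repository documents here.

```python
from pathlib import Path

REPO_ID = "MisoLabs/MisoTTS"  # repository ID as stated above

def fetch_weights(repo_id: str = REPO_ID) -> Path:
    """Download (or reuse) a cached snapshot of the model repo and return its path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(repo_id=repo_id))

if __name__ == "__main__":
    # Requires network access and enough disk space for the checkpoint.
    print(fetch_weights())
```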

## Deployment Notes and Safety

The model requires a CUDA GPU with enough VRAM to hold the checkpoint at the precision being loaded. The repository notes that Miso TTS 8B is a large model and recommends GPU inference for best results. The project's safety guidelines explicitly prohibit using the model to impersonate people, create deceptive audio, commit fraud, or generate harmful content. Because generated audio is watermarked by default, deployers are advised to configure their own private watermark key.
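A rough way to size the GPU is to multiply the parameter count by the bytes per parameter for the chosen precision. The sketch below is a back-of-the-envelope estimate only: it treats the nominal "8B" and "300M" figures as exact parameter counts (an assumption), and it ignores activations, the KV cache, the Mimi tokenizer, and the watermarking model, all of which need additional memory.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Nominal sizes taken from the component names; real counts may differ.
BACKBONE_PARAMS = 8e9
DECODER_PARAMS = 300e6

total_bf16 = weight_memory_gib(BACKBONE_PARAMS + DECODER_PARAMS, "bfloat16")
total_fp32 = weight_memory_gib(BACKBONE_PARAMS + DECODER_PARAMS, "float32")
print(f"bfloat16 weights: {total_bf16:.1f} GiB")
print(f"float32 weights:  {total_fp32:.1f} GiB")
```

Under these assumptions the weights alone come to about 15.5 GiB in `bfloat16`, so a 16 GB card would leave very little headroom for everything else.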

## Current Status

The GitHub repository was created in May 2026 and last updated in early June 2026; repository metadata reported 1,662 stars and 134 forks. The project is released under a Modified MIT License whose commercial attribution clause applies to products exceeding 50 million monthly active users or USD 10 million in monthly revenue.

## Features
- 8B parameter text-to-speech model
- High-quality conversational speech generation
- Voice cloning via prompted generation
- RVQ Transformer architecture
- Llama 8B backbone with Llama 300M audio decoder
- Mimi audio tokenizer
- 32 audio codebooks
- SilentCipher audio watermarking
- Hugging Face model hosting
- Local inference support
- Python API
- English language support

## Integrations
Hugging Face Hub, PyTorch, torchaudio, SilentCipher (Sony), uv

## Platforms
CLI, API, Developer SDK

## Pricing
Open Source

## Version
8B

## Links
- Website: https://github.com/MisoLabsAI/MisoTTS
- Documentation: https://misolabs.ai/blog/miso-tts-8b
- Repository: https://github.com/MisoLabsAI/MisoTTS
- EveryDev.ai: https://www.everydev.ai/tools/miso-tts-8b
