Miso TTS 8B

Name: Miso TTS 8B
Availability: OnlineOnly
Author: Miso Labs

An 8-billion-parameter open-source text-to-speech model designed for high-quality, highly emotive conversational speech generation, with voice cloning support.

At a Glance

Pricing

Open Source

Free to use under Modified MIT License. Run locally or access via Hugging Face.

Available On

CLI

API

SDK

Miso Labs | San Francisco, CA | Est. 2025 | $8M raised

Listed Jun 2026

About Miso TTS 8B

Miso TTS 8B is an open-source, 8-billion-parameter text-to-dialogue model built by Miso Labs (Kamino Learning, Inc.) for high-quality conversational speech synthesis. The model is available on GitHub and Hugging Face and can be run locally on CUDA-capable hardware. A live demo is hosted on the Miso Labs landing page at misolabs.ai.

What It Is

Miso TTS 8B is an RVQ (Residual Vector Quantization) Transformer text-to-speech model inspired by the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context, which makes it suited to conversational speech generation rather than single-utterance synthesis. The model currently supports English only.

Architecture

The model uses two transformer components working in tandem:

Backbone transformer (Llama 8B): Consumes interleaved text and audio-frame embeddings, conditioning generation on conversation history.
Audio decoder transformer (Llama 300M): Autoregressively predicts higher-order audio codebooks within each frame.
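
The division of labor between the two transformers can be sketched with a toy loop. This is a minimal illustration, assuming (as in the Sesame CSM design that inspired the model) that the backbone produces the first codebook of each frame and the decoder fills in the remaining ones. The functions backbone_step and decoder_step are hypothetical stand-ins that return random codes; they are not the repository's API.

```python
import random

NUM_CODEBOOKS = 32   # from the listed model specs
AUDIO_VOCAB = 2051   # from the listed model specs


def backbone_step(history, rng):
    """Hypothetical stand-in for the backbone: returns codebook 0 of the next frame."""
    return rng.randrange(AUDIO_VOCAB)


def decoder_step(frame_so_far, rng):
    """Hypothetical stand-in for the audio decoder: returns the next codebook
    of the current frame, given the codebooks chosen so far."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frames(num_frames, seed=0):
    rng = random.Random(seed)
    history = []  # previously generated frames act as conversation context
    for _ in range(num_frames):
        # Assumption: one backbone step per frame yields the first codebook.
        frame = [backbone_step(history, rng)]
        # The decoder then runs autoregressively within the frame.
        while len(frame) < NUM_CODEBOOKS:
            frame.append(decoder_step(frame, rng))
        history.append(frame)
    return history


frames = generate_frames(3)
print(len(frames), len(frames[0]))
```

The outer loop advances once per audio frame, while the inner loop advances once per codebook, which is why the smaller decoder handles the per-frame work.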

Key model specs include a text vocabulary of 128,256 tokens, an audio vocabulary of 2,051 tokens, 32 audio codebooks, the Mimi audio tokenizer, and a maximum sequence length of 2,048. Default inference uses torch.bfloat16 precision.
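
To make the sequence limit concrete, the listed specs can be collected into a small config and used for a rough audio budget. This is a sketch under stated assumptions: the class name MisoSpecs and the helper audio_seconds_budget are hypothetical, each audio frame is assumed to occupy one sequence position, and the frame rate is assumed to be Mimi's standard 12.5 frames per second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisoSpecs:
    """Hypothetical container for the specs listed above (not the repo's config class)."""
    text_vocab_size: int = 128_256
    audio_vocab_size: int = 2_051
    audio_num_codebooks: int = 32
    max_seq_len: int = 2_048
    # Assumption: Mimi's standard frame rate; not stated in the listing.
    frames_per_second: float = 12.5


def audio_seconds_budget(specs, text_tokens):
    """Rough seconds of audio that fit after the text tokens.

    Assumes one sequence position per audio frame.
    """
    remaining = specs.max_seq_len - text_tokens
    if remaining < 0:
        raise ValueError("text alone exceeds the maximum sequence length")
    return remaining / specs.frames_per_second


specs = MisoSpecs()
budget = audio_seconds_budget(specs, text_tokens=100)
codes_per_second = specs.audio_num_codebooks * specs.frames_per_second
print(f"{budget:.2f} s of audio, {codes_per_second:.0f} codes per second")
```

The budget covers prompt audio and generated audio together, so a long voice-cloning prompt leaves less room for new speech.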

Voice Cloning and Prompted Generation

Miso TTS 8B supports optional prompted generation, allowing the model to condition on prior audio for voice cloning. Users supply a Segment object containing a speaker ID, transcript, and audio waveform as context. Without a prompt, the model generates speech from text alone. Generated audio is watermarked by default using the SilentCipher watermarking model from Sony.
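
The shape of a prompted request can be sketched as follows. This is an illustration only: the Segment dataclass below mirrors the three fields described above, but its field names, the list-of-floats audio type, and the make_prompt helper are assumptions rather than the repository's actual definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Illustrative stand-in for the context object described above."""
    speaker: int        # speaker ID
    text: str           # transcript of the audio
    audio: List[float]  # mono waveform samples (a real pipeline would likely use a tensor)


def make_prompt(context, speaker, text):
    """Hypothetical helper: validate context segments and bundle a request."""
    for seg in context:
        if not seg.text.strip():
            raise ValueError("each context segment needs a transcript")
        if not seg.audio:
            raise ValueError("each context segment needs audio samples")
    return {"context": list(context), "speaker": speaker, "text": text}


# Prompted generation: one reference segment supplies the voice to clone.
reference = Segment(speaker=0, text="Thanks for calling, how can I help?",
                    audio=[0.0] * 24000)  # placeholder samples
cloned = make_prompt([reference], speaker=0, text="Your order shipped this morning.")

# Unprompted generation: empty context, speech from text alone.
plain = make_prompt([], speaker=0, text="Your order shipped this morning.")
```

Pairing each waveform with its transcript matters because the model conditions on interleaved text and audio, not on audio alone.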

Setup Path

The repository supports two installation paths:

uv (recommended): Clone the repo, run uv sync --python 3.10, activate the virtual environment, and run uv run python run_misotts.py.
pip: Create a Python 3.10 venv, install with pip install -e ., and run python run_misotts.py.

Model weights are hosted publicly on Hugging Face at MisoLabs/MisoTTS and are downloaded automatically on first run via the Hugging Face Hub cache.
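
Although the weights download automatically on first run, they can also be fetched ahead of time, for example before moving to a machine with limited connectivity. The sketch below uses the huggingface_hub library's snapshot_download function with the repository ID given above; the wrapper name prefetch_weights is hypothetical, and whether run_misotts.py reads from the same cache location is an assumption.

```python
REPO_ID = "MisoLabs/MisoTTS"  # repository ID as given in the listing


def prefetch_weights(repo_id=REPO_ID):
    """Download the model files into the local Hugging Face Hub cache.

    Returns the local directory containing the downloaded snapshot.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


# Usage (performs a large download, so it is not run here):
#   local_dir = prefetch_weights()
#   print(local_dir)
```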

Deployment Notes and Safety

The model requires a CUDA GPU with sufficient VRAM for the checkpoint precision being loaded. The repository notes that Miso TTS 8B is a large model and recommends GPU inference for best results. The project's safety guidelines explicitly prohibit using the model to impersonate people, create deceptive audio, commit fraud, or generate harmful content. Deployers are advised to use their own private watermark key.
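
A back-of-the-envelope estimate shows why precision drives the VRAM requirement. The sketch below counts weight memory only, assuming nominal parameter counts of 8 billion for the backbone and 300 million for the decoder (taken from the component names, not from measured checkpoint sizes). Activations, the key-value cache, the Mimi tokenizer, and the watermarking model all add to the real footprint.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: nominal sizes implied by "Llama 8B" and "Llama 300M".
BACKBONE_PARAMS = 8_000_000_000
DECODER_PARAMS = 300_000_000

total_params = BACKBONE_PARAMS + DECODER_PARAMS
bf16_gib = weight_memory_gib(total_params, "bfloat16")
fp32_gib = weight_memory_gib(total_params, "float32")
print(f"bfloat16: {bf16_gib:.1f} GiB, float32: {fp32_gib:.1f} GiB")
```

Under these assumptions the default bfloat16 weights need roughly half the memory of a float32 load, which is the difference between fitting on a single 24 GB card and not.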

Current Status

The GitHub repository was created in May 2026 and last updated in early June 2026, with 1,662 stars and 134 forks as reported by the repository metadata. The project is released under a Modified MIT License, with a commercial attribution clause applying to products exceeding 50 million monthly active users or $10 million USD in monthly revenue.

Pricing

Open Source

Full model weights on Hugging Face
Local inference via Python
Voice cloning support
Audio watermarking
Commercial use allowed with attribution for large-scale deployments

Capabilities

Key Features

8B parameter text-to-speech model
High-quality conversational speech generation
Voice cloning via prompted generation
RVQ Transformer architecture
Llama 8B backbone with Llama 300M audio decoder
Mimi audio tokenizer
32 audio codebooks
SilentCipher audio watermarking
Hugging Face model hosting
Local inference support
Python API
English language support

Integrations

Hugging Face Hub

PyTorch

torchaudio

SilentCipher (Sony)

API Available
