# Parlor

> On-device, real-time multimodal AI that enables natural voice and vision conversations, running entirely on your local machine with Gemma 4 E2B and Kokoro TTS.

Parlor is an open-source, on-device multimodal AI assistant that lets you hold real-time voice and vision conversations without any cloud dependency. It uses Google's Gemma 4 E2B model for speech and vision understanding and Kokoro for text-to-speech, all running locally on Apple Silicon or on Linux with a supported GPU. The project is designed to eliminate server costs for AI-powered language learning and conversation, with an end-to-end latency of roughly 2.5–3.0 seconds on an Apple M3 Pro.

- **On-device inference**: Runs entirely on your local machine using LiteRT-LM (GPU) for Gemma 4 E2B and MLX (Mac) or ONNX (Linux) for Kokoro TTS; no cloud API calls required.
- **Real-time voice activity detection**: Uses Silero VAD in the browser for hands-free conversation with no push-to-talk.
- **Barge-in support**: Interrupt the AI mid-sentence by speaking, enabling natural conversational flow.
- **Sentence-level TTS streaming**: Audio playback begins before the full response is generated, reducing perceived latency.
- **Multimodal vision + speech**: Point your camera at objects and discuss them in real time; the model processes audio and JPEG video frames simultaneously.
- **Multilingual support**: Gemma 4 E2B supports multiple languages, so users can fall back to their native language during conversations.
- **FastAPI WebSocket backend**: A lightweight Python server ingests PCM audio and JPEG frames over WebSocket and streams audio chunks back to the browser.
- **Quick start with uv**: Clone the repo, run `uv sync` and `uv run server.py`, then open `http://localhost:8000`; models (~2.6 GB) download automatically on first run.
- **Configurable model path and port**: Set the `MODEL_PATH` environment variable to use a local model file and `PORT` to change the server port.
- **Apache 2.0 licensed**: Free to use, modify, and distribute.

## Features

- On-device real-time multimodal AI
- Voice and vision conversations
- Gemma 4 E2B model integration
- Kokoro TTS (MLX on Mac, ONNX on Linux)
- Browser-based voice activity detection (Silero VAD)
- Barge-in interruption support
- Sentence-level TTS streaming
- FastAPI WebSocket server
- Automatic model download on first run
- Multilingual support
- No cloud dependency
- Configurable model path and server port

## Integrations

Gemma 4 E2B (Google DeepMind), LiteRT-LM (Google AI Edge), Kokoro TTS (Hexgrad), Silero VAD, HuggingFace, MLX, ONNX, FastAPI

## Platforms

macOS, Linux, Web, API, CLI

## Pricing

Open Source

## Links

- Website: https://github.com/fikrikarim/parlor
- Documentation: https://github.com/fikrikarim/parlor
- Repository: https://github.com/fikrikarim/parlor
- EveryDev.ai: https://www.everydev.ai/tools/parlor
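The quick-start commands and environment variables described above can be put together as a shell session. The repo URL comes from the Links section; `/path/to/model` and the port value `9000` are placeholders, not values from the project:

```shell
# Clone and run Parlor locally (first run downloads ~2.6 GB of models)
git clone https://github.com/fikrikarim/parlor
cd parlor
uv sync              # install dependencies
uv run server.py     # then open http://localhost:8000 in a browser

# Optional overrides via environment variables:
MODEL_PATH=/path/to/model PORT=9000 uv run server.py
```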
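Sentence-level TTS streaming, as described in the feature list, means handing each finished sentence to the TTS engine while the model is still generating. A minimal sketch of that idea (not Parlor's actual code; the function name and regex are illustrative assumptions):

```python
import re

# Split a streaming LLM response into complete sentences so TTS playback
# can start before the full response has been generated.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(token_stream):
    """Yield each complete sentence as soon as it appears in the stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split off every finished sentence; keep the unfinished tail.
        parts = _SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:
            yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():  # flush whatever remains when the stream ends
        yield buffer.strip()

# Each yielded sentence would be synthesized by Kokoro immediately,
# while later tokens are still arriving from the model.
tokens = ["Hello", " there!", " How are", " you today?", " Fine."]
print(list(stream_sentences(tokens)))
```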
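The barge-in feature above amounts to a small piece of control logic: if voice activity is detected while TTS audio is playing, queued audio is dropped and the pipeline goes back to listening. A hypothetical sketch of that state machine (class and method names are illustrative, not Parlor's API):

```python
from dataclasses import dataclass, field

@dataclass
class BargeInController:
    """Cancel TTS playback when the user starts speaking over it."""
    speaking: bool = False                                # TTS currently playing
    pending: list = field(default_factory=list)           # queued audio chunks
    cancelled: list = field(default_factory=list)         # chunks dropped on barge-in

    def start_playback(self, chunks):
        self.speaking = True
        self.pending = list(chunks)

    def on_vad_speech(self):
        """Called when the browser's VAD reports user speech."""
        if self.speaking:
            # Barge-in: discard all queued audio and stop playback at once.
            self.cancelled = self.pending
            self.pending = []
            self.speaking = False
            return "interrupted"
        return "listening"

ctrl = BargeInController()
ctrl.start_playback([b"chunk1", b"chunk2"])
print(ctrl.on_vad_speech())   # user spoke mid-playback: playback is cancelled
print(ctrl.on_vad_speech())   # nothing playing: stay in listening state
```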