# Parlor

> On-device, real-time multimodal AI that enables natural voice and vision conversations, running entirely on your local machine with Gemma 4 E2B and Kokoro TTS.

Parlor is an open-source, on-device multimodal AI assistant that lets you hold real-time voice and vision conversations without any cloud dependency. It uses Google's Gemma 4 E2B model for speech and vision understanding and Kokoro for text-to-speech, all running locally on Apple Silicon or on Linux with a supported GPU. The project is designed to eliminate server costs for AI-powered language learning and conversation, with an end-to-end latency of roughly 2.5–3.0 seconds on an Apple M3 Pro.

- **On-device inference**: Runs entirely on your local machine using LiteRT-LM (GPU) for Gemma 4 E2B and MLX (Mac) or ONNX (Linux) for Kokoro TTS; no cloud API calls required.
- **Real-time voice activity detection**: Uses Silero VAD in the browser for hands-free conversation with no push-to-talk.
- **Barge-in support**: Interrupt the AI mid-sentence by speaking, enabling natural conversational flow.
- **Sentence-level TTS streaming**: Audio playback begins before the full response is generated, reducing perceived latency.
- **Multimodal vision + speech**: Point your camera at objects and discuss them in real time; the model processes audio and JPEG video frames simultaneously.
- **Multilingual support**: Gemma 4 E2B supports multiple languages, so users can fall back to their native language during conversations.
- **FastAPI WebSocket backend**: A lightweight Python server ingests PCM audio and JPEG frames over WebSocket and streams audio chunks back to the browser.
- **Quick start with uv**: Clone the repo, run `uv sync` and `uv run server.py`, then open `http://localhost:8000`; models (~2.6 GB) download automatically on first run.
- **Configurable model path and port**: Set the `MODEL_PATH` environment variable to use a local model file and `PORT` to change the server port.
- **Apache 2.0 licensed**: Free to use, modify, and distribute.

## Features

- On-device real-time multimodal AI
- Voice and vision conversations
- Gemma 4 E2B model integration
- Kokoro TTS (MLX on Mac, ONNX on Linux)
- Browser-based voice activity detection (Silero VAD)
- Barge-in interruption support
- Sentence-level TTS streaming
- FastAPI WebSocket server
- Automatic model download on first run
- Multilingual support
- No cloud dependency
- Configurable model path and server port

## Integrations

Gemma 4 E2B (Google DeepMind), LiteRT-LM (Google AI Edge), Kokoro TTS (Hexgrad), Silero VAD, HuggingFace, MLX, ONNX, FastAPI

## Platforms

macOS, Linux, Web, API, CLI

## Pricing

Open Source

## Links

- Website: https://github.com/fikrikarim/parlor
- Documentation: https://github.com/fikrikarim/parlor
- Repository: https://github.com/fikrikarim/parlor
- EveryDev.ai: https://www.everydev.ai/tools/parlor
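The quick-start commands and environment variables described above can be put together as a shell session. The repo URL comes from the Links section; `/path/to/model` and the port value `9000` are placeholders, not values from the project:

```shell
# Clone and run Parlor locally (first run downloads ~2.6 GB of models)
git clone https://github.com/fikrikarim/parlor
cd parlor
uv sync              # install dependencies
uv run server.py     # then open http://localhost:8000 in a browser

# Optional overrides via environment variables:
MODEL_PATH=/path/to/model PORT=9000 uv run server.py
```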
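Sentence-level TTS streaming, as described in the feature list, means handing each finished sentence to the TTS engine while the model is still generating. A minimal sketch of that idea (not Parlor's actual code; the function name and regex are illustrative assumptions):

```python
import re

# Split a streaming LLM response into complete sentences so TTS playback
# can start before the full response has been generated.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(token_stream):
    """Yield each complete sentence as soon as it appears in the stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split off every finished sentence; keep the unfinished tail.
        parts = _SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:
            yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():  # flush whatever remains when the stream ends
        yield buffer.strip()

# Each yielded sentence would be synthesized by Kokoro immediately,
# while later tokens are still arriving from the model.
tokens = ["Hello", " there!", " How are", " you today?", " Fine."]
print(list(stream_sentences(tokens)))
```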
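The barge-in feature above amounts to a small piece of control logic: if voice activity is detected while TTS audio is playing, queued audio is dropped and the pipeline goes back to listening. A hypothetical sketch of that state machine (class and method names are illustrative, not Parlor's API):

```python
from dataclasses import dataclass, field

@dataclass
class BargeInController:
    """Cancel TTS playback when the user starts speaking over it."""
    speaking: bool = False                                # TTS currently playing
    pending: list = field(default_factory=list)           # queued audio chunks
    cancelled: list = field(default_factory=list)         # chunks dropped on barge-in

    def start_playback(self, chunks):
        self.speaking = True
        self.pending = list(chunks)

    def on_vad_speech(self):
        """Called when the browser's VAD reports user speech."""
        if self.speaking:
            # Barge-in: discard all queued audio and stop playback at once.
            self.cancelled = self.pending
            self.pending = []
            self.speaking = False
            return "interrupted"
        return "listening"

ctrl = BargeInController()
ctrl.start_playback([b"chunk1", b"chunk2"])
print(ctrl.on_vad_speech())   # user spoke mid-playback: playback is cancelled
print(ctrl.on_vad_speech())   # nothing playing: stay in listening state
```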