Rapid-MLX
The fastest local AI inference engine for Apple Silicon Macs, offering an OpenAI-compatible API, 17 tool call parsers, a prompt cache, and 2-4x faster throughput than Ollama.
At a Glance
About Rapid-MLX
Rapid-MLX is an open-source local AI inference server built specifically for Apple Silicon Macs, leveraging Apple's MLX framework for maximum performance. It provides a drop-in OpenAI-compatible API that works with Cursor, Claude Code, Aider, LangChain, PydanticAI, and any OpenAI-compatible application. With 2-4x faster throughput than Ollama and llama.cpp on most models, it delivers frontier-level AI locally with no cloud costs or API keys required. The project is licensed under Apache 2.0 and supports models ranging from 4B to 158B parameters.
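Because the server exposes a drop-in OpenAI-compatible API at localhost:8000/v1, existing OpenAI client code only needs its base URL changed. The following is a minimal sketch using the official OpenAI Python SDK; the model name is a placeholder for whichever model you served with rapid-mlx serve, and the API key is a dummy value unless API key authentication has been enabled on the server.

# Minimal sketch: talking to a locally running Rapid-MLX server through the
# standard OpenAI Python SDK. Assumes the server was started with
# `rapid-mlx serve <model>` and is listening on the default localhost:8000/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # Rapid-MLX's OpenAI-compatible endpoint
    api_key="not-needed-locally",         # dummy key unless API key auth is enabled
)

response = client.chat.completions.create(
    model="your-local-model",             # placeholder for the model you served
    messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
)
print(response.choices[0].message.content)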
- OpenAI-Compatible API — Install via pip install rapid-mlx or Homebrew, then run rapid-mlx serve <model> to start a server at localhost:8000/v1 that any OpenAI-compatible app can use immediately.
- 17 Tool Call Parsers — Supports Hermes, Qwen, DeepSeek, Llama, Mistral, GLM, MiniMax, Kimi, and more, with automatic recovery when quantized models produce broken tool call output (see the tool-calling sketch after this list).
- Prompt Cache — KV cache trimming for transformer models and DeltaNet RNN state snapshots for hybrid models (Qwen3.5), delivering 2-5x faster time to first token on subsequent turns.
- Reasoning Separation — Chain-of-thought output from models like Qwen3 and DeepSeek-R1 is cleanly separated into a reasoning_content field and streamed independently of the main response (see the streaming sketch after this list).
- Smart Cloud Routing — Automatically offloads large-context requests to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be too slow, configurable via --cloud-model and --cloud-threshold.
- Multimodal Support — Vision (Gemma 4, Qwen-VL), audio TTS/STT, video understanding, and text embeddings, all served through the same OpenAI-compatible API with optional extras.
- Model-Harness Index (MHI) — Built-in benchmark combining tool calling (50%), HumanEval (30%), and MMLU (20%) to measure real-world agent performance across 25 model-harness combinations.
- Wide Client Compatibility — Tested and documented setup for Cursor, Continue.dev, Aider, Open WebUI, LibreChat, PydanticAI, smolagents, LangChain, Hermes Agent, and more.
- Self-Diagnostics — Run rapid-mlx doctor to verify Metal GPU availability, imports, the CLI, and model loading without needing developer tools.
- 2100+ Tests — Comprehensive pytest unit suite plus stress, soak, and multi-model regression harnesses for production-grade reliability.
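To make the tool call parsers concrete, here is a hedged sketch of a standard OpenAI-style tool-calling request against the local server. The get_weather tool and model name are illustrative placeholders, not part of Rapid-MLX; the server's parser for the served model (Hermes, Qwen, DeepSeek, and so on) is what turns the model's raw output into the structured tool_calls read at the end.

# Hedged sketch of a tool-calling round trip over the OpenAI-compatible API.
# The tool definition follows the standard OpenAI "tools" schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative placeholder tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-local-model",  # placeholder
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# Print any structured tool calls the parser recovered from the model output.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)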
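Reasoning separation can be consumed from the same SDK. The sketch below assumes a reasoning-capable model is being served and reads the reasoning_content field from each streamed delta alongside the normal content; getattr is used because the stock OpenAI SDK types do not declare that extra field, and the model name is again a placeholder.

# Hedged sketch of reading separated chain-of-thought while streaming.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

stream = client.chat.completions.create(
    model="your-local-reasoning-model",  # placeholder, e.g. a Qwen3 or DeepSeek-R1 variant
    messages=[{"role": "user", "content": "Is 9973 prime?"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content carries the chain-of-thought, streamed separately
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"[thinking] {reasoning}", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)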
Pricing
Open Source
Fully free and open-source under Apache License 2.0. No cost to use, modify, or distribute.
- Full local AI inference on Apple Silicon
- OpenAI-compatible API
- 17 tool call parsers
- Prompt cache (KV + DeltaNet snapshots)
- Vision, audio, and embeddings support
Capabilities
Key Features
- OpenAI-compatible REST API
- 17 tool call parsers with auto-recovery
- Prompt cache (KV + DeltaNet RNN state snapshots)
- Reasoning separation for chain-of-thought models
- Smart cloud routing for large-context requests
- Vision/multimodal support (Gemma 4, Qwen-VL)
- Audio TTS/STT via mlx-audio
- Text embeddings endpoint (see the embeddings sketch after this list)
- Continuous batching
- KV cache quantization
- TurboQuant V-cache compression
- Tool logits bias for jump-forward decoding
- MCP configuration support
- Gradio chat UI (optional)
- Schema-constrained JSON output (outlines)
- Built-in self-diagnostics (rapid-mlx doctor)
- Model-Harness Index (MHI) benchmarking
- 2100+ test suite
- Homebrew and pip installation
- Rate limiting and API key authentication
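Since embeddings are served through the same OpenAI-compatible API, the standard embeddings call should work unchanged. A minimal sketch, assuming an embedding-capable model is being served; the model name below is a placeholder.

# Hedged sketch: requesting text embeddings from the OpenAI-compatible
# /v1/embeddings endpoint of a local Rapid-MLX server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

result = client.embeddings.create(
    model="your-local-embedding-model",  # placeholder
    input=["Apple Silicon", "MLX framework"],
)
print(len(result.data), "vectors,", len(result.data[0].embedding), "dimensions")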
