# Rapid-MLX

> The fastest local AI inference engine for Apple Silicon Macs, offering an OpenAI-compatible API, 17 tool-call parsers, prompt caching, and 2-4x higher throughput than Ollama.

Rapid-MLX is an open-source local AI inference server built specifically for Apple Silicon Macs, leveraging Apple's MLX framework for maximum performance. It provides a drop-in OpenAI-compatible API that works with Cursor, Claude Code, Aider, LangChain, PydanticAI, and any other OpenAI-compatible application. With 2-4x faster throughput than Ollama and llama.cpp on most models, it delivers frontier-level AI locally with no cloud costs or API keys required. The project is licensed under Apache 2.0 and supports models ranging from 4B to 158B parameters.

- **OpenAI-Compatible API** — *Install via `pip install rapid-mlx` or Homebrew, then run `rapid-mlx serve <model>` to start a server at `localhost:8000/v1` that any OpenAI-compatible app can use immediately (see the usage sketch after this list).*
- **17 Tool Call Parsers** — *Supports Hermes, Qwen, DeepSeek, Llama, Mistral, GLM, MiniMax, Kimi, and more, with automatic recovery when quantized models produce broken tool call output.*
- **Prompt Cache** — *KV cache trimming for transformer models and DeltaNet RNN state snapshots for hybrid models (Qwen3.5), delivering 2-5x faster Time To First Token on subsequent turns.*
- **Reasoning Separation** — *Chain-of-thought reasoning from models like Qwen3 and DeepSeek-R1 is cleanly separated into a `reasoning_content` field, streamed independently from the main response.*
- **Smart Cloud Routing** — *Automatically offloads large-context requests to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be too slow, configurable via `--cloud-model` and `--cloud-threshold`.*
- **Multimodal Support** — *Vision (Gemma 4, Qwen-VL), audio TTS/STT, video understanding, and text embeddings all served through the same OpenAI-compatible API with optional extras.*
- **Model-Harness Index (MHI)** — *Built-in benchmark combining tool calling (50%), HumanEval (30%), and MMLU (20%) to measure real-world agent performance across 25 model-harness combinations (weighting illustrated after this list).*
- **Wide Client Compatibility** — *Tested and documented setup for Cursor, Continue.dev, Aider, Open WebUI, LibreChat, PydanticAI, smolagents, LangChain, Hermes Agent, and more.*
- **Self-Diagnostics** — *Run `rapid-mlx doctor` to verify Metal GPU availability, imports, CLI, and model loading without needing developer tools.*
- **2100+ Tests** — *Comprehensive pytest unit suite plus stress, soak, and multi-model regression harnesses for production-grade reliability.*
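
Because the server exposes a standard OpenAI-compatible endpoint, the official `openai` Python SDK works against it directly. Below is a minimal usage sketch: the base URL comes from the server described above, but the model name is a placeholder and the exact shape of the streamed `reasoning_content` delta is an assumption based on the feature description, so adjust both to your setup.

```python
# Minimal sketch: point the official OpenAI Python SDK at a local Rapid-MLX server.
# Assumes the server is already running, e.g. `rapid-mlx serve <model>`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local-model",  # placeholder; use the model you actually served
    messages=[{"role": "user", "content": "Summarize the MLX framework in two sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Chain-of-thought arrives in a separate `reasoning_content` field
    # (field name per the feature list; the exact delta shape may vary).
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```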
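
The MHI weighting quoted above reduces to simple arithmetic. The sketch below only reproduces the stated 50/30/20 split; the input scores are made-up example values, and any normalization Rapid-MLX applies on top of the weights is not shown.

```python
# Illustrative only: the 50/30/20 Model-Harness Index weighting.
# The scores passed in are invented example values, not published results.
def mhi(tool_calling: float, humaneval: float, mmlu: float) -> float:
    """Weighted MHI score; inputs are benchmark scores as fractions in [0, 1]."""
    return 0.5 * tool_calling + 0.3 * humaneval + 0.2 * mmlu

print(mhi(tool_calling=0.82, humaneval=0.74, mmlu=0.68))  # ≈ 0.768
```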

## Features
- OpenAI-compatible REST API
- 17 tool call parsers with auto-recovery
- Prompt cache (KV + DeltaNet RNN state snapshots)
- Reasoning separation for chain-of-thought models
- Smart cloud routing for large-context requests
- Vision/multimodal support (Gemma 4, Qwen-VL)
- Audio TTS/STT via mlx-audio
- Text embeddings endpoint
- Continuous batching
- KV cache quantization
- TurboQuant V-cache compression
- Tool logits bias for jump-forward decoding
- MCP configuration support
- Gradio chat UI (optional)
- Schema-constrained JSON output via outlines (see the sketch after this list)
- Built-in self-diagnostics (`rapid-mlx doctor`)
- Model-Harness Index (MHI) benchmarking
- 2100+ test suite
- Homebrew and pip installation
- Rate limiting and API key authentication
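
For the schema-constrained JSON output listed above, a hedged sketch follows. It assumes the server honors the standard OpenAI `response_format` parameter to drive the outlines-based constrained decoding; the exact parameter name and schema wiring may differ, so check the project documentation.

```python
# Hedged sketch: request JSON conforming to a schema.
# Assumes the standard OpenAI `response_format` parameter is honored;
# consult the Rapid-MLX docs for the exact mechanism (outlines-based).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

resp = client.chat.completions.create(
    model="local-model",  # placeholder; use the model you actually served
    messages=[{"role": "user", "content": "Name a sci-fi novel and its publication year."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "novel", "schema": schema},
    },
)

print(json.loads(resp.choices[0].message.content))
```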

## Integrations
Cursor, Claude Code, Aider, Continue.dev, Open WebUI, LibreChat, LangChain, PydanticAI, smolagents, Hermes Agent, OpenClaude, Goose, Claw Code, Anthropic SDK, OpenAI SDK, HuggingFace, Ollama (comparison), MCP (Model Context Protocol), LiteLLM, Gradio

## Platforms
macOS, API, VS Code extension, JetBrains plugin, CLI

## Pricing
Open Source

## Version
v0.6.15

## Links
- Website: https://pypi.org/project/rapid-mlx
- Documentation: https://github.com/raullenchai/Rapid-MLX
- Repository: https://github.com/raullenchai/Rapid-MLX
- EveryDev.ai: https://www.everydev.ai/tools/rapid-mlx
