Rapid-MLX
The fastest local AI inference engine for Apple Silicon Macs, offering an OpenAI-compatible API, 17 tool call parsers, a prompt cache, and 2-4x faster throughput than Ollama.
At a Glance
About Rapid-MLX
Rapid-MLX is an open-source local AI inference server built specifically for Apple Silicon Macs, leveraging Apple's MLX framework for maximum performance. It provides a drop-in OpenAI-compatible API that works with Cursor, Claude Code, Aider, LangChain, PydanticAI, and any OpenAI-compatible application. With 2-4x faster throughput than Ollama and llama.cpp on most models, it delivers frontier-level AI locally with no cloud costs or API keys required. The project is licensed under Apache 2.0 and supports models ranging from 4B to 158B parameters.
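Because the server exposes a drop-in OpenAI-compatible API at localhost:8000/v1, existing OpenAI client code only needs its base URL changed. The following is a minimal sketch using the official OpenAI Python SDK; the model name is a placeholder for whichever model you served with rapid-mlx serve, and the API key is a dummy value unless API key authentication has been enabled on the server.

# Minimal sketch: talking to a locally running Rapid-MLX server through the
# standard OpenAI Python SDK. Assumes the server was started with
# `rapid-mlx serve <model>` and is listening on the default localhost:8000/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # Rapid-MLX's OpenAI-compatible endpoint
    api_key="not-needed-locally",         # dummy key unless API key auth is enabled
)

response = client.chat.completions.create(
    model="your-local-model",             # placeholder for the model you served
    messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
)
print(response.choices[0].message.content)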
- OpenAI-Compatible API — Install via pip install rapid-mlx or Homebrew, then run rapid-mlx serve <model> to start a server at localhost:8000/v1 that any OpenAI-compatible app can use immediately.
- 17 Tool Call Parsers — Supports Hermes, Qwen, DeepSeek, Llama, Mistral, GLM, MiniMax, Kimi, and more, with automatic recovery when quantized models produce broken tool call output (see the tool-calling sketch after this list).
- Prompt Cache — KV cache trimming for transformer models and DeltaNet RNN state snapshots for hybrid models (Qwen3.5), delivering 2-5x faster time to first token on subsequent turns.
- Reasoning Separation — Chain-of-thought output from models like Qwen3 and DeepSeek-R1 is cleanly separated into a reasoning_content field and streamed independently of the main response (see the streaming sketch after this list).
- Smart Cloud Routing — Automatically offloads large-context requests to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be too slow, configurable via --cloud-model and --cloud-threshold.
- Multimodal Support — Vision (Gemma 4, Qwen-VL), audio TTS/STT, video understanding, and text embeddings, all served through the same OpenAI-compatible API with optional extras.
- Model-Harness Index (MHI) — Built-in benchmark combining tool calling (50%), HumanEval (30%), and MMLU (20%) to measure real-world agent performance across 25 model-harness combinations.
- Wide Client Compatibility — Tested and documented setup for Cursor, Continue.dev, Aider, Open WebUI, LibreChat, PydanticAI, smolagents, LangChain, Hermes Agent, and more.
- Self-Diagnostics — Run rapid-mlx doctor to verify Metal GPU availability, imports, the CLI, and model loading without needing developer tools.
- 2100+ Tests — Comprehensive pytest unit suite plus stress, soak, and multi-model regression harnesses for production-grade reliability.
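To make the tool call parsers concrete, here is a hedged sketch of a standard OpenAI-style tool-calling request against the local server. The get_weather tool and model name are illustrative placeholders, not part of Rapid-MLX; the server's parser for the served model (Hermes, Qwen, DeepSeek, and so on) is what turns the model's raw output into the structured tool_calls read at the end.

# Hedged sketch of a tool-calling round trip over the OpenAI-compatible API.
# The tool definition follows the standard OpenAI "tools" schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative placeholder tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-local-model",  # placeholder
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# Print any structured tool calls the parser recovered from the model output.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)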
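Reasoning separation can be consumed from the same SDK. The sketch below assumes a reasoning-capable model is being served and reads the reasoning_content field from each streamed delta alongside the normal content; getattr is used because the stock OpenAI SDK types do not declare that extra field, and the model name is again a placeholder.

# Hedged sketch of reading separated chain-of-thought while streaming.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

stream = client.chat.completions.create(
    model="your-local-reasoning-model",  # placeholder, e.g. a Qwen3 or DeepSeek-R1 variant
    messages=[{"role": "user", "content": "Is 9973 prime?"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content carries the chain-of-thought, streamed separately
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"[thinking] {reasoning}", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)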
Pricing
Open Source
Fully free and open-source under Apache License 2.0. No cost to use, modify, or distribute.
- Full local AI inference on Apple Silicon
- OpenAI-compatible API
- 17 tool call parsers
- Prompt cache (KV + DeltaNet snapshots)
- Vision, audio, and embeddings support
Capabilities
Key Features
- OpenAI-compatible REST API
- 17 tool call parsers with auto-recovery
- Prompt cache (KV + DeltaNet RNN state snapshots)
- Reasoning separation for chain-of-thought models
- Smart cloud routing for large-context requests
- Vision/multimodal support (Gemma 4, Qwen-VL)
- Audio TTS/STT via mlx-audio
- Text embeddings endpoint (see the embeddings sketch after this list)
- Continuous batching
- KV cache quantization
- TurboQuant V-cache compression
- Tool logits bias for jump-forward decoding
- MCP configuration support
- Gradio chat UI (optional)
- Schema-constrained JSON output (outlines)
- Built-in self-diagnostics (rapid-mlx doctor)
- Model-Harness Index (MHI) benchmarking
- 2100+ test suite
- Homebrew and pip installation
- Rate limiting and API key authentication
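Since embeddings are served through the same OpenAI-compatible API, the standard embeddings call should work unchanged. A minimal sketch, assuming an embedding-capable model is being served; the model name below is a placeholder.

# Hedged sketch: requesting text embeddings from the OpenAI-compatible
# /v1/embeddings endpoint of a local Rapid-MLX server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

result = client.embeddings.create(
    model="your-local-embedding-model",  # placeholder
    input=["Apple Silicon", "MLX framework"],
)
print(len(result.data), "vectors,", len(result.data[0].embedding), "dimensions")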
