    Rapid-MLX

    Local Inference

    The fastest local AI inference engine for Apple Silicon Macs, offering an OpenAI-compatible API, 17 tool call parsers, a prompt cache, and 2-4x faster speeds than Ollama.

    At a Glance

    Pricing: Open Source (Apache License 2.0)

    Available On: macOS · API · VS Code · JetBrains · CLI

    Resources: Website · Docs · GitHub · llms.txt

    Topics: Local Inference · AI Infrastructure · LLM Orchestration

    Alternatives: Synthetic · Bodega Inference Engine · Lemonade

    Developer: raullenchai · San Francisco Bay Area, CA · Est. 2025

    Listed May 2026

    About Rapid-MLX

    Rapid-MLX is an open-source local AI inference server built specifically for Apple Silicon Macs, leveraging Apple's MLX framework for maximum performance. It provides a drop-in OpenAI-compatible API that works with Cursor, Claude Code, Aider, LangChain, PydanticAI, and any OpenAI-compatible application. With 2-4x faster throughput than Ollama and llama.cpp on most models, it delivers frontier-level AI locally with no cloud costs or API keys required. The project is licensed under Apache 2.0 and supports models ranging from 4B to 158B parameters.
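
    Because the server speaks the OpenAI wire protocol, any OpenAI SDK can talk to it by overriding the base URL. Here is a minimal sketch using the official openai Python client; the endpoint comes from the quickstart below, while the model ID and dummy API key are placeholders, not confirmed defaults.

        # Point the official OpenAI Python client at a local Rapid-MLX server.
        from openai import OpenAI

        client = OpenAI(
            base_url="http://localhost:8000/v1",  # Rapid-MLX's documented endpoint
            api_key="not-needed",                 # local server; any string works
        )

        response = client.chat.completions.create(
            model="qwen3-8b",  # placeholder; use the model you passed to `rapid-mlx serve`
            messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
        )
        print(response.choices[0].message.content)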

    • OpenAI-Compatible API — Install via pip install rapid-mlx or Homebrew, then rapid-mlx serve <model> to start a server at localhost:8000/v1 that any OpenAI-compatible app can use immediately.
    • 17 Tool Call Parsers — Supports Hermes, Qwen, DeepSeek, Llama, Mistral, GLM, MiniMax, Kimi, and more, with automatic recovery when quantized models produce broken tool call output.
    • Prompt Cache — KV cache trimming for transformer models and DeltaNet RNN state snapshots for hybrid models (Qwen3.5), delivering 2-5x faster Time To First Token on subsequent turns.
    • Reasoning Separation — Chain-of-thought reasoning from models like Qwen3 and DeepSeek-R1 is cleanly separated into a reasoning_content field, streamed independently from the main response (see the streaming sketch after this list).
    • Smart Cloud Routing — Automatically offloads large-context requests to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be too slow, configurable via --cloud-model and --cloud-threshold.
    • Multimodal Support — Vision (Gemma 4, Qwen-VL), audio TTS/STT, video understanding, and text embeddings all served through the same OpenAI-compatible API with optional extras.
    • Model-Harness Index (MHI) — Built-in benchmark combining tool calling (50%), HumanEval (30%), and MMLU (20%) to measure real-world agent performance across 25 model-harness combinations (the weighting is sketched after this list).
    • Wide Client Compatibility — Tested and documented setup for Cursor, Continue.dev, Aider, Open WebUI, LibreChat, PydanticAI, smolagents, LangChain, Hermes Agent, and more.
    • Self-Diagnostics — Run rapid-mlx doctor to verify Metal GPU availability, imports, CLI, and model loading without needing developer tools.
    • 2100+ Tests — Comprehensive pytest unit suite plus stress, soak, and multi-model regression harnesses for production-grade reliability.
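
    As a concrete illustration of the reasoning separation above: assuming Rapid-MLX exposes reasoning_content on streamed deltas (the convention DeepSeek's API popularized; the attribute placement here is an assumption, not confirmed behavior), a client can print the chain of thought and the final answer on separate channels:

        # Sketch: consume reasoning and answer as separate streams.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

        stream = client.chat.completions.create(
            model="deepseek-r1-8b",  # placeholder ID for a reasoning model
            messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
            stream=True,
        )

        for chunk in stream:
            delta = chunk.choices[0].delta
            reasoning = getattr(delta, "reasoning_content", None)  # may be absent
            if reasoning:
                print(f"[thinking] {reasoning}", end="", flush=True)
            elif delta.content:
                print(delta.content, end="", flush=True)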
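
    The MHI weighting itself reduces to a simple weighted average. A minimal sketch using the stated 50/30/20 split; the component scores below are invented for illustration:

        # Only the weights (50% tool calling, 30% HumanEval, 20% MMLU) come
        # from the description above; the scores are made up.
        def mhi(tool_calling: float, humaneval: float, mmlu: float) -> float:
            return 0.5 * tool_calling + 0.3 * humaneval + 0.2 * mmlu

        print(round(mhi(tool_calling=0.82, humaneval=0.74, mmlu=0.68), 3))  # 0.768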

    Pricing

    Open Source

    Fully free and open-source under Apache License 2.0. No cost to use, modify, or distribute.

    • Full local AI inference on Apple Silicon
    • OpenAI-compatible API
    • 17 tool call parsers
    • Prompt cache (KV + DeltaNet snapshots)
    • Vision, audio, and embeddings support

    Capabilities

    Key Features

    • OpenAI-compatible REST API
    • 17 tool call parsers with auto-recovery
    • Prompt cache (KV + DeltaNet RNN state snapshots)
    • Reasoning separation for chain-of-thought models
    • Smart cloud routing for large-context requests
    • Vision/multimodal support (Gemma 4, Qwen-VL)
    • Audio TTS/STT via mlx-audio
    • Text embeddings endpoint
    • Continuous batching
    • KV cache quantization
    • TurboQuant V-cache compression
    • Tool logits bias for jump-forward decoding
    • MCP configuration support
    • Gradio chat UI (optional)
    • Schema-constrained JSON output (outlines; see the sketch after this list)
    • Built-in self-diagnostics (rapid-mlx doctor)
    • Model-Harness Index (MHI) benchmarking
    • 2100+ test suite
    • Homebrew and pip installation
    • Rate limiting and API key authentication
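
    For the schema-constrained output above, a plausible request shape is the OpenAI-style json_schema response_format; this is a sketch under that assumption, and Rapid-MLX's actual parameter handling for its outlines-backed decoding may differ:

        # Sketch: request JSON that must validate against a schema.
        import json
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

        schema = {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "stars": {"type": "integer"},
            },
            "required": ["name", "stars"],
        }

        response = client.chat.completions.create(
            model="qwen3-8b",  # placeholder model ID
            messages=[{"role": "user", "content": "Invent a plausible GitHub repo."}],
            response_format={
                "type": "json_schema",
                "json_schema": {"name": "repo", "schema": schema},
            },
        )
        print(json.loads(response.choices[0].message.content))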

    Integrations

    Cursor
    Claude Code
    Aider
    Continue.dev
    Open WebUI
    LibreChat
    LangChain
    PydanticAI
    smolagents
    Hermes Agent
    OpenClaude
    Goose
    Claw Code
    Anthropic SDK
    OpenAI SDK
    HuggingFace
    Ollama (comparison)
    MCP (Model Context Protocol)
    LiteLLM
    Gradio

    Developer

    raullenchai

    raullenchai builds Rapid-MLX, the fastest local AI inference engine for Apple Silicon Macs. The project leverages Apple's MLX framework to deliver 2-4x faster throughput than Ollama and llama.cpp, with full OpenAI API compatibility and 17 tool call parsers. Rapid-MLX supports models from 4B to 158B parameters and integrates with popular AI coding tools like Cursor, Claude Code, and Aider.

    Founded 2025
    San Francisco Bay Area, CA
    1 employee

    Used by

    Cursor (compatible)
    Claude Code (compatible)
    Aider (compatible)
    PydanticAI (integrated)
    Website · GitHub
    1 tool in directory

    Similar Tools

    Synthetic

    AI platform providing access to multiple LLMs with subscription or usage-based pricing, offering both UI and API access.

    Bodega Inference Engine

    Enterprise-grade local LLM inference engine built specifically for Apple Silicon, featuring a multi-model registry, OpenAI-compatible API, and high-throughput continuous batching.

    Lemonade

    Open-source local LLM server for Windows, Linux, and macOS that runs LLMs, image generation, speech, and more on GPUs and NPUs with an OpenAI-compatible API.


    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    91 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    212 tools

    LLM Orchestration

    Platforms and frameworks for designing, managing, and deploying complex LLM workflows with visual interfaces, allowing for the coordination of multiple AI models and services.

    107 tools