EveryDev.ai
Subscribe
Home
Tools

2,835+ AI tools

  • New
  • Trending
  • Featured
  • Compare
  • Arena
Categories
  • Agents1815
  • Coding1295
  • Infrastructure600
  • Marketing467
  • Projects433
  • Research403
  • Analytics351
  • Design338
  • Security243
  • MCP242
  • Testing238
  • Data230
  • Integration178
  • Prompts160
  • Learning159
  • Communication154
  • Extensions150
  • Voice130
  • Commerce125
  • DevOps108
  • Web80
  • Finance21
AI Tools by Topic
  • AI Coding Assistants
  • Agent Frameworks
  • MCP Servers
  • AI Prompt Tools
  • Vibe Coding Tools
  • AI Design Tools
  • AI Database Tools
  • AI Website Builders
  • AI Testing Tools
  • LLM Evaluations
Follow Us
  • X / Twitter
  • LinkedIn
  • Reddit
  • Discord
  • Threads
  • Bluesky
  • Mastodon
  • YouTube
  • GitHub
  • Instagram
Get Started
  • About
  • Editorial Standards
  • Corrections & Disclosures
  • Community Guidelines
  • Advertise
  • Contact Us
  • Newsletter
  • Submit a Tool
  • Start a Discussion
  • Write A Blog
  • Share A Build
  • Terms of Service
  • Privacy Policy
Explore with AI
  • ChatGPT
  • Gemini
  • Claude
  • Grok
  • Perplexity
Agent Experience
  • llms.txt
Theme
With AI, Everyone is a Dev. EveryDev.ai © 2026
    1. Home
    2. Tools
    3. Rapid-MLX
    Rapid-MLX icon

    Rapid-MLX

    Local Inference

    The fastest local AI inference engine for Apple Silicon Macs, offering OpenAI-compatible API, 17 tool parsers, prompt cache, and 2-4x faster speeds than Ollama.

    Visit Website

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under Apache License 2.0. No cost to use, modify, or distribute.

    Engagement

    Available On

    macOS
    API
    VS Code
    JetBrains
    CLI

    Resources

    WebsiteDocsGitHubllms.txt

    Topics

    Local InferenceAI InfrastructureLLM Orchestration

    Alternatives

    OsaurusMLX-VLMBodega Inference Engine
    Developer
    raullenchaiMenlo Park, CAEst. 2017$80000000 raised

    Listed May 2026

    About Rapid-MLX

    Rapid-MLX is an open-source local AI inference server built specifically for Apple Silicon Macs, leveraging Apple's MLX framework for maximum performance. It provides a drop-in OpenAI-compatible API that works with Cursor, Claude Code, Aider, LangChain, PydanticAI, and any OpenAI-compatible application. With 2-4x faster throughput than Ollama and llama.cpp on most models, it delivers frontier-level AI locally with no cloud costs or API keys required. The project is licensed under Apache 2.0 and supports models ranging from 4B to 158B parameters.

    • OpenAI-Compatible API — Install via pip install rapid-mlx or Homebrew, then rapid-mlx serve <model> to start a server at localhost:8000/v1 that any OpenAI-compatible app can use immediately.
    • 17 Tool Call Parsers — Supports Hermes, Qwen, DeepSeek, Llama, Mistral, GLM, MiniMax, Kimi, and more, with automatic recovery when quantized models produce broken tool call output.
    • Prompt Cache — KV cache trimming for transformer models and DeltaNet RNN state snapshots for hybrid models (Qwen3.5), delivering 2-5x faster Time To First Token on subsequent turns.
    • Reasoning Separation — Chain-of-thought reasoning from models like Qwen3 and DeepSeek-R1 is cleanly separated into a reasoning_content field, streamed independently from the main response.
    • Smart Cloud Routing — Automatically offloads large-context requests to a cloud LLM (GPT-5, Claude, etc.) when local prefill would be too slow, configurable via --cloud-model and --cloud-threshold.
    • Multimodal Support — Vision (Gemma 4, Qwen-VL), audio TTS/STT, video understanding, and text embeddings all served through the same OpenAI-compatible API with optional extras.
    • Model-Harness Index (MHI) — Built-in benchmark combining tool calling (50%), HumanEval (30%), and MMLU (20%) to measure real-world agent performance across 25 model-harness combinations.
    • Wide Client Compatibility — Tested and documented setup for Cursor, Continue.dev, Aider, Open WebUI, LibreChat, PydanticAI, smolagents, LangChain, Hermes Agent, and more.
    • Self-Diagnostics — Run rapid-mlx doctor to verify Metal GPU availability, imports, CLI, and model loading without needing developer tools.
    • 2100+ Tests — Comprehensive pytest unit suite plus stress, soak, and multi-model regression harnesses for production-grade reliability.
    Rapid-MLX - 1

    Community Discussions

    Be the first to start a conversation about Rapid-MLX

    Share your experience with Rapid-MLX, ask questions, or help others learn from your insights.

    Pricing

    OPEN SOURCE

    Open Source

    Fully free and open-source under Apache License 2.0. No cost to use, modify, or distribute.

    • Full local AI inference on Apple Silicon
    • OpenAI-compatible API
    • 17 tool call parsers
    • Prompt cache (KV + DeltaNet snapshots)
    • Vision, audio, and embeddings support

    Capabilities

    Key Features

    • OpenAI-compatible REST API
    • 17 tool call parsers with auto-recovery
    • Prompt cache (KV + DeltaNet RNN state snapshots)
    • Reasoning separation for chain-of-thought models
    • Smart cloud routing for large-context requests
    • Vision/multimodal support (Gemma 4, Qwen-VL)
    • Audio TTS/STT via mlx-audio
    • Text embeddings endpoint
    • Continuous batching
    • KV cache quantization
    • TurboQuant V-cache compression
    • Tool logits bias for jump-forward decoding
    • MCP configuration support
    • Gradio chat UI (optional)
    • Schema-constrained JSON output (outlines)
    • Built-in self-diagnostics (rapid-mlx doctor)
    • Model-Harness Index (MHI) benchmarking
    • 2100+ test suite
    • Homebrew and pip installation
    • Rate limiting and API key authentication

    Integrations

    Cursor
    Claude Code
    Aider
    Continue.dev
    Open WebUI
    LibreChat
    LangChain
    PydanticAI
    smolagents
    Hermes Agent
    OpenClaude
    Goose
    Claw Code
    Anthropic SDK
    OpenAI SDK
    HuggingFace
    Ollama (comparison)
    MCP (Model Context Protocol)
    LiteLLM
    Gradio
    API Available
    View Docs

    Ratings & Reviews

    No ratings yet

    Be the first to rate Rapid-MLX and help others make informed decisions.

    Developer

    raullenchai

    raullenchai builds Rapid-MLX, the fastest local AI inference engine for Apple Silicon Macs. The project leverages Apple's MLX framework to deliver 2-4x faster throughput than Ollama and llama.cpp, with full OpenAI API compatibility and 17 tool call parsers. Rapid-MLX supports models from 4B to 158B parameters and integrates with popular AI coding tools like Cursor, Claude Code, and Aider.

    Founded 2017
    Menlo Park, CA
    $80000000 raised
    100 employees

    Used by

    Samsung Next
    Bosch
    Helium
    Polygon
    +1 more
    Read more about raullenchai
    WebsiteGitHub
    1 tool in directory

    Similar Tools

    Osaurus icon

    Osaurus

    Osaurus is a local-first AI runtime optimized for Apple Silicon that runs open-source models on Mac with privacy and no cloud dependency.

    MLX-VLM icon

    MLX-VLM

    A Python library for running Vision Language Models on Apple Silicon using the MLX framework.

    Bodega Inference Engine icon

    Bodega Inference Engine

    Enterprise-grade local LLM inference engine built specifically for Apple Silicon, featuring a multi-model registry, OpenAI-compatible API, and high-throughput continuous batching.

    Browse all tools

    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    129 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    282 tools

    LLM Orchestration

    Platforms and frameworks for designing, managing, and deploying complex LLM workflows with visual interfaces, allowing for the coordination of multiple AI models and services.

    153 tools
    Browse all topics
    Back to all toolsSuggest an edit
    ratings
    discussions
    48views