EveryDev.ai
Subscribe
Home
Tools

3,020+ AI tools

  • New
  • Trending
  • Featured
  • Compare
  • Arena
Categories
  • Agents2063
  • Coding1441
  • Infrastructure665
  • Marketing524
  • Projects470
  • Research437
  • Design408
  • Analytics371
  • MCP268
  • Security265
  • Testing255
  • Data249
  • Integration183
  • Prompts183
  • Communication172
  • Learning166
  • Extensions163
  • Voice146
  • Commerce132
  • DevOps115
  • Web84
  • Finance24
AI Tools by Topic
  • AI Coding Assistants
  • Agent Frameworks
  • MCP Servers
  • AI Prompt Tools
  • Vibe Coding Tools
  • AI Design Tools
  • AI Database Tools
  • AI Website Builders
  • AI Testing Tools
  • LLM Evaluations
Follow Us
  • X / Twitter
  • LinkedIn
  • Reddit
  • Discord
  • Threads
  • Bluesky
  • Mastodon
  • YouTube
  • GitHub
  • Instagram
Get Started
  • About
  • Editorial Standards
  • Corrections & Disclosures
  • Community Guidelines
  • Advertise
  • Contact Us
  • Newsletter
  • Submit a Tool
  • Start a Discussion
  • Write A Blog
  • Share A Build
  • Terms of Service
  • Privacy Policy
Explore with AI
  • ChatGPT
  • Gemini
  • Claude
  • Grok
  • Perplexity
Agent Experience
  • llms.txt
Theme
With AI, Everyone is a Dev. EveryDev.ai © 2026
    1. Home
    2. Tools
    3. oMLX
    oMLX icon

    oMLX

    Local Inference
    Featured

    macOS-native LLM inference server for Apple Silicon with continuous batching and tiered SSD KV caching, managed from the menu bar.

    Visit Website

    At a Glance

    Pricing
    Open Source

    Fully free and open source under Apache 2.0. Download the macOS app, install via Homebrew, or build from source.

    Engagement

    Available On

    macOS
    Web
    API
    CLI

    Resources

    WebsiteDocsGitHubllms.txt

    Topics

    Local InferenceAI InfrastructureAI Coding Assistants

    Alternatives

    Inception LabsDeepTideBodega Inference Engine
    Developer
    Jun KimSeoul, KoreaEst. 2026

    Listed Jul 2026

    About oMLX

    oMLX is an open-source LLM inference server built specifically for Apple Silicon Macs, released under the Apache 2.0 license. It addresses a core pain point for local AI coding workflows: KV cache invalidation that forces long recomputation waits every time a coding agent revisits a previous context. The project is maintained by jundot and has accumulated over 17,000 GitHub stars since its creation in early 2026.

    What It Is

    oMLX is a macOS-native server that runs large language models locally using Apple's MLX framework, with a two-tier KV cache architecture (hot RAM + cold SSD) that persists cache blocks across requests and server restarts. It exposes both OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages) API endpoints, making it a drop-in backend for tools like Claude Code, OpenClaw, Cursor, OpenCode, and Codex. The project started from vllm-mlx v0.1.0 and evolved significantly with multi-model serving, tiered KV caching, VLM support, an admin panel, and a native macOS menu bar app.

    Architecture and Caching Design

    The core innovation is a block-based paged KV cache inspired by vLLM, operating across two tiers:

    • Hot tier (RAM): Frequently accessed cache blocks stay in memory for fast access, with Copy-on-Write and prefix sharing.
    • Cold tier (SSD): When the hot cache fills, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they are restored from disk rather than recomputed — even after a server restart.

    The server architecture layers a FastAPI server over an EnginePool (supporting BatchedEngine, VLMEngine, EmbeddingEngine, and RerankerEngine), a ProcessMemoryEnforcer, an FCFS Scheduler using mlx-lm's BatchGenerator, and the full cache stack.

    Supported Models and Tool Calling

    oMLX serves any MLX-format model from HuggingFace, including Qwen, LLaMA, Mistral, Gemma, DeepSeek, MiniMax, GLM, and more. It supports text LLMs, vision-language models (VLMs), OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR), embedding models (BERT, BGE-M3, ModernBERT), and rerankers. Tool calling is auto-detected across all major formats: JSON <tool_call>, Qwen3.5 XML, Gemma, GLM, MiniMax, Mistral, Kimi K2, and Longcat. MCP (Model Context Protocol) tool integration is also supported.

    macOS App and Admin Dashboard

    The macOS app is a native Swift/SwiftUI menubar application — not Electron — that starts, stops, and monitors the server without opening a terminal. It includes persistent serving stats, auto-restart on crash, and Sparkle-driven auto-update. The web admin dashboard at /admin provides real-time monitoring, model management, built-in chat, one-click benchmarking, and a HuggingFace model downloader. The dashboard supports eight languages and all CDN dependencies are vendored for fully offline operation. Per-model settings (sampling parameters, TTL, aliases, profiles) can be changed without a server restart.

    Update: v0.4.4

    The latest release is v0.4.4, published on June 16, 2026. The repository was last pushed on June 30, 2026, indicating active development. Recent additions include Claude Code context scaling support (so auto-compact triggers at the right timing with smaller context models), SSE keep-alive to prevent read timeouts during long prefill, model profiles that expose named setting bundles as separate API model IDs with no extra memory cost, and optional native custom kernels for GLM-5.2 and MiniMax M3 via a HEAD Homebrew build.

    Setup Path

    oMLX can be installed three ways: download the signed and notarized DMG from GitHub Releases, install via Homebrew (brew tap jundot/omlx && brew install omlx), or clone from source with Python 3.10+ and pip install -e .. The macOS app reuses an existing LM Studio model directory with no re-download required. The server listens on localhost:8000 by default and is compatible with any OpenAI-compatible client.

    oMLX - 1

    Community Discussions

    Be the first to start a conversation about oMLX

    Share your experience with oMLX, ask questions, or help others learn from your insights.

    Pricing

    OPEN SOURCE

    Open Source

    Fully free and open source under Apache 2.0. Download the macOS app, install via Homebrew, or build from source.

    • Tiered KV caching (RAM + SSD)
    • Continuous batching
    • Multi-model serving (LLM, VLM, embedding, reranker)
    • OpenAI and Anthropic API compatibility
    • Native macOS menu bar app

    Capabilities

    Key Features

    • Tiered KV caching (hot RAM + cold SSD) with prefix sharing and Copy-on-Write
    • Continuous batching via mlx-lm BatchGenerator
    • Native Swift/SwiftUI macOS menu bar app (not Electron)
    • Multi-model serving: LLM, VLM, OCR, embedding, reranker
    • OpenAI-compatible and Anthropic-compatible API endpoints
    • Tool calling support for all major formats (JSON, Qwen, Gemma, GLM, MiniMax, Mistral, Kimi K2)
    • MCP (Model Context Protocol) tool integration
    • Web admin dashboard with real-time monitoring, chat, and benchmarking
    • HuggingFace model downloader built into admin panel
    • Per-model settings: sampling params, TTL, alias, profiles
    • Model pinning and LRU eviction
    • Vision-Language Model (VLM) support with paged SSD cache
    • Claude Code context scaling and SSE keep-alive
    • One-click integration setup for OpenClaw, OpenCode, Codex, Copilot, Hermes Agent
    • Homebrew install with background service support
    • Fully offline admin dashboard (vendored CDN dependencies)
    • API key authentication
    • Multi-language admin UI (English, Korean, Japanese, Chinese, French, Russian, Spanish, Portuguese)

    Integrations

    Claude Code
    OpenClaw
    Cursor
    OpenCode
    Codex
    Hermes Agent
    GitHub Copilot
    HuggingFace
    MLX (Apple)
    mlx-lm
    mlx-vlm
    MCP (Model Context Protocol)
    LM Studio (model directory reuse)
    Homebrew
    API Available
    View Docs

    Ratings & Reviews

    No ratings yet

    Be the first to rate oMLX and help others make informed decisions.

    Developer

    Jun Kim

    jundot builds oMLX, a macOS-native LLM inference server optimized for Apple Silicon. The project focuses on making local AI practical for real coding workflows by solving KV cache invalidation with a tiered RAM+SSD architecture. oMLX is open source under Apache 2.0 and actively maintained with regular releases.

    Founded 2026
    Seoul, Korea
    1 employees
    Read more about Jun Kim
    WebsiteGitHubX / Twitter
    1 tool in directory

    Similar Tools

    Inception Labs icon

    Inception Labs

    Diffusion-based large language models that generate tokens in parallel, delivering 5x faster inference with best-in-class quality at lower cost.

    DeepTide icon

    DeepTide

    A terminal-native AI coding agent built around DeepSeek, offering a Swift-native macOS binary and a cross-platform CLI with local inference support.

    Bodega Inference Engine icon

    Bodega Inference Engine

    Enterprise-grade local LLM inference engine built specifically for Apple Silicon, featuring a multi-model registry, OpenAI-compatible API, and high-throughput continuous batching.

    Browse all tools

    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    135 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    302 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    596 tools
    Browse all topics
    Back to all toolsSuggest an edit
    ratings
    discussions