    Bodega Inference Engine

    Local Inference

    Enterprise-grade local LLM inference engine built specifically for Apple Silicon, featuring a multi-model registry, OpenAI-compatible API, and high-throughput continuous batching.


    At a Glance

    Pricing
    Open Source

    Fully free and open-source inference engine available on GitHub.

    Available On

    macOS
    API
    CLI

    Resources

    Website · Docs · GitHub · llms.txt

    Topics

    Local Inference · LLM Orchestration · AI Infrastructure

    Alternatives

    IonRouter · Synthetic · PaddlePaddle

    Listed Apr 2026

    About Bodega Inference Engine

    Bodega Inference Engine delivers enterprise-grade LLM inference directly on Apple Silicon hardware. It provides an OpenAI-compatible REST API with a multi-model registry architecture, allowing multiple models to run simultaneously in isolated subprocesses. Built in Python, it is optimized for Metal Unified Memory and supports language models, multimodal vision models, image generation, and image editing.
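Because the engine exposes an OpenAI-compatible API, the stock openai Python client can talk to it directly. A minimal sketch, assuming a local server; the base URL, port, and model id below are illustrative assumptions, not documented values:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Bodega server.
# Base URL, port, and model id are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-local-model",  # hypothetical registry id
    messages=[
        {"role": "user", "content": "Explain continuous batching in one sentence."}
    ],
)
print(response.choices[0].message.content)
```

In principle, existing OpenAI-based code only needs its base_url swapped to target the local engine.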

    • Multi-model registry — dynamically load, route to, and unload multiple models simultaneously, each running in its own hardware-isolated subprocess via /v1/admin/load-model and /v1/admin/unload-model/{model_id} (see the endpoint sketch after this list).
    • OpenAI-compatible API — drop-in replacement for OpenAI's chat completions endpoint; supports streaming, tool calling, JSON mode, and structured outputs via JSON schema constraints.
    • Continuous batching — high-throughput batching engine approaches ~900 tok/s in-process on M4 Max for small models; configurable via cb_max_num_seqs, cb_completion_batch_size, cb_prefill_batch_size, and cb_chunked_prefill_tokens.
    • Speculative decoding — pairs a small draft model with a large target model to achieve 2–3x generation speedup for single-user, latency-sensitive workloads without changing output quality.
    • Prompt caching — native MLX token-index caching bypasses matrix multiplication for recurring prefixes, dramatically reducing time-to-first-token on repeated sequences.
    • Multimodal support — vision-language models accept image URLs or base64-encoded images alongside text prompts using the standard image_url content block format.
    • Image generation & editing — load image generation models (solomon, keshav, rehoboam, etc.) and generate or edit images via /v1/images/generations and /v1/images/edits.
    • Built-in RAG pipeline — self-contained PDF indexing and retrieval using FAISS cosine-similarity; upload documents via /v1/rag/upload and query them via /v1/rag/query.
    • HuggingFace model support — load any HuggingFace text-generation or image-text-to-text model, not just SRSWTI models; supports LoRA adapters, custom chat templates, and quantization.
    • Terminal monitoring — interactive setup script configures a terminal-based monitoring tool, downloads models, and runs benchmarks or an interactive chat shell.
    • Health & queue endpoints — real-time Metal Unified Memory metrics, per-model RAM usage, and queue statistics via /health, /v1/admin/loaded-models, and /v1/queue/stats.
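The admin, RAG, and health endpoints named above suggest a simple operational loop. A hedged sketch using plain requests: only the endpoint paths come from the feature list, while the port, payload fields, multipart field name, and HTTP methods are assumptions:

```python
import requests

BASE = "http://localhost:8000"  # assumed host/port

# Load a model into the registry (payload shape is an assumption).
requests.post(f"{BASE}/v1/admin/load-model", json={"model_id": "my-local-model"})

# Inspect loaded models, Metal Unified Memory metrics, and queue depth.
print(requests.get(f"{BASE}/v1/admin/loaded-models").json())
print(requests.get(f"{BASE}/health").json())
print(requests.get(f"{BASE}/v1/queue/stats").json())

# Index a PDF into the built-in RAG store, then query it
# (multipart field name and query payload are assumptions).
with open("paper.pdf", "rb") as f:
    requests.post(f"{BASE}/v1/rag/upload", files={"file": f})
print(requests.post(f"{BASE}/v1/rag/query",
                    json={"query": "What does the document conclude?"}).json())

# Unload when finished to free unified memory.
requests.post(f"{BASE}/v1/admin/unload-model/my-local-model")
```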



    Capabilities

    Key Features

    • Multi-model registry with dynamic loading and unloading
    • OpenAI-compatible chat completions API
    • Streaming responses via Server-Sent Events
    • Continuous batching for high-throughput multi-user workloads
    • Speculative decoding for low-latency single-user workloads
    • Prompt caching with MLX token-index cache
    • Structured output via JSON schema constraints (see the sketch after this list)
    • Multimodal vision model support
    • Image generation and image editing endpoints
    • Built-in RAG pipeline with FAISS for PDF documents
    • HuggingFace model download and local cache management
    • LoRA adapter support
    • Custom chat template support
    • Reasoning model support with configurable parsers
    • Real-time memory and queue monitoring endpoints
    • Multi-process isolated handler architecture preventing Metal memory leaks
    • Quantization support (4-bit, 8-bit, 16-bit)
    • Chunked prefill for large-context requests
    • Block-aware prefix caching for shared prompts
    • Interactive setup script with benchmarking tools
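Two of the features above, SSE streaming and JSON-schema structured output, map onto standard OpenAI client parameters. A sketch under the same assumptions as the earlier example (local base URL, hypothetical model id); the response_format shape follows the OpenAI json_schema convention the page says is supported:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Streaming: deltas arrive incrementally over Server-Sent Events.
stream = client.chat.completions.create(
    model="my-local-model",  # hypothetical registry id
    messages=[{"role": "user", "content": "Name three MLX features."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()

# Structured output: constrain generation with a JSON schema.
result = client.chat.completions.create(
    model="my-local-model",
    messages=[{"role": "user", "content": "Give a product name and launch year."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "fact",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "year": {"type": "integer"},
                },
                "required": ["name", "year"],
            },
        },
    },
)
print(result.choices[0].message.content)  # JSON conforming to the schema
```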

    Integrations

    HuggingFace Hub
    LM Studio
    MLX
    FAISS
    OpenAI API (compatible)
    Qwen models
    DeepSeek models
    LLaVA
    InternVL


    Developer

    SRSWTI

    SRSWTI builds high-performance AI inference tooling and fine-tuned models optimized for Apple Silicon. The team develops the Bodega Inference Engine — a multi-model local inference runtime with an OpenAI-compatible API — alongside a suite of open-weight models published on HuggingFace. SRSWTI focuses on maximizing throughput and memory efficiency on Apple Silicon's unified memory architecture, pushing the boundaries of on-device LLM performance.

    Website · GitHub
    1 tool in directory

    Similar Tools


    IonRouter

    High throughput, low cost AI inference API powered by IonAttention, supporting LLMs, vision, image, video, and audio models with OpenAI-compatible endpoints.


    Synthetic

    AI platform providing access to multiple LLMs with subscription or usage-based pricing, offering both UI and API access.


    PaddlePaddle

    An open-source deep learning platform developed by Baidu for industrial-grade AI development and deployment.


    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    60 tools

    LLM Orchestration

    Platforms and frameworks for designing, managing, and deploying complex LLM workflows with visual interfaces, allowing for the coordination of multiple AI models and services.

    72 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    172 tools