Bodega Inference Engine
Enterprise-grade local LLM inference engine built specifically for Apple Silicon, featuring a multi-model registry, OpenAI-compatible API, and high-throughput continuous batching.
At a Glance
Fully free and open-source inference engine available on GitHub.
Listed Apr 2026
About Bodega Inference Engine
Bodega Inference Engine delivers enterprise-grade LLM inference directly on Apple Silicon hardware. It provides an OpenAI-compatible REST API with a multi-model registry architecture, allowing multiple models to run simultaneously in isolated subprocesses. Built in Python, it is optimized for Metal Unified Memory and supports language models, multimodal vision models, image generation, and image editing.
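Because the API is OpenAI-compatible, a chat request is just the standard payload posted to the local server. A minimal sketch, assuming the server listens on `localhost:8000` and a model with id `my-model` has been loaded (both are placeholders for your own deployment):

```python
import json

# Placeholder base URL for a local Bodega deployment -- adjust as needed.
BASE_URL = "http://localhost:8000"

# Standard OpenAI-style chat completion body; any OpenAI SDK pointed at
# BASE_URL should work as a drop-in replacement.
payload = {
    "model": "my-model",  # a model id registered in the multi-model registry
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is unified memory?"},
    ],
    "max_tokens": 128,
}

# POST this body to f"{BASE_URL}/v1/chat/completions" with any HTTP client.
body = json.dumps(payload)
```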
- Multi-model registry — dynamically load, route to, and unload multiple models simultaneously, each running in its own hardware-isolated subprocess via `/v1/admin/load-model` and `/v1/admin/unload-model/{model_id}`.
- OpenAI-compatible API — drop-in replacement for OpenAI's chat completions endpoint; supports streaming, tool calling, JSON mode, and structured outputs via JSON schema constraints.
- Continuous batching — high-throughput batching engine approaches ~900 tok/s in-process on M4 Max for small models; configurable via `cb_max_num_seqs`, `cb_completion_batch_size`, `cb_prefill_batch_size`, and `cb_chunked_prefill_tokens`.
- Speculative decoding — pairs a small draft model with a large target model to achieve a 2–3x generation speedup for single-user, latency-sensitive workloads without changing output quality.
- Prompt caching — native MLX token-index caching bypasses matrix multiplication for recurring prefixes, dramatically reducing time-to-first-token on repeated sequences.
- Multimodal support — vision-language models accept image URLs or base64-encoded images alongside text prompts using the standard `image_url` content block format.
- Image generation & editing — load image generation models (solomon, keshav, rehoboam, etc.) and generate or edit images via `/v1/images/generations` and `/v1/images/edits`.
- Built-in RAG pipeline — self-contained PDF indexing and retrieval using FAISS cosine similarity; upload documents via `/v1/rag/upload` and query them via `/v1/rag/query`.
- HuggingFace model support — load any HuggingFace `text-generation` or `image-text-to-text` model, not just SRSWTI models; supports LoRA adapters, custom chat templates, and quantization.
- Terminal monitoring — interactive setup script configures a terminal-based monitoring tool, downloads models, and runs benchmarks or an interactive chat shell.
- Health & queue endpoints — real-time Metal Unified Memory metrics, per-model RAM usage, and queue statistics via `/health`, `/v1/admin/loaded-models`, and `/v1/queue/stats`.
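Two of the features above compose in a single request: an `image_url` content block for a vision model, and structured output constrained by a JSON schema. A hedged sketch — the model id, image URL, and schema are illustrative placeholders, not names from the project:

```python
import json

# Request body combining multimodal input with JSON-schema-constrained output.
payload = {
    "model": "my-vision-model",  # placeholder vision-language model id
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image as JSON."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    # OpenAI-style structured output: the response should validate
    # against this schema when JSON-schema constraints are active.
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "image_description",
            "schema": {
                "type": "object",
                "properties": {
                    "subject": {"type": "string"},
                    "colors": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["subject"],
            },
        },
    },
}

# POST to /v1/chat/completions as usual; only the body changes.
body = json.dumps(payload)
```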
Pricing
Open Source
Capabilities
Key Features
- Multi-model registry with dynamic loading and unloading
- OpenAI-compatible chat completions API
- Streaming responses via Server-Sent Events
- Continuous batching for high-throughput multi-user workloads
- Speculative decoding for low-latency single-user workloads
- Prompt caching with MLX token-index cache
- Structured output via JSON schema constraints
- Multimodal vision model support
- Image generation and image editing endpoints
- Built-in RAG pipeline with FAISS for PDF documents
- HuggingFace model download and local cache management
- LoRA adapter support
- Custom chat template support
- Reasoning model support with configurable parsers
- Real-time memory and queue monitoring endpoints
- Multi-process isolated handler architecture preventing Metal memory leaks
- Quantization support (4-bit, 8-bit, 16-bit)
- Chunked prefill for large-context requests
- Block-aware prefix caching for shared prompts
- Interactive setup script with benchmarking tools
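Streaming responses arrive as Server-Sent Events, as listed above: lines of the form `data: {...}` carrying delta chunks, terminated by `data: [DONE]`. A minimal sketch of accumulating the streamed text; the sample stream below is an illustrative fragment, not captured output from the engine:

```python
import json

# Illustrative SSE fragment in the OpenAI chat-completions streaming format.
sample_stream = """\
data: {"choices": [{"delta": {"content": "Hel"}}]}

data: {"choices": [{"delta": {"content": "lo"}}]}

data: [DONE]
"""

def collect_text(stream: str) -> str:
    """Accumulate delta.content fields from an SSE chat-completion stream."""
    parts = []
    for line in stream.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

print(collect_text(sample_stream))  # -> Hello
```

In practice the same loop runs over an HTTP response iterated line by line rather than a string.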
