forge
A reliability layer for self-hosted LLM tool-calling that lifts small local models to top-tier performance on multi-step agentic workflows via guardrails and context management.
At a Glance
Fully free and open-source under the MIT License. Install via pip or clone from GitHub.
Engagement
Available On
Listed May 2026
About forge
Forge is an open-source Python framework by Antoine Zambelli that adds a reliability layer on top of self-hosted LLM backends for tool-calling and multi-step agentic workflows. It is published under the MIT license and available on PyPI as forge-guardrails. The framework is backed by a peer-reviewed paper published at ACM (DOI: 10.1145/3786335.3813193).
What It Is
Forge is a middleware and orchestration library designed to make small, locally-run language models (around 8B parameters) reliably execute structured tool-calling workflows. It addresses a core weakness of small models — their tendency to produce malformed tool calls, skip required steps, or lose context over long conversations — through composable guardrails, context compaction strategies, and a proxy server that makes any OpenAI-compatible client benefit from these improvements transparently.
Three Usage Modes
Forge offers three distinct integration patterns:
- WorkflowRunner — A full agentic loop manager. Developers define tools, select a backend, and let Forge handle system prompts, tool execution, context compaction, and guardrails.
SlotWorkerextends this with priority-queued access to a shared GPU inference slot, enabling multi-agent architectures where specialist workflows share hardware. - Guardrails middleware — Composable middleware that plugs into an existing orchestration loop. The developer controls the loop; Forge validates responses, rescues malformed tool calls, and enforces required workflow steps.
- Proxy server — A drop-in OpenAI-compatible proxy (
python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server, applying guardrails transparently without client-side changes.
Guardrails and Context Management
The guardrail stack includes rescue parsing for malformed tool calls, retry nudges that guide the model back on track, and step enforcement that ensures required workflow steps are completed. Context management is VRAM-aware, with tiered compaction strategies (NoCompact, TieredCompact, SlidingWindowCompact) that keep token budgets within hardware limits. A synthetic respond tool is injected by the proxy to keep small models in tool-calling mode rather than switching to bare text output — the client never sees this internal mechanism.
Backend Support and Eval Results
Forge supports four backends:
- llama-server (llama.cpp) — Recommended; the top 10 eval configurations all run on llama-server.
- Ollama — Easier setup with built-in model management; slightly weaker on harder workloads.
- Llamafile — Single binary, zero dependencies; uses prompt-injected function calling.
- Anthropic — Frontier API baseline for hybrid workflows; no local GPU required.
The project ships a 26-scenario eval harness split into an OG-18 baseline tier and an 8-scenario advanced reasoning tier. According to the repository, the current top self-hosted configuration (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across all 26 scenarios and 76% on the hardest tier.
Architecture and Project Structure
The codebase is organized into clearly separated modules: core/ (workflow definition, inference loop, runner, slot worker), guardrails/ (nudge templates, response validator, step enforcer, error tracker), clients/ (Ollama, Llamafile, Anthropic), context/ (manager, compaction strategies, hardware detection), prompts/, tools/, and proxy/. The test suite includes 865 deterministic unit tests that require no LLM backend, plus the eval harness for live model qualification.
Update: Active Development as of May 2026
The repository was created in February 2026 and last pushed in May 2026, indicating active early development. It has accumulated over 1,100 stars and 56 forks according to the GitHub repository metadata. The published ACM paper provides a formal ablation study of the guardrail framework, and the preprint is preserved in the repository as a historical artifact.
Community Discussions
Be the first to start a conversation about forge
Share your experience with forge, ask questions, or help others learn from your insights.
Pricing
Open Source
Fully free and open-source under the MIT License. Install via pip or clone from GitHub.
- Full WorkflowRunner and SlotWorker
- Guardrails middleware
- OpenAI-compatible proxy server
- All backend integrations (Ollama, llama-server, Llamafile, Anthropic)
- 26-scenario eval harness
Capabilities
Key Features
- WorkflowRunner for full agentic loop management
- SlotWorker for priority-queued multi-agent GPU slot sharing
- Composable guardrails middleware for existing orchestration loops
- OpenAI-compatible proxy server with transparent guardrail injection
- Rescue parsing for malformed tool calls
- Retry nudges for model correction
- Required step enforcement
- VRAM-aware context budget management
- Tiered context compaction strategies (NoCompact, TieredCompact, SlidingWindowCompact)
- Synthetic respond tool injection for small model reliability
- 26-scenario eval harness with OG-18 and advanced reasoning tiers
- Batch eval with JSONL output and automatic resume
- 865 deterministic unit tests requiring no LLM backend
- Support for Ollama, llama-server, Llamafile, and Anthropic backends
- Hardware detection for VRAM-aware budgeting
- SSE streaming support in proxy server
