forge

Name: forge
Availability: OnlineOnly
Author: Antoine Zambelli

A reliability layer for self-hosted LLM tool-calling that lifts small local models to top-tier performance on multi-step agentic workflows via guardrails and context management.

Visit Website

At a Glance

Pricing

Open Source

Fully free and open-source under the MIT License. Install via pip or clone from GitHub.

Engagement

Available On

Web

API

SDK

CLI

Antoine ZambelliAntoine Zambelli builds open-source Python tooling for self-…

Listed May 2026

About forge

Forge is an open-source Python framework by Antoine Zambelli that adds a reliability layer on top of self-hosted LLM backends for tool-calling and multi-step agentic workflows. It is published under the MIT license and available on PyPI as forge-guardrails. The framework is backed by a peer-reviewed paper published at ACM (DOI: 10.1145/3786335.3813193).

What It Is

Forge is a middleware and orchestration library designed to make small, locally-run language models (around 8B parameters) reliably execute structured tool-calling workflows. It addresses a core weakness of small models — their tendency to produce malformed tool calls, skip required steps, or lose context over long conversations — through composable guardrails, context compaction strategies, and a proxy server that makes any OpenAI-compatible client benefit from these improvements transparently.

Three Usage Modes

Forge offers three distinct integration patterns:

WorkflowRunner — A full agentic loop manager. Developers define tools, select a backend, and let Forge handle system prompts, tool execution, context compaction, and guardrails. SlotWorker extends this with priority-queued access to a shared GPU inference slot, enabling multi-agent architectures where specialist workflows share hardware.
Guardrails middleware — Composable middleware that plugs into an existing orchestration loop. The developer controls the loop; Forge validates responses, rescues malformed tool calls, and enforces required workflow steps.
Proxy server — A drop-in OpenAI-compatible proxy (python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server, applying guardrails transparently without client-side changes.

Guardrails and Context Management

The guardrail stack includes rescue parsing for malformed tool calls, retry nudges that guide the model back on track, and step enforcement that ensures required workflow steps are completed. Context management is VRAM-aware, with tiered compaction strategies (NoCompact, TieredCompact, SlidingWindowCompact) that keep token budgets within hardware limits. A synthetic respond tool is injected by the proxy to keep small models in tool-calling mode rather than switching to bare text output — the client never sees this internal mechanism.

Backend Support and Eval Results

Forge supports four backends:

llama-server (llama.cpp) — Recommended; the top 10 eval configurations all run on llama-server.
Ollama — Easier setup with built-in model management; slightly weaker on harder workloads.
Llamafile — Single binary, zero dependencies; uses prompt-injected function calling.
Anthropic — Frontier API baseline for hybrid workflows; no local GPU required.

The project ships a 26-scenario eval harness split into an OG-18 baseline tier and an 8-scenario advanced reasoning tier. According to the repository, the current top self-hosted configuration (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across all 26 scenarios and 76% on the hardest tier.

Architecture and Project Structure

The codebase is organized into clearly separated modules: core/ (workflow definition, inference loop, runner, slot worker), guardrails/ (nudge templates, response validator, step enforcer, error tracker), clients/ (Ollama, Llamafile, Anthropic), context/ (manager, compaction strategies, hardware detection), prompts/, tools/, and proxy/. The test suite includes 865 deterministic unit tests that require no LLM backend, plus the eval harness for live model qualification.

Update: Active Development as of May 2026

The repository was created in February 2026 and last pushed in May 2026, indicating active early development. It has accumulated over 1,100 stars and 56 forks according to the GitHub repository metadata. The published ACM paper provides a formal ablation study of the guardrail framework, and the preprint is preserved in the repository as a historical artifact.

Community Discussions

Be the first to start a conversation about forge

Share your experience with forge, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Fully free and open-source under the MIT License. Install via pip or clone from GitHub.

Full WorkflowRunner and SlotWorker
Guardrails middleware
OpenAI-compatible proxy server
All backend integrations (Ollama, llama-server, Llamafile, Anthropic)
26-scenario eval harness

Capabilities

Key Features

WorkflowRunner for full agentic loop management
SlotWorker for priority-queued multi-agent GPU slot sharing
Composable guardrails middleware for existing orchestration loops
OpenAI-compatible proxy server with transparent guardrail injection
Rescue parsing for malformed tool calls
Retry nudges for model correction
Required step enforcement
VRAM-aware context budget management
Tiered context compaction strategies (NoCompact, TieredCompact, SlidingWindowCompact)
Synthetic respond tool injection for small model reliability
26-scenario eval harness with OG-18 and advanced reasoning tiers
Batch eval with JSONL output and automatic resume
865 deterministic unit tests requiring no LLM backend
Support for Ollama, llama-server, Llamafile, and Anthropic backends
Hardware detection for VRAM-aware budgeting
SSE streaming support in proxy server

Integrations

Ollama

llama-server (llama.cpp)

Llamafile

Anthropic Claude

opencode

Continue

aider

PyPI (forge-guardrails)

Pydantic

API Available

View Docs

Back to all tools Suggest an edit