Agentic Harness Engineering
An open-source observability system that automatically evolves coding-agent harnesses—system prompts, tools, middleware, skills, and memory—without changing the base model.
At a Glance
Fully open-source under the MIT license, free to use, modify, and distribute.
Listed Jun 2026
About Agentic Harness Engineering
Agentic Harness Engineering (AHE) is an open-source framework published on GitHub under the MIT license by Jiahang Lin and the china-qijizhifeng organization. It introduces an observability-driven loop that automatically improves the components wrapping a coding agent—system prompts, tool descriptions, tool implementations, middleware, skills, sub-agents, and long-term memory—while keeping the base model frozen. The accompanying arXiv paper (arXiv:2604.25850, released April 2026) documents the methodology and benchmark results.
What It Is
AHE is a meta-engineering framework for coding agents. Rather than fine-tuning or retraining a model, it treats the harness around the model as the artifact to evolve. Each outer iteration follows an evaluate → analyze → improve cycle driven by three observability layers:
- Component observability: the NexAU framework decomposes the harness into seven orthogonal, git-tracked file-level components.
- Experience observability: an Agent Debugger compresses raw execution traces into layered, sourced reports.
- Decision observability: an Evolve Agent proposes evidence-backed edits, and the next iteration's results automatically test (and can falsify) the predictions behind them.
How the Evolution Loop Works
Each iteration runs the current code_agent over a dataset inside isolated E2B sandboxes and writes per-task artifacts: a full step-level trace, a runtime log, and a pass/fail outcome. The Agent Debugger then compresses those traces, which routinely exceed 10 million tokens, into cross-task and per-task analysis reports. The Evolve Agent reads those reports, proposes targeted edits to the seven harness components inside workspace/, and must commit four fields for every change: failure evidence, root cause, targeted fix, and predicted impact. Edits whose predictions do not hold on the next evaluation are rolled back or revised. The loop terminates when a target pass rate or the maximum iteration count is reached.
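The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in evolve.py: the names Harness, ChangeRecord, and evolve, the callback signatures, and the rule "keep an edit only if the measured pass rate improves" are all hypothetical simplifications of the evaluate → analyze → improve cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Harness:
    """Stand-in for the git-tracked harness files; only a version number here."""
    version: int = 0


@dataclass(frozen=True)
class ChangeRecord:
    """The four fields the text says accompany every harness edit."""
    failure_evidence: str
    root_cause: str
    targeted_fix: str
    predicted_impact: str


def evolve(harness, evaluate, analyze, propose, target_pass_rate, max_iterations):
    """Run evaluate -> analyze -> improve until a stop condition is met.

    Assumption: an edit's prediction "holds" when the measured pass rate
    rises; otherwise the candidate is discarded (a simple rollback).
    """
    pass_rate = evaluate(harness)
    history = [pass_rate]
    log = []  # (ChangeRecord, accepted) pairs, one per proposed edit
    for _ in range(max_iterations):
        if pass_rate >= target_pass_rate:
            break
        report = analyze(harness, pass_rate)
        candidate, record = propose(harness, report)
        new_rate = evaluate(candidate)
        accepted = new_rate > pass_rate
        if accepted:
            harness, pass_rate = candidate, new_rate
        log.append((record, accepted))
        history.append(pass_rate)
    return harness, history, log


# Toy stand-ins: fixed pass rates per harness version (invented numbers).
toy_scores = {0: 0.60, 1: 0.55, 2: 0.70, 3: 0.80}
toy_versions = iter([1, 2, 3])


def toy_evaluate(harness):
    return toy_scores[harness.version]


def toy_analyze(harness, pass_rate):
    return f"v{harness.version} passes {pass_rate:.0%} of tasks"


def toy_propose(harness, report):
    record = ChangeRecord(report, "toy root cause", "toy fix", "higher pass rate")
    return Harness(next(toy_versions)), record


final_harness, history, log = evolve(
    Harness(), toy_evaluate, toy_analyze, toy_propose,
    target_pass_rate=0.75, max_iterations=5,
)
print(final_harness, history)
```

In the toy run, the first proposed edit lowers the pass rate and is discarded, the next two are kept, and the loop stops once the target is met. A real system would also need the "revise" path mentioned in the text, which this sketch omits.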
Benchmark Results and Transfer
The repository states that across ten iterations, AHE lifts Terminal-Bench 2 pass@1 from 69.7% to 77.0% on GPT-5.4, surpassing both the hand-written Codex baseline (71.9%) and the self-evolving ACE and TF-GRPO baselines. The repository also reports that AHE on GPT-5.5 reached 84.7% on the Terminal-Bench 2.0 leaderboard (ranked #3 as of 2026-05-15, per the README). The evolved harness is described as transferring without re-evolution to SWE-bench-Verified and to four alternate base models, which the authors interpret as evidence that the evolved components encode general engineering experience rather than benchmark-specific tuning.
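To put the GPT-5.4 figures in perspective, the short calculation below derives the gains implied by the three quoted pass@1 numbers. It uses only those reported percentages; the variable names are illustrative and no additional benchmark data is assumed.

```python
# pass@1 percentages quoted above for Terminal-Bench 2 on GPT-5.4
seed_harness = 69.7
evolved_harness = 77.0
codex_baseline = 71.9

# Absolute gain over the starting harness, in percentage points
absolute_gain = round(evolved_harness - seed_harness, 1)

# Relative gain over the starting pass rate, in percent
relative_gain = round(100 * (evolved_harness - seed_harness) / seed_harness, 1)

# Margin over the hand-written Codex baseline, in percentage points
margin_over_codex = round(evolved_harness - codex_baseline, 1)

# Share of previously failing tasks that now pass, in percent
failure_reduction = round(
    100 * (evolved_harness - seed_harness) / (100 - seed_harness), 1
)

print(absolute_gain, relative_gain, margin_over_codex, failure_reduction)
# prints: 7.3 10.5 5.1 24.1
```

Read this way, the reported improvement is 7.3 points absolute, and roughly a quarter of the tasks the starting harness failed are solved after evolution.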
Architecture and Key Components
- evolve.py: main-loop orchestrator
- agents/code_agent_simple/: the coding agent under evolution
- agents/evolve_agent/: the meta-agent that performs the improvement step, built on the NexAU framework
- agents/explore_agent/: upstream dataset and source-code exploration agent
- configs/: base.yaml shared defaults plus per-experiment overlays
- NexAU: the underlying component framework that exposes the seven harness files as git-tracked units
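The "shared defaults plus per-experiment overlays" layout in configs/ is commonly implemented as a recursive dictionary merge. The sketch below shows that general pattern on already-parsed data. It is an assumption about how such overlays typically work, not AHE's actual loader, and every key and value in it is a hypothetical placeholder.

```python
from copy import deepcopy


def merge_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values override base values.

    Nested dicts are merged key by key; any other value is replaced whole.
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents, as if parsed from base.yaml and one experiment overlay.
base_config = {
    "model": {"name": "base-model", "temperature": 0.0},
    "max_iterations": 10,
}
experiment_overlay = {
    "model": {"name": "other-model"},
    "dataset": "toy-set",
}

config = merge_overlay(base_config, experiment_overlay)
print(config)
```

The overlay only needs to state what differs from the defaults, so each experiment file stays short and the shared settings live in one place.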
The system requires Python ≥ 3.13, uv for dependency management, tmux for session management, an LLM API endpoint, an E2B sandbox API key (SaaS or self-hosted), and a Serper API key for web search used by the Evolve Agent.
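A small preflight check can confirm these prerequisites before a long run starts. The sketch below is illustrative only: the environment variable names (LLM_API_KEY, E2B_API_KEY, SERPER_API_KEY) are assumptions rather than names confirmed by the repository, and the preflight function itself is hypothetical.

```python
import os
import shutil
import sys

# Assumed variable names; check the project's README for the real ones.
REQUIRED_ENV_VARS = ("LLM_API_KEY", "E2B_API_KEY", "SERPER_API_KEY")
REQUIRED_TOOLS = ("uv", "tmux")
MIN_PYTHON = (3, 13)


def preflight(env=None, which=shutil.which, python_version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    if python_version is None:
        python_version = sys.version_info[:2]

    problems = []
    if tuple(python_version) < MIN_PYTHON:
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python >= 3.13 required, found {found}")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```

The env, which, and python_version parameters exist so the check can be exercised without touching the real machine state.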
Update: April–May 2026 Release
The framework was released in April 2026, with the arXiv paper published on April 28, 2026. A blog post on Dawning Road (English and Chinese) followed on April 30, 2026. The leaderboard result (84.7% on Terminal-Bench 2.0 with GPT-5.5, ranked #3) was reported on May 14, 2026. The repository had accumulated 600 stars and 67 forks as of the last recorded update in June 2026. The Agent Debugger component is noted in the README as only partially open-sourced at this time due to company strategy.
Pricing
Open Source
- Full source code access
- MIT license
- Community support via GitHub Issues
Capabilities
Key Features
- Observability-driven automatic harness evolution
- Three-layer observability: component, experience, and decision
- NexAU integration for seven orthogonal git-tracked harness components
- Agent Debugger compresses multi-million-token traces into sourced reports
- Evolve Agent proposes evidence-backed, falsifiable edits
- Evaluate → Analyze → Improve outer loop
- E2B sandbox isolation for every rollout (SaaS and self-hosted)
- Cross-model harness transfer without re-evolution
- Configurable base + experiment overlay YAML system
- tmux-based session management for long-running experiments
- Resume interrupted experiments from any iteration
- Skip-eval mode for debugging the evolve step in isolation
- Batch experiment launching
- Feishu webhook notifications for experiment milestones
- Web search integration via Serper API for Evolve Agent
