Agentic Harness Engineering

Name: Agentic Harness Engineering
Availability: OnlineOnly
Author: china-qijizhifeng

An open-source, observability-driven framework that automatically evolves coding-agent harnesses (system prompts, tools, middleware, skills, and memory) without changing the base model.

At a Glance

Pricing

Open Source

Fully open-source under the MIT license, free to use, modify, and distribute.

Available On

macOS

Linux

API

CLI

china-qijizhifeng is the GitHub organization behind Agentic…

Listed Jun 2026

About Agentic Harness Engineering

Agentic Harness Engineering (AHE) is an open-source framework published on GitHub under the MIT license by Jiahang Lin and the china-qijizhifeng organization. It introduces an observability-driven loop that automatically improves the components wrapping a coding agent—system prompts, tool descriptions, tool implementations, middleware, skills, sub-agents, and long-term memory—while keeping the base model frozen. The accompanying arXiv paper (arXiv:2604.25850, released April 2026) documents the methodology and benchmark results.

What It Is

AHE is a meta-engineering framework for coding agents. Rather than fine-tuning or retraining a model, it treats the harness around the model as the artifact to evolve. Each outer iteration follows an evaluate → analyze → improve cycle driven by three observability layers:

Component observability: the NexAU framework decomposes the harness into seven orthogonal, git-tracked file-level components.
Experience observability: an Agent Debugger compresses raw execution traces into layered, sourced reports.
Decision observability: an Evolve Agent proposes evidence-backed edits, each of which is automatically tested against the next iteration's results.

How the Evolution Loop Works

Each iteration runs the current code_agent over a dataset inside isolated E2B sandboxes and writes per-task artifacts: a full step-level trace, a runtime log, and a pass/fail outcome. The Agent Debugger then compresses those traces, which routinely exceed 10 million tokens, into cross-task and per-task analysis reports. The Evolve Agent reads those digests, proposes targeted edits to the seven harness components inside workspace/, and must record four fields for every change: failure evidence, root cause, targeted fix, and predicted impact. Edits whose predictions do not hold on the next evaluation are rolled back or revised. The loop terminates when a target pass rate or the maximum iteration count is reached.
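
The control flow described above can be sketched in a few lines of Python. This is a simplified illustration, not the code in evolve.py: the names ChangeRecord, LoopResult, run_loop, evaluate, and propose are hypothetical, the Agent Debugger's analysis step is folded into propose, and "the prediction holds" is reduced to "the pass rate went up", which is an assumption made only to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChangeRecord:
    """The four fields the text says accompany every harness edit."""
    failure_evidence: str
    root_cause: str
    targeted_fix: str
    predicted_impact: str


@dataclass
class LoopResult:
    pass_rates: list[float] = field(default_factory=list)
    kept: list[ChangeRecord] = field(default_factory=list)
    rolled_back: list[ChangeRecord] = field(default_factory=list)


def run_loop(
    evaluate: Callable[[list[str]], float],
    propose: Callable[[list[str], float], ChangeRecord],
    target_pass_rate: float,
    max_iterations: int,
) -> LoopResult:
    """Evaluate, propose one edit, re-evaluate, then keep or roll back."""
    result = LoopResult()
    harness: list[str] = []  # hypothetical stand-in for the files in workspace/
    best_rate = evaluate(harness)  # baseline run before any edit
    result.pass_rates.append(best_rate)

    for _ in range(max_iterations):
        if best_rate >= target_pass_rate:
            break  # target reached, stop evolving
        change = propose(harness, best_rate)
        candidate = harness + [change.targeted_fix]
        rate = evaluate(candidate)
        result.pass_rates.append(rate)
        if rate > best_rate:
            # Assumed test of the prediction: the edit improved the pass rate.
            harness, best_rate = candidate, rate
            result.kept.append(change)
        else:
            # Prediction falsified: discard the candidate harness.
            result.rolled_back.append(change)
    return result
```

In this sketch a rolled-back edit simply leaves the previous harness in place; the source also allows a failed edit to be revised rather than discarded, which the example does not model.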

Benchmark Results and Transfer

The repository states that across ten iterations, AHE lifts Terminal-Bench 2 pass@1 from 69.7% to 77.0% on GPT-5.4, surpassing both the hand-written Codex baseline (71.9%) and the self-evolving ACE and TF-GRPO baselines. The repository also reports that AHE on GPT-5.5 reached 84.7% on the Terminal-Bench 2.0 leaderboard (ranked #3 as of 2026-05-15, per the README). The evolved harness is described as transferring without re-evolution to SWE-bench-Verified and to four alternate base models, which the authors interpret as evidence that the evolved components encode general engineering experience rather than benchmark-specific tuning.
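
To put the GPT-5.4 figures in perspective, the short calculation below derives the gains implied by the three quoted pass@1 numbers. Only 69.7, 77.0, and 71.9 come from the repository; the derived values are plain arithmetic on them and are not results reported by the authors. The function names are illustrative.

```python
# pass@1 on Terminal-Bench 2 with GPT-5.4, in percent, as quoted from the repository.
BASELINE = 69.7  # harness before evolution
EVOLVED = 77.0   # harness after ten AHE iterations
CODEX = 71.9     # hand-written Codex baseline


def point_gain(before: float, after: float) -> float:
    """Absolute improvement in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Improvement as a percentage of the starting pass rate."""
    return round((after - before) / before * 100, 1)


def failure_reduction(before: float, after: float) -> float:
    """Share of previously failing tasks that now pass, in percent."""
    return round((after - before) / (100 - before) * 100, 1)


print(point_gain(BASELINE, EVOLVED))         # 7.3
print(point_gain(CODEX, EVOLVED))            # 5.1
print(relative_gain(BASELINE, EVOLVED))      # 10.5
print(failure_reduction(BASELINE, EVOLVED))  # 24.1
```

Read this way, the 7.3-point gain corresponds to roughly a quarter of the originally failing tasks being solved after evolution.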

Architecture and Key Components

evolve.py — main-loop orchestrator
agents/code_agent_simple/ — the coding agent under evolution
agents/evolve_agent/ — the meta-agent that performs the improvement step, built on the NexAU framework
agents/explore_agent/ — upstream dataset and source-code exploration agent
configs/ — base.yaml shared defaults plus per-experiment overlays
NexAU — the underlying component framework that exposes the seven harness files as git-tracked units
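
The "shared defaults plus per-experiment overlays" layout under configs/ suggests a recursive merge in which overlay values win. The sketch below shows one common way to do that with plain dictionaries standing in for parsed YAML. It is an assumption about the behavior, not AHE's actual loader, and every key in the example (model, name, temperature, max_iterations, skip_eval) is hypothetical.

```python
from copy import deepcopy


def merge_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values override base values.

    Nested dicts are merged key by key; any other value type is replaced.
    Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of base.yaml and one experiment overlay.
base = {"model": {"name": "gpt-5.4", "temperature": 0.0}, "max_iterations": 10}
overlay = {"model": {"name": "gpt-5.5"}, "skip_eval": True}

config = merge_overlay(base, overlay)
print(config)
```

Merging nested sections instead of replacing them lets an overlay change a single field, here the model name, while inheriting every other default.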

The system requires Python ≥ 3.13, uv for dependency management, tmux for session management, an LLM API endpoint, an E2B sandbox API key (SaaS or self-hosted), and a Serper API key for web search used by the Evolve Agent.
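
A small preflight script can confirm these prerequisites before a long run starts. The sketch below uses only the Python standard library. The environment variable names E2B_API_KEY and SERPER_API_KEY are assumptions about how AHE reads its keys, and the LLM endpoint is left unchecked because the source does not name the variable that configures it.

```python
import os
import shutil
import sys

MIN_PYTHON = (3, 13)
REQUIRED_TOOLS = ("uv", "tmux")
# Assumed variable names; check the repository's README for the real ones.
REQUIRED_ENV = ("E2B_API_KEY", "SERPER_API_KEY")


def missing_prerequisites(env=None, which=shutil.which, version=None) -> list[str]:
    """Return a human-readable list of unmet prerequisites (empty if all are met).

    env, which, and version can be injected so the check is easy to test.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)

    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python >= 3.13 required, found {version[0]}.{version[1]}")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    return problems


if __name__ == "__main__":
    issues = missing_prerequisites()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```

Run as a script, it prints one line per missing item and exits with a non-zero status if anything is missing.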

Update: April–May 2026 Release

The framework was released in April 2026, with the arXiv paper published on April 28, 2026. A blog post on Dawning Road (English and Chinese) followed on April 30, 2026. The leaderboard result (84.7% on Terminal-Bench 2.0 with GPT-5.5, ranked #3) was reported on May 14, 2026. The repository had accumulated 600 stars and 67 forks as of the last recorded update in June 2026. The Agent Debugger component is noted in the README as only partially open-sourced at this time due to company strategy.

Pricing

Full source code access
MIT license
Community support via GitHub Issues

Capabilities

Key Features

Observability-driven automatic harness evolution
Three-layer observability: component, experience, and decision
NexAU integration for seven orthogonal git-tracked harness components
Agent Debugger compresses multi-million-token traces into sourced reports
Evolve Agent proposes evidence-backed, falsifiable edits
Evaluate → Analyze → Improve outer loop
E2B sandbox isolation for every rollout (SaaS and self-hosted)
Cross-model harness transfer without re-evolution
Configurable base + experiment overlay YAML system
tmux-based session management for long-running experiments
Resume interrupted experiments from any iteration
Skip-eval mode for debugging the evolve step in isolation
Batch experiment launching
Feishu webhook notifications for experiment milestones
Web search integration via Serper API for Evolve Agent

Integrations

NexAU

E2B

Serper

Langfuse

Feishu

OpenAI GPT-5.4

OpenAI GPT-5.5

Terminal-Bench 2

SWE-bench-Verified

harbor-datasets

API Available
