EveryDev.ai
    Agentic Harness Engineering

    Agent Harness

    An open-source observability system that automatically evolves coding-agent harnesses—system prompts, tools, middleware, skills, and memory—without changing the base model.

    At a Glance

    Pricing
    Open Source

    Fully open-source under the MIT license, free to use, modify, and distribute.

    Available On

    macOS
    Linux
    API
    CLI

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    Agent Harness, Agent Frameworks, LLM Evaluations

    Alternatives

    harness-kit, Gambit, AutoAgent

    Developer

    china-qijizhifeng
    china-qijizhifeng is the GitHub organization behind Agentic…

    Listed Jun 2026

    About Agentic Harness Engineering

    Agentic Harness Engineering (AHE) is an open-source framework published on GitHub under the MIT license by Jiahang Lin and the china-qijizhifeng organization. It introduces an observability-driven loop that automatically improves the components wrapping a coding agent—system prompts, tool descriptions, tool implementations, middleware, skills, sub-agents, and long-term memory—while keeping the base model frozen. The accompanying arXiv paper (arXiv:2604.25850, released April 2026) documents the methodology and benchmark results.

    What It Is

    AHE is a meta-engineering framework for coding agents. Rather than fine-tuning or retraining a model, it treats the harness around the model as the artifact to evolve. Each outer iteration follows an evaluate → analyze → improve cycle driven by three observability layers:

    • Component observability: the NexAU framework decomposes the harness into seven orthogonal, git-tracked file-level components.
    • Experience observability: an Agent Debugger compresses raw execution traces into layered, sourced reports.
    • Decision observability: an Evolve Agent proposes evidence-backed edits whose predictions are automatically tested against the next iteration's results.

    How the Evolution Loop Works

    Each iteration runs the current code_agent over a dataset inside isolated E2B sandboxes and writes per-task artifacts: a full step-level trace, a runtime log, and a pass/fail outcome. The Agent Debugger then compresses those traces—routinely exceeding 10 million tokens—into cross-task and per-task analysis reports. The Evolve Agent reads those digests, proposes targeted edits to the seven harness components inside workspace/, and must commit four fields for every change: failure evidence, root cause, targeted fix, and predicted impact. Predictions that do not hold on the next evaluation are rolled back or revised. The loop terminates when a target pass rate or maximum iteration count is reached.
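The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's `evolve.py`: the `ChangeRecord` class, the `Harness` type, the `evolve` function, and the rule "keep an edit only if the pass rate rises" are all assumptions made for the example. The real system runs rollouts in E2B sandboxes and uses LLM agents for the analyze and improve steps, which are represented here as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChangeRecord:
    """The four fields the text says every harness edit must carry."""
    failure_evidence: str
    root_cause: str
    targeted_fix: str
    predicted_impact: float  # hypothetical encoding: expected pass-rate gain


# Hypothetical representation: component name -> file contents.
Harness = dict[str, str]


def evolve(
    harness: Harness,
    evaluate: Callable[[Harness], float],
    analyze: Callable[[Harness, float], str],
    improve: Callable[[Harness, str], tuple[Harness, ChangeRecord]],
    target_pass_rate: float,
    max_iterations: int,
) -> tuple[Harness, float, list[tuple[ChangeRecord, bool]]]:
    """Run evaluate -> analyze -> improve until a stop condition is met."""
    history: list[tuple[ChangeRecord, bool]] = []
    pass_rate = evaluate(harness)
    for _ in range(max_iterations):
        if pass_rate >= target_pass_rate:
            break  # target reached
        report = analyze(harness, pass_rate)
        candidate, record = improve(harness, report)
        candidate_rate = evaluate(candidate)
        # Assumed acceptance rule: the prediction "holds" if the rate improves.
        held = candidate_rate > pass_rate
        history.append((record, held))
        if held:
            harness, pass_rate = candidate, candidate_rate
        # Otherwise the edit is rolled back: the previous harness is kept.
    return harness, pass_rate, history
```

Passing the three steps in as callables keeps the control flow (stop conditions, accept or roll back) separate from how evaluation and analysis are actually performed.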

    Benchmark Results and Transfer

    The repository states that across ten iterations, AHE lifts Terminal-Bench 2 pass@1 from 69.7% to 77.0% on GPT-5.4, surpassing the hand-written Codex baseline (71.9%) and self-evolving ACE and TF-GRPO baselines. The repository also reports that AHE on GPT-5.5 reached 84.7% on the Terminal-Bench 2.0 leaderboard (ranked #3 as of 2026-05-15, per the README). The evolved harness is described as transferring without re-evolution to SWE-bench-Verified and to four alternate base models, which the authors interpret as evidence that the evolved components encode general engineering experience rather than benchmark-specific tuning.
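To put the quoted figures in perspective, the short calculation below derives the absolute and relative gains from the numbers stated above. The helper names `point_gain` and `relative_gain` are made up for this example; only the three percentages come from the listing.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting value, in percent, rounded to one decimal."""
    return round(100 * (after - before) / before, 1)


# Terminal-Bench 2 pass@1 figures on GPT-5.4, as quoted above.
baseline, evolved, codex = 69.7, 77.0, 71.9

print(point_gain(baseline, evolved))     # 7.3 points over the starting harness
print(relative_gain(baseline, evolved))  # 10.5 percent relative improvement
print(point_gain(codex, evolved))        # 5.1 points over the Codex baseline
```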

    Architecture and Key Components

    • evolve.py — main-loop orchestrator
    • agents/code_agent_simple/ — the coding agent under evolution
    • agents/evolve_agent/ — the meta-agent that performs the improvement step, built on the NexAU framework
    • agents/explore_agent/ — upstream dataset and source-code exploration agent
    • configs/ — base.yaml shared defaults plus per-experiment overlays
    • NexAU — the underlying component framework that exposes the seven harness files as git-tracked units
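The `configs/` entry above describes shared defaults plus per-experiment overlays. One common way to implement that pattern is a recursive dictionary merge, sketched below. The listing does not document how AHE actually merges its YAML files, so the `merge_overlay` function and every key and value in the example are hypothetical.

```python
from copy import deepcopy
from typing import Any


def merge_overlay(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where overlay values override base values.

    Nested dicts are merged key by key; any other value replaces the
    base value outright. Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for a parsed base.yaml and an experiment overlay.
base = {"model": {"name": "base-model", "temperature": 0.0}, "max_iterations": 10}
experiment = {"model": {"name": "other-model"}, "max_iterations": 3}

config = merge_overlay(base, experiment)
print(config)
```

With this approach an experiment file only needs to list the settings it changes; everything else falls through from the shared defaults.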

    The system requires Python ≥ 3.13, uv for dependency management, tmux for session management, an LLM API endpoint, an E2B sandbox API key (SaaS or self-hosted), and a Serper API key for web search used by the Evolve Agent.

    Update: April–May 2026 Release

    The framework was released in April 2026, with the arXiv paper published on April 28, 2026. A blog post on Dawning Road (English and Chinese) followed on April 30, 2026. The leaderboard result (84.7% on Terminal-Bench 2.0 with GPT-5.5, ranked #3) was reported on May 14, 2026. The repository had accumulated 600 stars and 67 forks as of the last recorded update in June 2026. The Agent Debugger component is noted in the README as only partially open-sourced at this time due to company strategy.

    Pricing

    Open Source

    • Full source code access
    • MIT license
    • Community support via GitHub Issues

    Capabilities

    Key Features

    • Observability-driven automatic harness evolution
    • Three-layer observability: component, experience, and decision
    • NexAU integration for seven orthogonal git-tracked harness components
    • Agent Debugger compresses multi-million-token traces into sourced reports
    • Evolve Agent proposes evidence-backed, falsifiable edits
    • Evaluate → Analyze → Improve outer loop
    • E2B sandbox isolation for every rollout (SaaS and self-hosted)
    • Cross-model harness transfer without re-evolution
    • Configurable base + experiment overlay YAML system
    • tmux-based session management for long-running experiments
    • Resume interrupted experiments from any iteration
    • Skip-eval mode for debugging evolve step in isolation
    • Batch experiment launching
    • Feishu webhook notifications for experiment milestones
    • Web search integration via Serper API for Evolve Agent

    Integrations

    NexAU
    E2B
    Serper
    Langfuse
    Feishu
    OpenAI GPT-5.4
    OpenAI GPT-5.5
    Terminal-Bench 2
    SWE-bench-Verified
    harbor-datasets
    uv
    API Available

    Developer

    china-qijizhifeng

    china-qijizhifeng is the GitHub organization behind Agentic Harness Engineering (AHE), an open-source framework for observability-driven automatic evolution of coding-agent harnesses. The project is authored by Jiahang Lin and released under the MIT license. AHE integrates with the NexAU component framework and E2B sandbox infrastructure to enable iterative, evidence-backed improvement of LLM agent harnesses without retraining the base model.

    1 tool in directory

    Similar Tools

    harness-kit

    A Python toolkit for building and evaluating AI agent harnesses, enabling structured testing and benchmarking of LLM-based agents.

    Gambit

    Gambit is an open-source agent harness framework by Bolt Foundry for building, running, and verifying LLM workflows using typed decks.

    AutoAgent

    An autonomous agent harness engineering tool that lets an AI meta-agent iteratively build, benchmark, and optimize agent configurations overnight without human intervention.

    Related Topics

    Agent Harness

    Infrastructure, orchestrators, and task runners that wrap around LLM coding agents — covering session management, context delivery, worktree isolation, architecture enforcement, and issue-to-PR pipelines.

    100 tools

    Agent Frameworks

    Tools and platforms for building and deploying custom AI agents.

    415 tools

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    91 tools