    DevOps for Context Engineering: Why AI Coding Agents Need a Context Development Lifecycle

Sam Moore
May 3, 2026 · Senior Software Engineer

Patrick Debois coined "DevOps" in 2009 by asking a simple question: What if ops looked more like dev? At AI Engineer 2026, he asked a structurally similar question about the prompts, rules, and memory files that now drive coding agents: What if context had its own development lifecycle?

    His answer, delivered in a talk titled "Context Is the New Code," is that it should, and that most teams are nowhere close to treating it that way.

    The Core Argument

    Debois's premise is straightforward. When you work with a coding agent today, you are not primarily writing code. You are writing context: system prompts, agent.md files, skill definitions, documentation references, spec-driven plans. The agent generates the code from that context. Which means the quality of what you ship is increasingly a function of the quality of what you feed the model, not the quality of what you type into an editor.

    That shift has not been matched by a corresponding shift in engineering practice. Code has version control, code review, unit tests, CI/CD pipelines, and production observability. Context, for most teams, has none of that. It gets copy-pasted, tweaked ad hoc, and shipped with roughly the same rigor as a sticky note.

    Debois proposes a Context Development Lifecycle with four phases: Generate, Evaluate, Distribute, and Observe. The framing here is deliberately parallel to the software development lifecycle.

    Generate

The generation phase covers everything teams already do, often without recognizing it as a discipline. Writing a system prompt generates context. Pulling in library documentation to stop the model from hallucinating against an outdated API generates context. Defining a reusable skill that instructs an agent to detect a project's package manager, then its ecosystem, and then walk through the setup steps generates context that replaces what would otherwise be substantial conditional code.

    Debois's point here is that code itself is increasingly being converted back into context. A workflow that used to require branching logic across multiple files can now be expressed as a skill: a reusable, distributable unit of instructions. That is a meaningful change in what "writing software" means.
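To make that concrete, here is a minimal sketch of what such a skill might look like, assuming a markdown file with YAML frontmatter; the exact layout varies by agent, and everything here is illustrative:

```markdown
---
name: project-setup
description: Detect the project's package manager and ecosystem, then walk
  through its setup steps.
---

1. Look for lockfiles in the repository root (package-lock.json, yarn.lock,
   pnpm-lock.yaml, Cargo.lock, poetry.lock) to detect the package manager.
2. Infer the ecosystem (Node, Rust, Python, ...) from the package manager.
3. Install dependencies with the detected tool, then follow the project's
   documented setup steps in order.
```

The branching logic that used to live in scripts becomes instructions the agent interprets per project.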

    Evaluate

    This is where most teams have the largest gap, and where Debois spends the most time.

    The analogy he reaches for is a linter. When you change two lines in your agent.md, do you know what the impact will be? For most teams, the honest answer is no. You ship it and see what happens. That is the context equivalent of deploying untested code.

    He describes a progression of context testing that maps loosely onto familiar code-testing concepts. The simplest level is structural validation: Does a skill definition have a description? Is it within the required length? That is a linter for context format. The next level is semantic validation: given this context, can the model actually understand what it is supposed to do? You can ask the model itself to evaluate whether the instructions are complete and explicit enough, which Debois compares to a Grammarly for prompts.
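A minimal sketch of the structural level, assuming a hypothetical SKILL.md layout with YAML frontmatter; the field names and the length limit are illustrative, not a real spec:

```python
# Structural validation sketch: a "linter" for context format.
# Assumes skills are markdown files with YAML frontmatter (an assumption).
import re
import sys

import yaml  # pip install pyyaml

MAX_DESCRIPTION_CHARS = 1024  # illustrative limit


def lint_skill(path: str) -> list[str]:
    """Return a list of structural problems found in a skill file."""
    text = open(path, encoding="utf-8").read()

    # Expect YAML frontmatter delimited by '---' lines at the top of the file.
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]

    meta = yaml.safe_load(match.group(1)) or {}
    problems = []
    if not meta.get("name"):
        problems.append("frontmatter has no 'name' field")
    if not meta.get("description"):
        problems.append("frontmatter has no 'description' field")
    elif len(meta["description"]) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds the assumed length limit")
    return problems


if __name__ == "__main__":
    issues = lint_skill(sys.argv[1])
    for issue in issues:
        print(f"LINT: {issue}")
    sys.exit(1 if issues else 0)
```

The semantic level swaps these string checks for a model call that judges whether the instructions are complete and explicit enough.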

    More interesting is what he calls the unit-test equivalent. If your agent.md specifies that every API endpoint must use a particular URL prefix, you can write a test that generates an endpoint, then asks an LLM judge whether the generated code follows that rule. Without the context, no model will ever apply your team-specific convention. With the context, the test tells you whether it is working. Change the model, change the prompt, run the suite, and you know what broke.
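A sketch of what that test might look like, assuming an OpenAI-compatible client; the rule, the prompts, and the model name are all illustrative:

```python
# Unit-test equivalent: generate code with the team context loaded, then ask
# an LLM judge whether the output follows a team-specific rule.
from openai import OpenAI  # pip install openai

client = OpenAI()
RULE = "Every API endpoint path must start with /api/v2/."  # example rule


def generate_endpoint(context: str) -> str:
    """Generate an endpoint with the team context in the system prompt."""
    resp = client.chat.completions.create(
        model="gpt-4.1",  # illustrative; swap models and rerun the suite
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": "Add a REST endpoint that lists users."},
        ],
    )
    return resp.choices[0].message.content


def judge_follows_rule(code: str) -> bool:
    """Second model call as the judge: does the code follow the rule?"""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": f"Rule: {RULE}\n\nCode:\n{code}\n\n"
                       "Does the code follow the rule? Answer YES or NO.",
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


def test_url_prefix_convention():
    context = open("agent.md", encoding="utf-8").read()
    assert judge_follows_rule(generate_endpoint(context))
```

Run it once without the context file loaded and the assertion should fail, which is exactly the point: the convention lives in the context, not the model.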

    The integration-test equivalent goes further: give the judge a tool, let it execute the generated code in a sandbox, and verify the endpoint actually responds correctly. The LLM is no longer just reading files; it is running curl commands and checking real behavior.
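Under the same assumptions, a sketch of that step: write the generated code out, run it in a throwaway process (a stand-in for a real sandbox), and check the live response:

```python
# Integration-test equivalent: execute the generated endpoint and verify
# real behavior. The file name, port, and path are illustrative.
import subprocess
import time
import urllib.request


def test_endpoint_responds(generated_code: str):
    with open("sandbox_app.py", "w", encoding="utf-8") as f:
        f.write(generated_code)

    # Launch the generated app; a real harness would use an actual sandbox.
    proc = subprocess.Popen(["python", "sandbox_app.py"])
    try:
        time.sleep(2)  # crude wait for the server to come up
        with urllib.request.urlopen("http://localhost:8000/api/v2/users") as r:
            assert r.status == 200
    finally:
        proc.kill()
```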

    One important caveat Debois flags: evals are nondeterministic. Running a suite once and treating the result as a pass/fail gate will drive you mad. His practical suggestion is to run each test multiple times and track a pass rate. He frames this using the concept of error budgets: some tests you require to pass nearly every time; others you tolerate more variance in. That is a different mental model from traditional CI, but it's workable.
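A sketch of that pattern, reusing the hypothetical test_url_prefix_convention from the unit-test sketch above; the run count and budget are illustrative:

```python
# Error-budget runner: execute a nondeterministic eval N times and gate on a
# pass rate instead of a single pass/fail.
def run_with_budget(test_fn, runs: int = 10, required_pass_rate: float = 0.9) -> bool:
    passes = 0
    for _ in range(runs):
        try:
            test_fn()  # any eval written as an assert-style test
            passes += 1
        except AssertionError:
            pass
    rate = passes / runs
    print(f"{test_fn.__name__}: {rate:.0%} pass rate (budget {required_pass_rate:.0%})")
    return rate >= required_pass_rate


# Strict budget for hard rules, looser for stylistic conventions.
assert run_with_budget(test_url_prefix_convention, required_pass_rate=0.9)
```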

    Distribute

    Once context is tested, it needs to move. Checking an agent.md into a repository is the simplest form of distribution: colleagues pull it, zero friction. But Debois argues the natural next step is packaging.

    If a team has a reusable skill for, say, setting up a new frontend project, that skill should be installable across multiple projects the way a library is. Skills can contain instructions, scripts, and documents. A registry lets teams discover what packages exist. Debois notes that public registries like the Tessl marketplace already exist, and that most of what is in them is low quality by any reasonable eval standard, but that the pattern itself is sound and will improve.

    He also raises the issue of dependency management, with appropriate grimness: context packages will have conflicts, just as code packages do. A React context package and a general frontend guidelines package may contradict each other. Dependency hell is coming for context, and teams should plan for it.

    Security follows naturally from distribution. When installing context from external sources, you need to know who built it, which model was used to generate it, and whether it contains credential-leakage or prompt-injection risks. Debois points to Snyk's context-scanning work as an early example of the tooling that will be needed, and draws a parallel to software bills of materials: an AI SBOM for packaged context.
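As a sketch of what one entry in such an AI SBOM might record; the field set is an assumption, since no standard is named in the talk:

```python
# Hypothetical provenance record for one packaged context artifact.
from dataclasses import dataclass, field


@dataclass
class ContextSBOMEntry:
    package: str                  # e.g. "frontend-setup-skill"
    version: str
    author: str                   # who built the context
    generating_model: str | None  # which model produced it, if generated
    scans_passed: list[str] = field(default_factory=lambda: [
        "credential-leakage",     # no secrets embedded in the context
        "prompt-injection",       # no instructions that hijack the agent
    ])
```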

    Observe

    The observe phase closes the loop. When a skill is being used by other developers or other teams, how do you know if it is still working?

    Debois's answer is agent logs. When a coding agent fails to do what a developer expected, that failure is recorded. At the team or organizational scale, you can aggregate those logs, surface patterns of missing context, and use that signal to improve the shared context library. Fix the context once, and the improvement propagates to everyone using that skill.
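A sketch of that aggregation, assuming agent failures land in a JSON-lines log with outcome and failure_reason fields (the log format is an assumption):

```python
# Surface the most common failure patterns across a team's agent logs.
import json
from collections import Counter


def top_failure_patterns(log_path: str, n: int = 5) -> list[tuple[str, int]]:
    reasons: Counter[str] = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("outcome") == "failure":
                reasons[event.get("failure_reason", "unknown")] += 1
    return reasons.most_common(n)


# The most frequent reasons point at where the shared context library is thin.
for reason, count in top_failure_patterns("agent_logs.jsonl"):
    print(f"{count:4d}  {reason}")
```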

    He extends this to production. Code generated from context runs in production. When it fails there, that failure is also a signal. He describes tooling that instruments generated code, captures failures with their inputs and outputs, and automatically proposes test cases so the same failure does not recur. The feedback loop runs from production back to context.

    Why This Framing Matters

    The DevOps analogy is not accidental. When Debois made the case for DevOps, the insight was not technical; it was organizational. Ops and dev were doing related work with incompatible practices, and the gap was costing everyone. The fix was not a new tool; it was a shared discipline.

The same dynamic is playing out with context. Individual developers are already honing their prompts and agent.md files. But they are doing it in isolation, without shared standards, without tests, without observability. The gap between what a skilled individual can do with well-crafted context and what a team can do without any context engineering practice is large and growing.

Birgitta Böckeler's harness engineering framework, published on martinfowler.com, covers adjacent ground from the perspective of feedback controls and feedforward guides around coding agents. Anthropic's own engineering blog on Managed Agents describes how harness assumptions go stale as models improve, and how decoupling components allows practices to evolve without breaking everything. Both pieces reinforce the same underlying point: the infrastructure around the model matters as much as the model itself, and that infrastructure needs to be engineered deliberately.

    What Teams Should Do Now

    The practical takeaway from Debois's talk is not a tool recommendation. It is a posture shift. A few concrete starting points:

    • Treat your agent.md or AGENTS.md as a first-class artifact: version it, review changes to it, and know what it does.
    • Write at least one eval before you ship a context change that affects team-wide behavior. Even a simple LLM-as-judge check that your conventions are being followed is better than nothing.
    • When a context rule breaks, fix the context, not just the output. The next agent run will hit the same problem if you do not.
    • If you are managing context across multiple projects, think about packaging and distribution now, before you end up with ten copies of the same skill drifting apart.

    The models will keep improving. The context engineering practice is what teams can actually control.

    References

• anthropic.com: https://www.anthropic.com/engineering/managed-agents
• martinfowler.com: https://martinfowler.com/articles/harness-engineering.html
• tessl.io: https://tessl.io/blog/context-development-lifecycle-better-context-for-ai-coding-agents/
• youtu.be: https://youtu.be/bSG9wUYaHWU
    About the Author

Sam Moore

Senior Software Engineer

Hi everyone, I'm a vibe coder and a software enthusiast. Hit me up with any questions on vibe coding tools.

Tagged in: Anthropic, Inc. · #Context Engineering
