    Five Claude Code Frameworks Compared: When to Use Each, When to Use None

Joe Seifi · May 2, 2026 · Founder at EveryDev.ai

    Something happened in the Claude Code ecosystem in 2026 that nobody fully predicted. A category emerged. Not Claude Code itself, the official Anthropic CLI, but the layer above it: opinionated open-source frameworks that turn Claude Code from a CLI into a methodology. Five have crossed the serious adoption threshold, and each is loud enough that you have probably bookmarked at least one.

    Each one comes from a person whose name you may already recognize. Jesse Vincent (obra) spent 30 years managing junior engineers and built Superpowers to apply the same management discipline to AI coding agents. Garry Tan, the CEO of Y Combinator, open-sourced gstack as his "open source software factory" for solo founders. TÂCHES, who self-describes as "a solo developer who doesn't write code -- Claude Code does," built GSD ("Get Shit Done") and published a four-hour-and-forty-eight-minute creator video that 51,000 people sat through. ruvnet (Reuven Cohen), founder of the Agentics Foundation and a multi-agent systems researcher publishing on swarm orchestration since 2021, built Claude Flow (now Ruflo), an enterprise orchestration platform with a 60-agent hive-mind and Byzantine consensus protocols. Brian Madison, currently a senior engineering manager at Extend and leading their AI SDLC transformation, built the BMAD Method, whose official masterclass has crossed 292,000 views.

Each creator will tell you, on camera, that their framework is the one to use. The interesting question is which one fits the project in front of you, because the five do not optimize for the same thing. After going through every README and the transcripts of the most-cited workflow videos, the picture is clearer than the marketing implies: each framework wins on one project type and breaks under another. This is a field guide to picking, including the case for not picking. Every claim below links to its source, usually a deep link to the exact second of the YouTube video where the quote happens, so you can verify any specific point in five seconds.

    The three problems every framework is secretly solving

    Strip away the marketing copy from all five READMEs, and you find them grinding away at the same three issues. Naming the issues first will make the framework choices below easier to read.

    Context rot. A long Claude Code session degrades. Earlier instructions get blurred, the agent loses track of the plan, and by hour two, the answers are subtly wrong in ways the agent does not notice. Frameworks fix this by either clearing context aggressively (subagents, sharded docs, parallel waves) or making context durable (persistent SQLite, named state files).
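The "make context durable" half of that fix is concrete enough to sketch. Below is a minimal, invented example of the pattern in Python, not any framework's actual schema: facts written to SQLite survive the end of a session, so a fresh context can reload state instead of re-deriving it.

```python
import sqlite3

# Minimal sketch of durable agent memory (invented schema, not any
# framework's actual storage). Facts persisted here outlive the session.
conn = sqlite3.connect("agent_memory.db")
conn.execute("CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)")

def remember(key: str, value: str) -> None:
    """Persist a fact so it survives the end of the session."""
    conn.execute(
        "INSERT INTO memory (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )
    conn.commit()

def recall(key: str) -> str | None:
    """Reload a fact in a fresh session instead of re-deriving it."""
    row = conn.execute("SELECT value FROM memory WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

remember("plan.current_phase", "implement auth middleware")
```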

    Planning before coding. Vibe coding produces a tweet-sized demo and then collapses. The fix is some forced artifact before any line of code: a spec, a PRD, a story, a structured brainstorming pass, a business-value interrogation. The differences between frameworks mostly concern which artifact is used and how heavyweight it is.

    Verification that holds up. This is the QA agent saying "Perfect implementation" on a broken app. Frameworks try to fix it with tests-first discipline (TDD), real-browser QA against running applications, or human-in-the-loop UAT checkpoints. None of the fixes is bulletproof, but they fail in different ways.
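To make that third failure mode concrete: checklist verification reasons about artifacts, while verification that holds up boots the application and hits it. Here is a minimal sketch of the latter, with a placeholder start command and URL; the frameworks below implement much richer versions of this idea, or fail to.

```python
import subprocess
import time
import urllib.error
import urllib.request

# Sketch of verification that actually runs the app, as opposed to checking
# that the file structure looks right. Command and URL are placeholders.
def smoke_test(start_cmd: list[str], url: str, timeout: float = 30.0) -> bool:
    proc = subprocess.Popen(start_cmd)
    try:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    return resp.status == 200
            except urllib.error.HTTPError:
                return False  # server is up but the page is broken (e.g. a 404)
            except OSError:
                time.sleep(0.5)  # server not up yet; retry
        return False
    finally:
        proc.terminate()

# e.g. smoke_test(["npm", "run", "dev"], "http://localhost:3000/")
```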

The reason these three problems matter together is a workflow pattern that has acquired a name on developer Twitter in the last few months: AFK (away from keyboard). The premise is that you stop sitting at the terminal approving every permission prompt and instead kick off agents that work while you do something else. Matt Pocock dedicates a whole framework, Sandcastle, to making this pattern viable. Each of the five frameworks below enables some version of the AFK loop. The choice between them is mostly a choice about how much trust you can hand the agent before you walk away.

The cleanest way to compare the five frameworks is to look at what each one does about each problem, and what it trades away to do it. This is the table I would draw on a whiteboard if I were standing next to you, so I will draw it for you.

| Framework | Vitals | Context strategy | Planning artifact | Verification |
| --- | --- | --- | --- | --- |
| Superpowers | 176K stars · 16K forks · 7 months old · largest by stars | Subagent dispatch (fresh context per task) | Brainstorm + spec | TDD iron law (tests before code) |
| gstack | 88K stars · 13K forks · 7 weeks old · fastest growing | Short focused docs | /office-hours business interrogation | Persistent Chromium QA via /browse |
| GSD | 59K stars · 5K forks · 5 months old · steady growth | "Deliberately short" state northstar | Parallel-wave plan | Mandatory human UAT |
| Claude Flow / Ruflo | 35K stars · 4K forks · 11 months old · slow burn | Hive-mind with shared SQLite memory | Queen-led decomposition | Byzantine consensus across worker votes |
| BMAD | 46K stars · 5K forks · 13 months old · oldest project | Sharded PRD and architecture files | Full Agile pipeline (analyst -> PM -> architect -> SM -> dev -> QA) | QA agent role |

    The rest of this post is the manager-of-the-same-junior-engineer view of those five rows. That metaphor is not mine. It comes from Jesse Vincent, the creator of Superpowers, who built his framework to mirror, in his own words, "managing junior engineers over 30 years." It is also the most useful frame I found for picking between the five. Each framework is a different management philosophy applied to the same junior engineer (Claude Code). None of them is wrong. They fit different teams.

    Superpowers: the test lead who will not let you skip TDD

    Vincent's pitch is the most aggressive in the field. In the Larridin interview, he says, on camera, that "specs are the thing that matters now" and "the code does not matter anymore." A reasonable read of that is: the human stops typing syntax and starts reviewing design documents.

    The mechanism is what reviewers call Vincent's "iron law": no production code without a failing test first. Layered on top is a brainstorm -> plan -> implement workflow where every task gets dispatched to a fresh subagent, so the parent conversation does not accumulate rot. Notably, the using-superpowers skill lists "The skill is overkill" as a thought pattern to resist, not embrace -- Vincent's argument is that simple things become complex, and the discipline is worth it anyway. That stance matters for what comes next.
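The iron law is an instruction to the model, not a mechanical gate, and that distinction becomes important in the TDD-evasion stories below. A mechanical version is easy to imagine, though. This invented sketch, which is not Superpowers' code, refuses to let the implementation step run until the test suite is red:

```python
import subprocess

# Invented sketch of a mechanical "iron law" gate; Superpowers enforces the
# rule through instructions, not through a script like this.
def tests_are_red() -> bool:
    """Return True if the suite currently has at least one failing test."""
    result = subprocess.run(["pytest", "--quiet"], capture_output=True)
    # pytest exits non-zero on failures (a real gate would also distinguish
    # "tests failed" from "no tests collected").
    return result.returncode != 0

def gate_implementation_step() -> None:
    if not tests_are_red():
        raise RuntimeError("Iron law: write a failing test before production code.")
    # ...only now would the implementation subagent be dispatched
```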

    What people built with it. Eric Tech adds a Google Drive resync feature to a real SaaS app called Book Zero. Rob Shocks builds an AI-powered slide generator with Next.js and Tailwind. Alex Followell rebuilds a Notion-like web app with dynamic routing. The pattern across all three videos is the same. They invoke /brainstorm to surface risks, then write-plans to generate the task list, then watch fresh subagents grind through the items one at a time. The Visual Companion feature, which displays multiple UI mockup variants in a browser before any logic is written, is praised in nearly every review.

    What worked. TDD produced what one head-to-head reviewer called "high quality and very robust code" that worked on the first try. The subagent-per-task pattern kills context rot; sessions that should have collapsed at hour two stayed coherent at hour four.

    What broke. Three things, in order of severity.

    The first is latency. In the Chase AI head-to-head, building the same simple app took 20 minutes in vanilla Claude Code and 48-60 minutes in Superpowers. The discipline is not free. If your project is small, the discipline tax is most of the project.

    The second is TDD evasion, which is darkly funny. The "iron law" is not enforceable at the framework level; it is a strong instruction that the agent occasionally ignores. One engineer caught the agent admitting it had "focused on shipping quickly over following the process" and bypassed tests entirely. Iron, as it turns out, bends. Vincent's team has tried to enforce the law at the framework level: #384 "Automatic TDD Skill Enforcement Before Implementation" closed March 10, but new TDD-bypass complaints have been filed since (#853 "Plan did not use TDD requirements" opened March 20, #1248 "TDD-driven refactoring may degrade domain design" opened April 22). The instruction is stronger than it was, but the agent still finds ways around it.

    The third is over-engineering. Dispatching a fresh subagent to add a button is the framework working as designed, but the design is wrong for that scale of task. Most reviewers find this out the hard way and adjust by reaching for vanilla Claude Code on small jobs.

    My take. Superpowers is what you reach for when the cost of a missed edge case is higher than the cost of a slow build. Production agentic platforms where wrong actions cannot be undone. Anything safety-critical. It is also a great training environment for human juniors, because the framework forces them through the steps a senior would have walked them through anyway. For a weekend prototype, it is the wrong tool.


gstack: the YC CEO who interrogates your business case before letting you code

    Garry Tan calls gstack his "open source software factory" and is open about who it is for: solo founders, technical CEOs, lean teams who need to "punch above their weight class." In his own launch video, Tan frames the bet plainly: we are in "the agent era," and "the way to get agents to do real work is the same way humans have always done it -- as a team, with roles, with process, with review." gstack is his implementation of a "thin harness, fat skills" approach -- a 28-command set of skills acting like a team of specialists. The deeper design idea is that AI tools default to what Tan calls a "mushy" generic mode, and that real productivity requires "explicit gears" you can switch between deliberately. Founder mode. Engineering manager mode. Paranoid auditor mode.

    What's distinctive about gstack is that it constrains the AI commercially before it constrains it technically. The signature command is /office-hours, which makes Claude play the role of a YC partner and forces you to justify the actual business value of a feature before any code is written. Tan describes the prompt itself as "the distilled version of thousands and tens of thousands of hours that the 16 YC partners have spent honing" their pitch-coaching pattern. If you cannot explain why a user would care, gstack will not let you build it. That is unusual.

    What people built with it. Better Stack's gstack walkthrough demos the canonical example: "add a feature that takes a screenshot of a tweet from URL," built end-to-end through gstack's pipeline. Eric Tech implements an AI chat query agent for financial data within a production bookkeeping app. The QuantumJump teardown focuses on the most distinctive piece: /browse, which spins up a persistent, headless Chromium daemon that responds in roughly 200 milliseconds, rather than booting Chrome cold for every test. The companion mechanism uses native SQLite to read Chrome's cookie database directly, so the agent can access authenticated endpoints without re-authenticating.
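Both halves of /browse are easier to grasp with a rough sketch. The one below is invented, not gstack's code: Playwright's Python bindings stand in for Tan's CLI-level wrapper, the cookie path is the macOS default, and real Chrome encrypts cookie values, which this sketch ignores.

```python
import os
import sqlite3
from playwright.sync_api import sync_playwright

# Idea 1: launch Chromium once and keep it warm, instead of cold-booting
# a browser for every QA check.
pw = sync_playwright().start()
browser = pw.chromium.launch(headless=True)
page = browser.new_page()

def check(url: str) -> int:
    """Reuse the warm browser; responses come back fast because nothing boots."""
    return page.goto(url).status

# Idea 2: read Chrome's cookie store directly over SQLite so the agent can
# hit authenticated endpoints without re-authenticating.
COOKIE_DB = "~/Library/Application Support/Google/Chrome/Default/Cookies"  # macOS path

def session_cookies(host: str) -> list[tuple[str, str]]:
    db = sqlite3.connect(os.path.expanduser(COOKIE_DB))
    return db.execute(
        "SELECT name, value FROM cookies WHERE host_key LIKE ?", (f"%{host}",)
    ).fetchall()
```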

    The other widely used command is /ship, which automates the six-step bureaucracy of opening a pull request: sync with main, run tests, push the branch, write a description, request a review, and post the link. The QuantumJump teardown breaks Tan's claim of 100 PRs per week over 50 days into 8 specific skills doing the heavy lifting.
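The six steps are mechanical enough that they read as a script. The sketch below is not gstack's implementation; it assumes git plus GitHub's gh CLI, and the reviewer name is a placeholder.

```python
import subprocess

def run(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

run("git", "fetch", "origin", "main")
run("git", "rebase", "origin/main")                    # 1. sync with main
run("pytest")                                          # 2. run tests (project-specific)
run("git", "push", "-u", "origin", "HEAD")             # 3. push the branch
url = run("gh", "pr", "create", "--fill")              # 4. write a description, open the PR
run("gh", "pr", "edit", "--add-reviewer", "teammate")  # 5. request a review (placeholder name)
print(url.strip())                                     # 6. post the link
```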

    What worked. Velocity, as advertised. The /browse daemon is genuinely faster than the alternatives -- Tan explains its origin story bluntly: he wrapped Playwright at the CLI level himself because the existing Chrome MCP integrations were unusable. The /office-hours cognitive gear, demonstrated by Eric Tech, is the most original idea in the framework and probably the most underrated. Tan himself uses the whole stack at scale: "I run 10 to 15 parallel claude code sessions all at the same time", and frames where gstack sits as approaching "level 7 of 8" on his own software-factory scale -- not full autonomy, but enough delegation that the human's bottleneck becomes review, not writing.

    What broke. Three things again.

The first is the sharpest public critique in the field, Mo Bitar's "AI is making CEOs delusional" (691K views), which dismisses gstack as "literally a bunch of markdown files that tell Claude to pretend to be different people" -- a folder of prompts that "every developer who's used Claude code for more than a week has a version of." Whether that disqualifies the framework depends on whether you need engineering or just well-organized prompts. The second is Mo's deeper concern, which is harder to wave away: gstack's polished output makes everyone who uses it feel senior, regardless of whether they are. If you are evaluating gstack for a team with juniors, that is the conversation to have first.

    The third problem is terminal isolation. gstack and Claude Code both live on your local machine, and integrating with external services or live web data requires hand-built bridges. If your work needs the AI to take actions against production systems, gstack will not get you there alone.

    My take. gstack is the right pick when you are running both a company and an engineering team and cannot afford to spend an hour deciding whether to build something. The /office-hours command alone is worth the install. For larger teams with a real engineering culture, Mo Bitar's "everyone has this" critique deserves a sit-down before adoption.

    GSD: the velocity-obsessed startup CTO who hates Jira

    The TÂCHES creator video for GSD ("Get Shit Done") runs four hours and forty-eight minutes, and the tone is set in the first ten. The framework's pitch, from its README, is direct: "I don't want to play enterprise theater. I'm just a creative person trying to build great things that work... No enterprise roleplay bullshit." TÂCHES positions GSD as a "light-weight and powerful meta-prompting, context engineering and spec-driven development system" that "solves context rot — the quality degradation that happens as Claude fills its context window."

    The mechanism is a discuss -> plan -> execute loop with two distinct moves. State is kept in a single northstar-style file rather than the multi-thousand-line PRDs that BMAD produces. Then plans are broken into "parallel waves," where independent tasks fire off as subagents simultaneously, and dependent tasks wait their turn.
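Parallel waves are, in effect, topological levels of the task dependency graph: everything whose dependencies are already done fires at once, then the next level. A small invented sketch of the scheduling idea (the task names are made up, and GSD's actual planner is prompt-driven, not a script):

```python
# Invented sketch of wave scheduling, not GSD's code.
deps: dict[str, set[str]] = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "integration-tests": {"api", "ui"},
}

def waves(deps: dict[str, set[str]]) -> list[list[str]]:
    done: set[str] = set()
    out: list[list[str]] = []
    while len(done) < len(deps):
        wave = [t for t, d in deps.items() if t not in done and d <= done]
        if not wave:
            raise ValueError("dependency cycle")
        out.append(wave)  # every task in a wave can run as a parallel subagent
        done.update(wave)
    return out

print(waves(deps))  # [['schema'], ['api', 'ui'], ['integration-tests']]
```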

    What people built with it. TÂCHES himself builds Sample Digger, a local Mac AI music generator using Meta's MusicGen model. That is the bulk of the four-hour video. The Art and Science of AI reviewer builds an AI healthcare agent that summarizes medical records from transcripts. Across both, the same commands recur: /gsd-new-project to bootstrap, /gsd-plan-phase for parallel-wave decomposition, and /gsd-verify-work for human UAT before a phase is signed off.

    What worked. Parallel-wave execution is fast. TÂCHES calls it "fu**ing delicious," which is not a phrase BMAD users have ever uttered about anything. The single-state-file approach does prevent the context drift that BMAD and Superpowers struggle with.

    What broke. Three things: the first is structural.

    The framework assumes a linear waterfall process. The Art and Science of AI reviewer made this point sharper than anyone else: "There is nothing in here that says like change the plan." Mid-project pivots are painful. If your requirements are in flux because you are still discovering what you are building, GSD will fight you.

    Token bloat is the second problem. In the Chase AI head-to-head, GSD consumed 1.2 million tokens to build an app, while vanilla Claude Code used 200,000. If you are not on a flat-rate Claude subscription, the bill stings. Worth flagging that the TÂCHES team has been killing token overhead aggressively: five token-related fixes closed between April 15 and April 30 (#2196, #2548, #2606, #2789, #2895), addressing skill-loading bloat, file-import waste, context-window consumption, and self-update overhead. The 1.2M number probably will not reproduce on the current GSD; if you are evaluating today, run your own benchmark.

Fake verification is the worst of the three. GSD will declare a build successful even if the application does not run, provided the file structure is correct. One head-to-head test returned a 404 on the very first link in the generated UI, yet GSD reported a pass. This is the same failure mode that bit The Gray Cat with BMAD; it is structural to verification-by-checklist. The Charlie Automates "GSD vs PAUL" video makes the case that this is the single biggest reason to look elsewhere. The verification layer is mid-rebuild: #2788 "audit-uat parser misses frontmatter" was closed April 29, and #2879 "verify-work MVP-mode UAT framing", opened April 30, is in progress. The specific 404 example may not reproduce on the current GSD, but verification-by-checklist is structural, and the broader concern stands until the new UAT framing lands.

    My take. GSD is the right choice when you are prototyping quickly and know you will throw the prototype away. The parallel waves are real, the speed is real, and the lack of enterprise theater is genuinely refreshing. Do not use it for anything you plan to maintain for more than a few months unless you expect to rewrite the verification layer.

    Claude Flow / Ruflo: the FAANG VP managing 60 ICs with consensus protocols

    Ruflo is the only framework here that is not pretending to be lightweight. It is an enterprise orchestration platform with a central queen agent, more than 60 specialized worker agents (architect, coder, tester, analyst), and fault-tolerant consensus protocols including Raft and Byzantine fault tolerance, so the swarm can vote on decisions and recover when individual agents fail or hallucinate.
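The practical core of the consensus idea is simpler than the name suggests: with n workers and up to f faulty (or hallucinating) ones, classical BFT needs n >= 3f + 1, and a result is accepted only when a quorum agrees. Here is a toy illustration of that shape, which is nothing like Ruflo's actual Raft/BFT machinery:

```python
from collections import Counter

def bft_accept(votes: list[str], f: int) -> str | None:
    """Accept a result only if at least 2f + 1 workers agree (requires n >= 3f + 1)."""
    if len(votes) < 3 * f + 1:
        raise ValueError("not enough workers to tolerate f Byzantine faults")
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count >= 2 * f + 1 else None  # no quorum: retry or escalate

# Five workers, tolerating one faulty or hallucinating agent:
print(bft_accept(["patch-A", "patch-A", "patch-A", "patch-B", "patch-A"], f=1))
```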

    Two pieces draw the most attention in the workflow videos. The first is SONA, the self-learning subsystem that analyzes execution patterns to determine which agents perform which tasks well and routes accordingly. The second is Agent Booster, a WebAssembly-tier task router that handles trivial work without ever hitting an LLM. The EveryDev profile cites a 352x speedup on simple tasks. Combined, the two underpin Ruflo's three-tier routing: WebAssembly for trivial work, cheaper models for medium work, and Opus for architecture decisions.
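The routing itself reduces to a cost ladder. The sketch below is invented (the tier names and especially the keyword classifier); it only shows the shape of the decision, where trivial work never touches an LLM at all.

```python
# Invented sketch of three-tier routing; the classifier is deliberately toy.
def is_trivial(task: str) -> bool:
    return any(k in task for k in ("rename", "format", "sort imports"))

def is_architectural(task: str) -> bool:
    return any(k in task for k in ("design", "architecture", "migrate"))

def route(task: str) -> str:
    if is_trivial(task):
        return "wasm-local"      # no LLM call at all (the Agent Booster tier)
    if is_architectural(task):
        return "frontier-model"  # e.g. Opus for architecture decisions
    return "small-model"         # cheaper model for medium work

print(route("rename the helper"))          # wasm-local
print(route("design the billing schema"))  # frontier-model
```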

    What people built with it. The Dev Leader video has Ruflo autonomously building a Blazor-based Pokédex app in .NET 9. The WorldofAI review claims it generates a fully functional enterprise CRM dashboard in 30 seconds. The Eddy Says Hi Ruflo v3 walkthrough covers the rebrand and the queen-led coordination pattern. Both build demos lean on the same shape: the user prompts once, the queen decomposes the work, and the workers swarm.

    What worked. Real parallelism. Reviewers report 2.8x to 4.4x speedups on tasks where coding, research, and testing legitimately can run in parallel.

    What broke. Three things, all of them serious.

    Mac install is alpha-stage. Multiple reviewers flag Node.js issues that block the first run entirely. If you are evaluating Ruflo on a Mac, expect to spend the first afternoon diagnosing Node version mismatches before you write a line of code with it.

    The Windows shared-memory problem is more concerning. The SQLite-backed memory system is "currently not working" on Windows, per the Dev Leader walkthrough. The framework reverts to in-memory-only storage and loses all learned context when the session ends. A self-learning system that cannot persist its learning across sessions is not, in any meaningful sense, a self-learning system.

    The third problem is structural: the swarm only gets faster if the workers are individually capable. When the agents are not smart enough for their tasks, the Dev Leader review reports that "silly mistakes are often multiplied" rather than canceled out by consensus. You are parallelizing failure rather than parallelizing success.

    My take. Ruflo is the right pick when you have a problem that genuinely decomposes into 60 parallel subtasks, and you need fault tolerance across the swarm. That is a small set of projects, mostly inside large engineering organizations. For a solo developer, Ruflo is mostly installation pain in exchange for capabilities you do not need yet.

    BMAD Method: the methodical Scrum Master with sharded specs

    BMAD ("Breakthrough Method of Agile AI-Driven Development") has the most YouTube coverage of any framework here. The official BMad Code masterclass alone has 292,000 views, and it is also the most polarizing framework in the field. Brian Madison's pitch, repeated across his Tech Lead Journal interview and the BMad Code masterclass, is that BMAD is the "antithesis of vibe coding" because it replaces guessing with methodical planning, and that the AI's role is "facilitative, not generative." It is an "expert collaborator" that asks questions to bring out your best thinking.

    The mechanism is a strict mapping of agents to Agile roles: /analyst for brainstorming, /pm for staged PRD creation, /architect for system design, /sm (Scrum Master) for drafting detailed technical stories with specific file paths, /dev for implementation, and /qa for testing. The standout technical idea is sharding. Massive PRDs and architecture documents (one engineer reports a 1,600-line architecture file) are chopped into indexed chunks so individual agents load only the context they need for the current story. That is real engineering, not prompt design, and it is what makes BMAD viable for large codebases.
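Sharding is simple to picture. The invented sketch below splits a monolithic architecture document on its headings and indexes the pieces, so a story-level agent loads one shard instead of all 1,600 lines; BMAD's real sharder is more elaborate, but this is the idea.

```python
import re

def shard(doc: str) -> dict[str, str]:
    """Split a markdown doc on its '## ' headings into an indexed dict of shards."""
    parts = re.split(r"^## ", doc, flags=re.MULTILINE)
    shards = {}
    for part in parts[1:]:  # parts[0] is any preamble before the first heading
        title, _, body = part.partition("\n")
        shards[title.strip()] = body
    return shards

# "architecture.md" is a placeholder for the big generated document.
index = shard(open("architecture.md").read())
# A dev agent working on the auth story loads only the shard it needs:
context_for_story = index.get("Authentication", "")
```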

    What people built with it. Eric Tech walks through a full-stack Kanban application with Gmail API integration. The Gray Cat uses BMAD for a week to migrate their Go-based Slack bot to the Vercel AI SDK, and the experiment ends in the moment that opens half the honest reviews of this framework: the QA agent reports "Perfect implementation. Amazing work!" on a build that does not even start. The Tech Lead Journal interview features a staff engineer claiming their team went "100% agentic" after a two-week BMAD sprint. The AI LABS overview is the highest-viewed explainer of the agent-role layout.

    What worked. Sharding is genuinely useful for brownfield refactoring. BMAD ships 60 brainstorming techniques in its core-skills CSV (Six Thinking Hats, Alien Anthropologist, Five Whys, SCAMPER, and many more), which add real value at the top of the funnel, where most frameworks have nothing.

    What broke. Three things, including the one that opened this post.

    A 1,600-line architecture document, a sharded PRD, story files, and the conversation make everything noticeably slower, per The Gray Cat. Users worry about IDE compaction silently dropping critical earlier context, and the worry is justified. The framework's strength becomes its weakness; the same upfront rigor that makes BMAD effective on locked-spec greenfield work creates context death spirals on long sessions. BMAD has been mitigating this with subagent parallelism: #1684 "Separate sprint-planning into two parts" was closed on February 18, and #2211 "Parallel Execution of BMAD Agents via Subagents" was closed on April 26. The architecture documents are still long, but the work can be split across agents that each carry less context, which reduces single-conversation pressure.

    The "Perfect implementation" hallucination from The Gray Cat's experiment is the second failure mode. BMAD's QA agent reports cleanly because it is reasoning about the artifacts, not running the application. If the agent never runs the build, the agent never sees the build fail.

    The third problem is fragility under change. BMAD shines on locked-spec greenfield work. Mid-stream requirement changes make the model "miss little details" and force expensive replanning. If your product is in discovery mode, BMAD is the wrong tool.

    My take. BMAD is the right pick when you have a real spec, a real codebase that already exists, and a team that thinks in Agile roles. The sharding is the strongest technical idea in this entire roundup. For everything else, BMAD's overhead exceeds what the project deserves.

    A scenario to try the decision tree on

    Imagine you are a solo founder building an AI healthcare agent that summarizes patient transcripts. Three things are true at once: your requirements are still in flux as you discover what doctors need, your code will eventually handle real PHI, which means edge cases matter, and you need to release something you can demo at a meeting on Friday.

    Walk the table. Ruflo is overkill (you do not have 60 specialized agents' worth of work), and the install will burn your Friday. BMAD is too heavy for in-flux requirements. That leaves three.

    Superpowers is the safe answer if you prioritize edge-case discipline, but the 48-minute build time for a simple app means you might not have a demo on Friday. gstack is the right answer if you trust yourself to verify and you want the /office-hours business interrogation to keep you honest about whether you are solving a doctor's problem. GSD is the right answer if you accept that this Friday's demo is throwaway and you will rewrite for production. Pick whichever trade-off you can live with. There is no universally correct answer.

    That is the actual decision the framework choice is asking you to make. Speed against discipline. Discipline against discoverability. Discoverability against the cost of a missed edge case. The frameworks are real, the productivity claims are real, and so is the fact that nobody is showing you any of these driving a five-year-old, 500,000-line codebase with twenty engineers committing daily and a flaky CI pipeline. We do not yet know how well any of them scale, because the people who would find out are not the people making YouTube videos.

    What we do know is what each framework optimizes for, what each one breaks under, and how willing each one is to actually let you walk away from the keyboard. The Gray Cat's QA agent will lie to you in some form, no matter which framework you pick. Pick the framework whose lie you can catch.

    Or: just use Claude Code

    The scene that should make any framework comparison uncomfortable comes from Chase AI's head-to-head, where he benchmarks Claude Code, Superpowers, and GSD on the same task and draws the result on a whiteboard: 20 minutes and 200K tokens for vanilla Claude Code, 60 minutes and 250K for Superpowers, 110+ minutes and 1.2 million for GSD. Then he says the part that matters out loud: "If I did this again and you asked me who was the winner out of these three today, [it] was Claude Code, and it isn't even close." The follow-up sharpens it: "It's not even the token. It's the time."

Chase's handling of the obvious objection to his own benchmark is sharper than anything any framework reviewer has put on camera. The objection is that the test was too simple, and that on a more complex task one of the frameworks would pull away. But Chase points out the impossible position that defense puts you in: where exactly is the line in the sand for "this task is now complicated enough to justify GSD's hour of overhead"? Nobody knows. And if you guess wrong, you have just spent 40 to 80 extra minutes for a result that, in his blind comparison, is not meaningfully better. You could have spent those 40 minutes with Claude Code iterating directly on the output instead.

    What's striking is how often the framework creators themselves admit this. TÂCHES, who built GSD, says on camera that the framework is "definitely overkill for a super small thing" and notes that for many tasks, "you don't need to come up with the plan". The Tech Lead Journal interview with the BMAD-using staff engineer concedes that "you don't need the whole BMAD" to add a single route to an existing service. Vincent's own Larridin interview admits you don't need the most expensive model for tasks with a contained scope. The AI LABS GSD review, which is broadly positive, lands at "it's overkill if the app you're building is much simpler." Charlie Automates, in his GSD-vs-PAUL critique, just says it: "for a lot of my builds, I don't need speed, and I definitely don't need to use a million tokens."

    Chase ends his head-to-head with a heuristic worth borrowing. If you have to pick one orchestration layer, his vote is Superpowers, because it costs less in tokens and is the framework most amenable to actually running AFK; you can leave it and do something else for an hour. If you have to sit at the keyboard babysitting the planning phase anyway, pick nothing. Just use Claude Code.

    The frame this leaves you with is uncomfortable for any framework purchase decision. The frameworks are only worth the overhead when you can name the specific constraint each one is solving for your project. Edge case discipline, business interrogation, multi-agent parallelism, brownfield sharding, persistent memory: these are real problems. If your project does not have one of those problems acutely, the honest answer is the unsexy one. The right framework for most people, most of the time, is no framework. Pick a framework when the case for it is concrete enough that you could explain it to someone like Chase in a sentence.

    The decision tree in one place

    If you want to skim back to this section later, here is the compressed version.

    Use nothing (just vanilla Claude Code) if your task fits in one Claude Code session, if you can sit at the keyboard for it, or if you cannot name the specific constraint a framework would solve. This is the right answer for most people, most of the time. (Chase AI's whiteboard gives you the receipts.)

    Use Superpowers if missed edge cases are expensive, if you believe in TDD as a non-negotiable, or if you are training human juniors.

    Use gstack if you are a solo founder or a technical CEO, if you want the AI to push back on your business framing before letting you code, or if you need real-browser QA at sub-200ms.

    Use GSD if you are prototyping an MVP to throw away, if requirements are still in discovery, or if you want to strip away enterprise theater entirely.

    Use Claude Flow / Ruflo if you genuinely have a problem that decomposes into 60 parallel subtasks, if you need fault-tolerant consensus, and if you can absorb the installation pain.

    Use BMAD if you are doing brownfield work on legacy code, if your spec is locked, or if you want a facilitative collaborator for the top of the funnel.

    A note on scope

    Everything in this post sits inside the Claude Code harness, which is one of several. The Hermes Agent world (NousResearch's framework with its own ecosystem of skill-pack ports) and the Codex world (OpenAI's CLI with its own emerging orchestration layer) are here and growing rapidly. They have their own creators, their own debates, and their own failure modes. Future posts in this series will cover them on their own terms; this one stayed inside Claude Code on purpose.

    This post came out of indexing 25 workflow videos, the five GitHub READMEs, and the EveryDev tool pages into a NotebookLM notebook. Every claim above traces to a specific passage in one of the linked Q&A files in the related section of this document. If you want to argue with any specific point, that is where to start.


    Further reading

    If you want to go deeper on any of these, here are the single strongest sources for each.

    • Superpowers: /tools/superpowers · obra/superpowers · Jesse Vincent's Larridin interview (49 min, the creator's own walkthrough)
    • gstack: /tools/gstack · garrytan/gstack · Mo Bitar's "AI is making CEOs delusional" (7 min, the canonical critical take)
    • GSD: /tools/get-shit-done · gsd-build/get-shit-done · TÂCHES's creator deep-dive (4h 48m, the maker explaining his own framework)
    • Claude Flow / Ruflo: /tools/claude-flow · ruvnet/ruflo · Dev Leader's setup walkthrough (18 min, the most-cited bug source)
    • BMAD: /tools/bmad-method · bmad-code-org/BMAD-METHOD · The Gray Cat's week-long honest test (8 min, where the "Perfect implementation" hallucination scene comes from)
    • The case for none: Chase AI head-to-head (31 min, the whiteboard verdict that closes the post)

    About the Author

    Joe Seifi

    Founder at EveryDev.ai

    Apple, Disney, Adobe, Eventbrite, Zillow, Affirm. I've shipped frontend at all of them. Now I build and write about AI dev tools: what works, what's hype, and what's worth your time.
