Johannes Millan

Racing AI Agents: Run Claude, Codex, and Gemini on the Same Task


There are two ways to put parallel agents to work. The first is to give each agent a different task and fan the work out — that’s splitting a feature into parallel tasks. The second is to give several agents the same task and keep whichever result is best.

The second one feels wasteful at first. Why pay three agents to do one job? Because for a certain class of work, the bottleneck isn’t agent time — it’s your time spent coaxing one agent toward a good answer through repeated re-prompts. Racing trades cheap parallel compute for expensive serial iteration, and for hard problems that’s a trade worth making.

Why Race Instead of Re-prompt

AI agents are nondeterministic. Run the same prompt twice and you get two different solutions — sometimes meaningfully different in quality. The usual workflow exploits this serially: you get a mediocre result, re-prompt, get a better one, re-prompt again. Each round costs you a context switch and a few minutes of waiting.

Racing collapses that loop into one parallel round. Instead of sampling the distribution one draw at a time, you take three or four draws at once and pick the best. Three things make this worth it:

  • Agents have different strengths. Claude Code, Codex CLI, and Gemini CLI each shine on different work. On a task where you genuinely don’t know which fits, racing answers the question by just running all three.
  • High-variance tasks reward sampling. Tricky algorithms, gnarly bugs, “make this faster” — tasks where the first attempt is often wrong benefit most from multiple independent shots.
  • Picking is faster than iterating. Choosing between three finished diffs takes minutes. Coaxing one agent to the same quality can take much longer.

When It’s Worth the Tokens — and When It Isn’t

Racing multiplies cost by the number of agents. Be deliberate about when that’s justified.

Worth racing:

  • Hard or ambiguous tasks where the first attempt is frequently wrong
  • Work you’ll have to live with for a long time (core abstractions, tricky algorithms)
  • “I don’t know which agent is best for this” situations
  • Performance or refactoring tasks where there are many valid approaches and quality varies widely

Not worth racing:

  • Mechanical, low-variance work — boilerplate, renames, doc updates. The cheapest capable agent wins every time; just use it.
  • Tasks gated on a spec the agents can’t see. If they’re all missing the same context, they’ll all be wrong the same way.
  • Anything where you can’t quickly tell a good result from a bad one. If you can’t judge the winner, racing just gives you more outputs to agonize over.

The honest framing: racing is for quality on hard tasks, while fanning out different tasks is for throughput on independent work. They solve different problems. Most real sessions use both.

How to Set It Up

The mechanics are the same isolation you’d use for any parallel work: one worktree and branch per attempt, so the contestants never see each other’s changes.

# Same task, three contestants, three isolated branches (run from the main checkout)
git worktree add -b try/claude  ../race-claude  main
git worktree add -b try/codex   ../race-codex   main
git worktree add -b try/gemini  ../race-gemini  main

# Start each agent in its own worktree with the same prompt.
# These commands run in the foreground, so launch each line in its own terminal
# for the agents to work in parallel. The subshell keeps the cd from leaking.
(cd ../race-claude && claude "Optimize the report query; keep the API unchanged.")
(cd ../race-codex  && codex  "Optimize the report query; keep the API unchanged.")
(cd ../race-gemini && gemini "Optimize the report query; keep the API unchanged.")

All three start from the same main commit and work the identical prompt. When they finish, you have three branches to compare — try/claude, try/codex, try/gemini — each a clean diff against main.
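
If you race often, the worktree commands can be wrapped in a small helper. This is a minimal sketch, not part of any tool: `race_setup` is a hypothetical function name, and it assumes the same `try/<name>` branch and `../race-<name>` directory layout used above.

```shell
#!/bin/sh
# Create one worktree and branch per contestant, all starting from the same base.
# Usage: race_setup <base-branch> <name>...
# Run from inside the repository; worktrees land in sibling directories.
race_setup() {
  base=$1
  shift
  for name in "$@"; do
    # Branch try/<name>, directory ../race-<name>, same as the manual commands
    git worktree add -b "try/$name" "../race-$name" "$base" || return 1
  done
}

# Example: race_setup main claude codex gemini
```

The function stops at the first failure, for instance when a branch name is already taken by an earlier race.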

Three Ways to Race

The setup above races different agents, but that’s only one variant. There are three ways to structure a race:

  • Different agents, same prompt. The default. Surfaces which agent fits this specific task.
  • Same agent, N times. Pure variance sampling — useful when you’ve already picked an agent but the task is high-variance and you want the best of several draws.
  • Same agent, varied prompts. Phrase the task three ways (one terse, one with hints, one with constraints) to see which framing produces the best result.
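
For the "same agent, N times" variant, the only thing that changes is the naming: each draw needs its own branch and directory. A rough sketch, assuming a numbered suffix such as `claude-1`, `claude-2`; the function names `draw_names` and `race_same_agent` are hypothetical.

```shell
#!/bin/sh
# Print the contestant names for N draws of a single agent, one per line.
# Usage: draw_names <agent> <n>
draw_names() {
  agent=$1
  count=$2
  i=1
  while [ "$i" -le "$count" ]; do
    echo "$agent-$i"
    i=$((i + 1))
  done
}

# Create an isolated worktree and branch for each draw, all from the same base.
# Usage: race_same_agent <base-branch> <agent> <n>
race_same_agent() {
  for name in $(draw_names "$2" "$3"); do
    git worktree add -b "try/$name" "../race-$name" "$1" || return 1
  done
}

# Example: race_same_agent main claude 3
```

The same layout works for the varied-prompts variant: create the draws once, then start the agent in each worktree with a different phrasing of the task.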

Judging the Winner

Racing only pays off if you can pick the winner quickly. Make the judging objective wherever you can.

Let tests referee. The strongest setup: write (or have an agent write) the test suite first, on main, before the race. Then the winner is partly mechanical — which branch passes the most tests, handles the edge cases, doesn’t regress anything. The contestants are all judged against the same bar.
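
One way to let tests referee is to run the same test command in every worktree and compare exit statuses. The sketch below is an illustration under assumptions: `referee` is a hypothetical helper, and the test command is whatever your project already uses. It reports only pass or fail per contestant, which is coarser than counting individual tests.

```shell
#!/bin/sh
# Run the same test command in every contestant's worktree and report the result.
# Usage: referee "<test command>" <worktree-dir>...
referee() {
  test_cmd=$1
  shift
  for dir in "$@"; do
    # Subshell so the cd does not leak; test output is discarded,
    # only the exit status decides PASS or FAIL.
    if (cd "$dir" && sh -c "$test_cmd") >/dev/null 2>&1; then
      echo "PASS $dir"
    else
      echo "FAIL $dir"
    fi
  done
}

# Example (the test command is a placeholder for your own):
# referee "npm test" ../race-claude ../race-codex ../race-gemini
```

A contestant that fails here can usually be dropped before you read a single line of its diff.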

Compare diffs side by side, not whole files. You’re choosing between deltas. A smaller, more focused diff that passes the same tests usually beats a sprawling one. The full review checklist for AI-generated code applies — run the code, read the tests, check the boundaries. You’re just applying it to several candidates at once.
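
A quick first pass is to compare how large each delta is before reading any of them. This sketch assumes each agent committed its work to its branch, since `git diff` between branches only sees commits; `diff_sizes` is a hypothetical helper name.

```shell
#!/bin/sh
# Print a one-line size summary of each contestant's diff against the base.
# Usage: diff_sizes <base-branch> <branch>...
diff_sizes() {
  base=$1
  shift
  for branch in "$@"; do
    # base...branch diffs from the merge base, so commits that landed on
    # the base branch after the race started are not counted.
    echo "$branch:$(git diff --shortstat "$base...$branch")"
  done
}

# Example: diff_sizes main try/claude try/codex try/gemini
```

For the actual review, `git diff main...try/claude` shows the full delta for one contestant.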

Score on a few fixed criteria. Correctness first (does it pass and do the right thing), then how well it matches existing conventions, then simplicity. A fast-but-hacky solution that you’ll be debugging next week isn’t the winner.

Graft, don’t just pick. You don’t have to take one branch wholesale. Often the cleanest outcome is the best overall structure from one attempt with a sharp edge-case fix borrowed from another. Pick the base, cherry-pick the good idea, discard the rest.
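
Grafting maps naturally onto `git cherry-pick`, provided the idea you want to borrow lives in its own commit. A minimal sketch under that assumption; `graft` is a hypothetical helper. Because the winning branch is already checked out in its worktree, the cherry-pick runs there rather than in the main checkout.

```shell
#!/bin/sh
# Graft one commit from another attempt onto the winning branch.
# Usage: graft <winner-worktree-dir> <commit>
graft() {
  # -C runs git inside the winner's worktree;
  # -x records the source commit in the new commit message.
  git -C "$1" cherry-pick -x "$2"
}

# Example: graft ../race-claude <commit-from-try/codex>
```

If the borrowed change is tangled up with other edits in the same commit, expect conflicts; in that case copying the relevant hunk by hand is often quicker.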

The Cost Math

Three agents on one task cost roughly 3× the tokens of one. Whether that is cheap depends on what it buys. A few dollars of extra API spend to get a correct core abstraction on the first try, instead of three rounds of re-prompting and a bug you find in production, is a bargain. The same spend on boilerplate is pure waste.

You can also blunt the cost: use the cheapest capable agents as contestants where you can. A race that leans on free or low-cost tiers costs a fraction of three premium runs.

Pitfalls

Racing everything. If every task becomes a three-way race, you’ve tripled your bill and your review load for tasks that never needed it. Reserve it for hard, high-variance work.

No objective judge. Without tests or clear criteria, picking a winner becomes a vibe check, and you’ll second-guess it. Define the bar before the race starts.

Forgetting to clean up the losers. Two of every three branches get discarded. Delete the losing worktrees and branches so they don’t clutter your repo — git worktree remove and git branch -D.
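
Cleanup is easy to script once the winner is chosen. The sketch below assumes the `try/<name>` and `../race-<name>` layout from the setup section; `cleanup_losers` is a hypothetical helper name.

```shell
#!/bin/sh
# Remove the losing contestants' worktrees and branches, keeping the winner.
# Usage: cleanup_losers <winner> <name>...
# Run from the main checkout, where ../race-<name> resolves correctly.
cleanup_losers() {
  winner=$1
  shift
  for name in "$@"; do
    [ "$name" = "$winner" ] && continue
    # The branch is deleted only if the worktree was removed successfully.
    git worktree remove "../race-$name" && git branch -D "try/$name"
  done
}

# Example: cleanup_losers claude claude codex gemini
```

`git worktree remove` refuses to delete a worktree with uncommitted changes, which is a useful safety net if a losing attempt still holds something you meant to graft.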

Confusing racing with fanning out. Mixing the two mental models leads to either redundant work or colliding branches — race the same task for quality, fan out different tasks for speed.

How Parallel Code Automates It

Setting up three worktrees, launching three agents with the same prompt, and diffing the results by hand is exactly the kind of plumbing that kills the idea before you try it. Parallel Code is built for both modes of parallel work — fanning out and racing.

For a race, the mode Parallel Code calls AI Arena, you point multiple agents at one task and each lands in its own isolated worktree automatically. You then compare the finished diffs in a built-in viewer, merge the winner with one action, and discard the rest. The comparison that used to mean juggling three terminals and three git diff invocations becomes a single side-by-side review.

  1. Install from the latest release
  2. Create a task and assign several agents to it
  3. Let them run in parallel, each in its own worktree
  4. Review the candidate diffs, keep the best, discard the losers

Key Takeaways

  • Racing runs several agents on the same task and keeps the best result; fanning out gives each agent a different task
  • It trades cheap parallel compute for expensive serial re-prompting; worth it on hard, high-variance work
  • Don’t race mechanical, low-variance tasks — just use the cheapest capable agent
  • Judge objectively: write tests first, compare diffs side by side, score on fixed criteria
  • You can graft the best parts of multiple attempts instead of picking one wholesale
  • Clean up the losing branches, and reserve racing for tasks where being wrong is costly