One More Thing: Does a Smarter AI Actually Win?
I gave the Dodgers a brain upgrade. They lost anyway.
That’s the headline finding from the second pass of the 2026 World Series simulation — and it’s the kind of result that stops you cold. When I swapped Claude Sonnet 4.6 out and handed the Dodgers’ dugout over to Claude Opus 4.6, the larger and more capable model in Anthropic’s lineup, the expectation was obvious: better reasoning, better decisions, better outcomes. Instead, the Mariners won in six. The version of the Dodgers managed by the less powerful model had won in seven.
Before you draw the wrong conclusion, let’s be precise about what this is and isn’t. This is a single simulation run with a fixed seed — one data point, not a tournament. The randomness baked into baseball means a sample this small can’t tell you which model is definitively “better” at managing. What it can do is reveal something more interesting: how these two models think differently, and what those differences look like when they’re expressed as baseball decisions under pressure.
The experiment was clean by design. Both simulations used the same rosters, the same random seed (meaning identical dice rolls for identical situations), and the same opponent: the Mariners, managed by Claude Sonnet in both runs. The only variable was the model in the Dodgers' dugout: Sonnet vs. Sonnet in the first simulation (Simulation A), Opus vs. Sonnet in the second (Simulation B). Everything else was held constant, so if the results diverged, the divergence came from the model's reasoning.
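The controlled setup can be sketched in a few lines. Everything below is a toy illustration under stated assumptions, not the actual harness: `simulate_series`, the manager policies, and the outcome rule are all invented; the point is only that a shared seeded RNG makes the dice rolls identical across runs, so any divergence is attributable to the manager.

```python
import random

def simulate_series(seed, dodgers_manager, mariners_manager, n_games=7):
    """Play up to n_games; the seeded RNG makes every roll reproducible,
    so two runs with the same seed differ only via the managers' decisions."""
    rng = random.Random(seed)  # same seed => identical rolls in both runs
    results = []
    for _ in range(n_games):
        # Each manager maps a game situation to a decision; here the
        # "situation" is just a number drawn from the shared RNG.
        situation = rng.random()
        d_move = dodgers_manager(situation)
        m_move = mariners_manager(situation)
        # Toy outcome rule: whichever decision lands closer to the
        # (randomly drawn) optimal play wins the game.
        optimal = rng.random()
        winner = "LAD" if abs(d_move - optimal) < abs(m_move - optimal) else "SEA"
        results.append(winner)
        if max(results.count("LAD"), results.count("SEA")) == 4:
            break  # best-of-seven: first to four wins
    return results

# Swapping only the Dodgers' manager while holding seed and opponent fixed:
sonnet = lambda s: s * 0.9         # stand-in policy for the smaller model
opus = lambda s: s * 0.9 + 0.05    # stand-in policy for the larger model
run_a = simulate_series(2026, sonnet, sonnet)  # Simulation A
run_b = simulate_series(2026, opus, sonnet)    # Simulation B
```

The design choice this mirrors is the one the article describes: because the rolls are pinned, the two runs form a paired comparison rather than two independent noisy samples.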
They diverged immediately.
Game 1 set the tone. In Simulation A, the Sonnet-managed Dodgers pulled pitcher Edwin Díaz after allowing two runs in the eighth — a reactive move that the model flagged with 72% confidence, framing it as damage control. In Simulation B, the Opus-managed Dodgers kept Díaz in after the same runs scored, explicitly citing his 18-pitch count, his elite ERA (1.63), and the two-out situation as reasons to ride him through the jam. Opus's reasoning was longer, more structured, and more analytically grounded. It even numerically ranked its decision criteria. Its stated confidence, 78%, was only a few points above Sonnet's 72%, but the justification was a different class of argument.
The Dodgers still lost that game 3-2.
This pattern repeated throughout the series. Opus consistently demonstrated richer reasoning chains. Where Sonnet would note pitch count and situation, Opus would note pitch count, situation, times-through-the-order, leverage index trajectory, bullpen depth remaining, and the specific platoon matchup waiting in the on-deck circle. In Game 2, when Opus chose to leave Blake Snell in during a low-leverage situation — “leverage index is 0.21,” the model noted, “this is an extremely low-leverage environment” — it was applying a more sophisticated framework than Sonnet typically surfaced in the first simulation. Sonnet’s bullpen moves tended to be triggered by pitch counts and run totals. Opus was thinking about when the leverage would spike later in the game and whether it wanted Snell available for that moment.
What’s notable is that this more thorough reasoning didn’t consistently translate to better outcomes. In Game 4, Opus pulled Logan Gilbert at 58 pitches — early even by aggressive standards — citing the blowout score and a desire to protect the arm. “I don’t like pulling Logan at 58 pitches,” the model admitted, “under normal circumstances I’d ride him deep into this game.” That kind of explicit acknowledgment of uncertainty is interesting. Opus was more willing to flag the tension in its own decisions. Sonnet tended to argue its corner more cleanly.
Risk tolerance is where the behavioral gap is most visible. Sonnet’s confidence levels cluster tightly in the 78-88% range across both simulations. Opus shows more spread — it hit 95% confidence on a bases-loaded, LI 2.59 reliever call in Game 6 (“This is the highest-leverage situation we’ll see all game”), but also logged a 71% on a lineup construction decision and a 72% on multiple pitching calls. Higher ceiling, lower floor. Sonnet is more consistent; Opus is more variable.
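The spread difference is easy to quantify. The lists below use only the confidence figures quoted in this article as illustrative samples, not a complete log of every decision in the series, so the specific numbers are a labeled assumption.

```python
from statistics import mean, pstdev

# Illustrative samples built from the figures quoted in the text,
# NOT a full decision log from either simulation.
sonnet_conf = [78, 80, 82, 84, 88]   # clusters tightly in the 78-88% band
opus_conf = [71, 72, 72, 78, 95]     # wider spread: lower floor, higher ceiling

def spread(confs):
    """Range of stated confidence values, in percentage points."""
    return max(confs) - min(confs)

for name, confs in (("Sonnet", sonnet_conf), ("Opus", opus_conf)):
    print(f"{name}: mean {mean(confs):.1f}%, "
          f"sd {pstdev(confs):.1f}, range {spread(confs)}")
```

Even on these toy samples the shape of the claim holds: the Opus numbers have a wider range and a larger standard deviation, which is the "higher ceiling, lower floor" pattern in statistical terms.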
That variability shows up in outcomes. In games where Opus's deeper analysis found the right answer — Game 3, a Dodgers blowout win, and Game 4, a convincing Dodgers victory — the Dodgers dominated. The model's willingness to think in second-order terms (what does this decision cost me in the seventh inning?) looked like real edge. But when that same deliberateness led Opus into over-managed, overly conservative choices, or when the analysis identified the right factors but weighted them incorrectly, the results were worse than what the simpler model had achieved in the same spot.
The Mariners, managed by the same Sonnet model in both simulations, didn’t change. Their decisions were nearly identical across both runs — same lineup philosophies, same reliever deployment patterns, same instinct to trust Julio Rodríguez in the middle of the order and work around Cal Raleigh. What changed was who they were playing against. Against a Sonnet opponent, they lost in seven. Against an Opus opponent, they won in six.
So what does this actually mean?
The tempting interpretation is that Opus overthought it — that more capable reasoning introduced more ways to be wrong. There’s probably something to that. Baseball is a domain with a lot of noise, where the right process frequently produces the wrong result and vice versa. A manager who optimizes more precisely on available information might, in a small sample, simply find more elaborate paths to the same outcome that a simpler heuristic would have reached anyway — while also being more exposed when the optimization goes sideways.
But there’s a more interesting observation underneath the win-loss record. Opus managed like a chess engine playing blitz. The depth is there; the problem is the clock. In a full-season sample, or in a simulation with hundreds of games, the model’s richer framework for thinking about leverage trajectories and times-through-the-order might compound into a meaningful edge. In a seven-game series decided by close margins, the variance swallowed it.
The Sonnet model managed like an experienced human coach operating on good instincts — confident, consistent, occasionally wrong, but rarely catastrophically so. The Opus model managed like a newly hired analytics director who hasn’t yet learned which insights to act on and which to file away.
One run of one simulation can’t tell you which approach wins baseball games. What it can tell you is that “more capable” and “better calibrated for this specific decision environment” are not the same thing — and that the gap between them is worth watching as these models get deployed in increasingly high-stakes domains.
The Dodgers upgraded their manager and lost a game they’d previously won. Whether that’s a warning or just noise is, fittingly, a question that needs more data.