GPT 5.5 outruns Claude Fable 5 on ALE leaderboard

AI | June 11, 2026 at 01:30 AM

Source: VentureBeat

TLDR (Berkeley): GPT 5.5 via Codex scored 24% on Berkeley RDI’s Agents’ Last Exam (ALE), beating Claude Fable 5 at 22%. The benchmark targets real, long-horizon professional work, and pass rates at its hardest tier remain low.

Key Takeaways:

ALE replaces brittle text tests with Generalist Computer Use Agent trials across the Brain, Eyes, Body, Hands, and Feet layers.
On the ALE Leaderboard, Codex GPT 5.5 led with a 24.0% pass rate, followed by Ale Claw at 23.0% and Claude Fable 5 at 22.0%.
Many models scored 0.0% on the hardest Last Exam tasks, and ALE limits benchmark contamination by keeping more than 1,300 tasks private and rotating them.

This is the kind of benchmark win that feels more like a missing baseline than a victory lap. When the hardest tier collapses to near zero, the scorecard doubles as a warning label for agent hype.

Q&A

Why does ALE’s anti-cheating setup matter more than higher model IQ scores?

Because the evaluation checks whether an agent can complete verifiable artifacts inside real tool and UI loops, not just produce plausible text. Grading loopholes and memorized answers stop being reliable shortcuts.

What changes for developers when a benchmark rewards Eyes and Hands, not just reasoning tokens?

They need training and agent orchestration that emphasize tool selection, UI interaction, and execution reliability in virtual desktop workflows, not only prompt adherence.

If Claude Fable 5 underperformed despite high expectations, what failure mode does ALE amplify?

The pattern described is step abandonment in multi-part workflows. ALE’s long-horizon structure punishes agents that forget required stages and then fail later verifications.

How likely is benchmark contamination to return if private tasks eventually rotate into the public set?

ALE’s rotation reduces the window for memorization, but it cannot make data leakage impossible. The goal is to keep evaluation useful across model generations by limiting stable, reusable test sets.

What should enterprises ask before trusting an ALE score for real customer deployments?

They should compare Full and Unlicensed scores, check performance on CAD tasks and on tasks that depend on licensed data, and demand evidence that the agent keeps completing workflows end to end under realistic constraints.
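
One way to act on that advice is to break a single headline score into per-category, end-to-end pass rates. The snippet below is a minimal sketch, not ALE's actual scoring code: the category names, the record layout, and the task outcomes are all hypothetical and invented purely for illustration. It assumes a task counts as passed only when every one of its verification steps passes.

```python
from collections import defaultdict

# Hypothetical task results, invented for illustration only (not ALE data).
# Each record: (category, list of per-step verification outcomes).
results = [
    ("cad", [True, True, False]),
    ("cad", [True, True, True]),
    ("licensed_data", [True, False]),
    ("office", [True, True]),
    ("office", [True, True, True]),
]


def pass_rates(records):
    """Return {category: fraction of tasks whose every step passed}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, steps in records:
        totals[category] += 1
        # End to end: a single failed step fails the whole task.
        if all(steps):
            passed[category] += 1
    return {category: passed[category] / totals[category] for category in totals}


rates = pass_rates(results)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```

A breakdown like this makes it harder for a strong category to hide a weak one, which is the concern behind asking about CAD and licensed-data tasks separately.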
