TLDR: BERKELEY - GPT 5.5 via Codex scored 24% on Berkeley RDI’s Agents’ Last Exam (ALE), beating Claude Fable 5 at 22%. The benchmark targets real, long-horizon professional work, and it exposes low pass rates at the hardest tier.
Key Takeaways:
- ALE replaces brittle text tests with Generalist Computer Use Agent trials across the Brain, Eyes, Body, Hands, and Feet layers.
- On the ALE Leaderboard, Codex gpt 5.5 led with a 24.0% pass rate, followed by Ale Claw at 23.0% and Claude Fable 5 at 22.0%.
- The hardest Last Exam tasks hit 0.0% for many models, and ALE limits benchmark contamination by keeping more than 1,300 tasks private and rotating them.
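To put the leaderboard margins in perspective, here is a minimal Python sketch that ranks the three pass rates quoted above and reports each entry's gap to the leader in percentage points. The function name ranked_with_gaps and the dictionary layout are illustrative assumptions, not part of ALE's tooling; only the three pass rates come from the leaderboard figures above.

```python
# Pass rates (percent) as quoted in the takeaways above.
leaderboard = {
    "Codex gpt 5.5": 24.0,
    "Ale Claw": 23.0,
    "Claude Fable 5": 22.0,
}


def ranked_with_gaps(scores):
    """Sort entries by pass rate (highest first) and attach each entry's
    gap to the leader, in percentage points. Hypothetical helper."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_rate = ordered[0][1]
    return [(name, rate, round(top_rate - rate, 1)) for name, rate in ordered]


for name, rate, gap in ranked_with_gaps(leaderboard):
    print(f"{name}: {rate:.1f}% pass rate, {gap:.1f} points behind the leader")
```

The whole podium sits within two percentage points, which is worth keeping in mind before reading the ranking as a decisive win.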
This is the kind of benchmark win that feels more like a missing baseline than a victory lap. When the hardest tier collapses to near zero, the scorecard doubles as a warning label for agent hype.
Q&A
Why does ALE’s anti-cheating setup matter more than higher model IQ scores?
Because the evaluation checks whether an agent can complete verifiable artifacts inside real tool and UI loops, not just produce plausible text. Loophole grading and memorized answers stop being reliable shortcuts.
What changes for developers when a benchmark rewards Eyes and Hands, not just reasoning tokens?
They need training and agent orchestration that emphasizes tool selection, UI interaction, and execution reliability in virtual desktop workflows, not only prompt adherence.
If Claude Fable 5 underperformed despite high expectations, what failure mode does ALE amplify?
The described pattern is step abandonment in multi-part workflows. ALE’s long-horizon structure punishes agents that forget required stages and then fail later verifications.
How likely is benchmark contamination to return if private tasks eventually rotate into the public set?
ALE’s rotation reduces the window for memorization, but it cannot make data leakage impossible. The goal is to keep evaluation useful across model generations by limiting stable, reusable test sets.
What should enterprises ask before trusting an ALE score for real customer deployments?
They should compare Full and Unlicensed scores, check performance on CAD tasks and tasks that depend on licensed data, and demand evidence that the agent keeps completing workflows end-to-end under realistic constraints.