TLDR: Ethan Mollick tests Claude 5 Fable and says it beats prior models by a wide margin, executing hours-long research and coding tasks with minimal user control. It matters because output quality rises while the decision process stays opaque and token costs spike, even as security guardrails block cybersecurity use.
Key Takeaways:
- Mollick tested Mythos Claude Fable with its security guardrails in place, focusing on coding, research, math, and multi-step execution rather than the blocked cybersecurity uses.
- In Claude Code, Fable built an 1881-style isochrone map after spawning agents and pulling more than 2,200 flight, train, and road data points, then refined remote routes with adversarial checks.
- Fable delivered a nine-and-a-half-hour calibration tool for human and AI judgments, but its black-box choices and high token burn shift users from steering processes to commissioning outcomes.
The wow factor is real, but the eerie part is who gets to see the steering wheel. With Mythos, you commission a whole studio, then watch finished work arrive.
Q&A
If models hide their internal decision making, how will teams audit errors in high stakes outputs?
They may rely on transcripts when available, independent re-runs, source lists, and external test suites that validate assumptions rather than trusting the model's reasoning narrative.
Why do projects like maps and calibration software reveal more than benchmark games?
They force the model to juggle research, math, judgment calls, and code execution together, exposing how well delegation and verification work when stakes and edge cases pile up.
What happens when token costs keep rising while interfaces still limit midstream steering?
Users will try shorter, more structured prompts, demand reusable components, and push for better control surfaces that interrupt workflows before expensive mistakes lock in.
How could guardrails that block cybersecurity use shape future model deployment?
Teams may treat safety filters as a product feature that reroutes suspicious requests to weaker models, which can frustrate users while still preventing dangerous misuse.
If the human role shifts toward commissioning outcomes, what becomes the new bottleneck for adoption?
Specification quality and evaluation. People who can translate messy goals into testable targets and validate results will matter as much as people who can write prompts.