MiniMax Sparse Attention pushes LLM speed limits in M3

AI | May 27, 2026 at 10:00 PM

Source: VentureBeat

TLDR: MiniMax’s upcoming M3 adds MiniMax Sparse Attention, claiming 15.6x faster decoding at 1 million tokens versus M2.

Key Takeaways:

MiniMax’s M2 family uses sparse Mixture of Experts and full multi-head attention to preserve multi-hop reasoning at long contexts.
M3’s MiniMax Sparse Attention keeps real, uncompressed keys and values on a grouped-query attention (GQA) backbone and uses block-level selection to cut compute bottlenecks.
Early testing targets 15.6x faster decoding at 1 million tokens, aiming to make ultra-long-context agent work economically practical.

If M2 was the “we will not break reasoning” vow, M3 looks like the “okay, now we’ll stop paying the worst compute tax” pitch. The real test is whether agent teams can exploit the speed without losing the thread of a long document.

Q&A

What could break first if M3 accelerates decoding so aggressively?

Long-context models can trade stability for speed. Expect scrutiny on error recovery, tool-call correctness, and consistency across multi-step agent trajectories.

Why does decoding get more expensive as prompts grow?

Each generated token attends to every prior token, so each decoding step re-reads the stored keys and values for the whole context, and the cost per token grows with prompt length. Faster decoding methods target that repeated backward look.
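To make that growth concrete, here is a minimal Python sketch that only counts how many cached positions a dense attention layer reads while generating tokens. The function name `dense_decode_reads` and the token counts are illustrative assumptions, not figures from MiniMax, and the count ignores layers, heads, and hardware effects.

```python
def dense_decode_reads(prompt_len: int, new_tokens: int) -> int:
    """Count cached key/value positions read while generating new_tokens.

    Assumes dense attention with a KV cache: every decoding step looks
    back at all positions currently in the cache.
    """
    total = 0
    context = prompt_len
    for _ in range(new_tokens):
        total += context  # dense attention scans every cached position
        context += 1      # the freshly generated token joins the cache
    return total


# Hypothetical prompt sizes, chosen only to show the scaling.
short_prompt = dense_decode_reads(1_000, 100)
long_prompt = dense_decode_reads(1_000_000, 100)
print(short_prompt, long_prompt, round(long_prompt / short_prompt))
```

Under these assumptions, generating 100 tokens after a 1,000,000-token prompt reads roughly 950 times more cached positions than after a 1,000-token prompt, which is the repeated backward look that sparse methods try to shrink.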

How does keeping real keys and values change the design challenge?

It suggests M3 avoids precision loss from compressing representations. That likely helps accuracy and prefix caching, but still demands careful block selection to prevent new bottlenecks.
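The general idea of block-level selection over uncompressed keys and values can be sketched in a few lines of Python. This is not MiniMax's implementation, which the source does not describe. The scoring rule (query dot product with each block's mean key), the `block_size` and `top_k` parameters, and the function names are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size=4, top_k=2):
    """Single-query attention over the top_k highest-scoring key blocks.

    Assumption: a block is scored by the query's dot product with the
    block's mean key. The mean is used only to pick blocks; attention
    itself runs on the original, uncompressed keys and values.
    """
    n, d = len(keys), len(q)
    scored = []
    for start in range(0, n, block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), start))

    # Keep the top_k blocks, then restore positional order.
    chosen = sorted(start for _, start in sorted(scored, reverse=True)[:top_k])
    idx = [i for s in chosen for i in range(s, min(s + block_size, n))]

    # Standard scaled dot-product softmax, restricted to selected positions.
    scores = [dot(q, keys[i]) / math.sqrt(d) for i in idx]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    out = [
        sum(w * values[i][c] for w, i in zip(weights, idx)) / norm
        for c in range(len(values[0]))
    ]
    return out, idx
```

With `top_k` large enough to cover every block, the function reduces to ordinary dense attention, so the selection step is the only place where quality can be lost. That is why the choice of blocks matters as much as the speedup.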

Will M3’s approach work well for “agent” workloads, not just benchmarks?

Agents mix reasoning with tool execution and branching. If block selection stays coherent across changing states, teams could see fewer stalls and quicker retries during failures.

What happens next for the M series if ultra-long context becomes cheap?

Expect competition to shift from raw context length to cost per successful task, including synthesis quality, latency in interactive chat, and reliability of autonomous workflows.
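One simple way to read "cost per successful task" is cost per attempt divided by success rate. The sketch below assumes failed attempts are retried independently until one succeeds, and the function name and dollar figures are hypothetical placeholders, not measurements of any model.

```python
def cost_per_successful_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one success, assuming independent retries.

    With success probability p per attempt, the expected number of
    attempts is 1 / p, so expected cost is cost_per_attempt / p.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical comparison: a cheaper but less reliable run can still lose.
print(cost_per_successful_task(2.0, 0.5))  # cheaper attempt, fails half the time
print(cost_per_successful_task(3.0, 1.0))  # pricier attempt, always succeeds
```

In this made-up comparison the cheaper attempt ends up costing more per completed task, which is the kind of trade-off the metric is meant to expose.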