And soon there will be more, since yesterday DeepSeek gave us R1-0528. The very early response has been muted, but that does not tell us much either way. DeepSeek themselves call it a ‘minor trial upgrade.’ I am reserving coverage until next week to give people time.
You’ve probably seen this by now, but benchmark results are up[1] and on the usual stuff it’s roughly equivalent to gemini-2.5-pro-0506. Huge gains over the original R1.
I vibe checked it a bit yesterday and came away impressed: it seems much better at writing, instruction-following, and all that other stuff we expect out of LLMs that doesn’t fall under “solving tricky math/code puzzles.”
(On my weird secret personal benchmark, it’s about as bad as most recent frontier models, worse than older models but no longer at the very bottom like the original R1 was.)
There is also the fact that, unlike o3, Claude Opus 4 (16K) scored only 8.6% on the ARC-AGI test. If DeepSeek is evaluated on ARC-AGI and also fails to exceed 10%, that could imply that CoT alone is not enough to handle ARC-AGI-2 level problems, much as GPT-like architectures until recently seemed unable to handle ARC-AGI-1 level problems (however, Claude 3.7 and Claude Sonnet 4 scored 13.6% and 23.8% without chain of thought; what algorithmic breakthroughs did they use?). It would also suggest that subsequent breakthroughs will come from applying neuralese memos, multiple CoTs (see the last paragraph in my post), or text-like memos. Unfortunately, neuralese memos cost interpretability.
[1] Self-reported by DeepSeek, but a third-party reproduction already exists for at least one of the scores (LiveCodeBench, see leaderboard and PR).