There is also the fact that, unlike o3, Claude Opus 4 (16K) scored only 8.6% on the ARC-AGI-2 test. If DeepSeek is evaluated on ARC-AGI-2 and also fails to exceed 10%, it could imply that CoT alone isn't enough to handle ARC-AGI-2-level problems, just as GPT-like architectures, until recently, seemed unable to handle ARC-AGI-1-level problems (though Claude 3.7 and Claude Sonnet 4 scored 13.6% and 23.8% without chain of thought; what algorithmic breakthroughs did they use?), and that subsequent breakthroughs will come from applying neuralese memos, multiple CoTs (see the last paragraph of my post), or text-like memos. Unfortunately, neuralese memos come at the cost of interpretability.