I run AI transformation work in Hamburg and have long been thinking about what AI alignment means for humanity. AI 2027 resonated deeply with me: it was the first time some of my thoughts and worries were translated into concrete, evidence-based, falsifiable predictions. So I built the AI 2027 Tracker (https://ai2027-tracker.com) to keep score: 53 predictions extracted, tracked individually, and updated weekly.
I believe AI is potentially the last human invention. I also believe the gap between “taking this seriously” and “not taking this seriously” will determine which companies, institutions, and societies make it through. I want to help close that gap.
Good question. I should probably have been more precise: I don’t think all capability claims are behind, and I agree that some headline benchmark/revenue claims now look broadly on time or even stronger than expected.
The places I had in mind were more specific:
1. SWE-bench timing/comparability. The 85% numerical threshold now looks plausibly crossed in self-reported and leaderboard terms, but it arrived roughly 10-12 months after the mid-2025 target, and results are hard to compare across scaffolding and eval setups.
2. RE-Bench / AI R&D engineering. I have not seen a clean published 1.3+ RE-Bench result. METR time-horizon evidence is very encouraging, but I would not treat it as equivalent to the specific research-engineering benchmark target.
3. R&D productivity multiplier. This is the big one for me. The evidence for AI being useful inside AI labs is strong, but a clean public demonstration of a 1.5x AI R&D multiplier still seems missing. This is also where the authors’ later timeline revisions seem most relevant.
4. Training compute scale. I don’t treat this as a current falsification, since the 10^28 FLOP run is really a 2027 completion claim, but public estimates still look meaningfully below the aggressive compute path.
On Cybench and OSWorld specifically, I’m less confident calling them “behind.” OSWorld’s 65% target looks basically confirmed, just late; the 80% early-2026 target is the part I’d still watch. Cybench also looks much stronger after the newer Mythos/Opus results, though I still care about the difference between subset or system-card results and uniform public evals.
So my shorter answer is: if “capabilities” means the broad direction of benchmark movement, I agree things look broadly on time. If it means the specific chain from benchmark scores → reliable long-horizon work → AI R&D acceleration, I think the evidence is still mixed, and some key claims are late or not yet cleanly demonstrated.