An AI company I’ve never heard of, called AGI, Inc, has a model called AGI-0 that has achieved 76.3% on OSWorld-verified. This would qualify as human-level computer use, at least by that benchmark. The result appears on the official OSWorld-verified leaderboard. It does seem like they trained on the benchmark, which could explain some of this. I am curious to see someone test this model.
This is a large increase over the previous state of the art, which had been climbing rapidly since Claude Sonnet 4.5's September 29th release. At that point, Claude achieved 61.4% on OSWorld-verified. A scaffolded GPT-5 achieved even higher, 69.9%, on October 3rd. Now, on October 21st, AGI-0, seemingly a frontier computer-use model, has outpaced them all, surpassing the human baseline in the process.
AI-2027 projected a 65% score on OSWorld for August 2025. It predicted frontier models scoring 80% on OSWorld privately in December 2025, with models achieving that score becoming publicly available in April 2026. This 76.3% on OSWorld-verified covers more than two thirds of the gap between the expected August capabilities and the 80% benchmark, even though we are only about a third of the way from August 2025 to the expected public release of a model with those capabilities. Assuming this isn't just benchmark overfitting, the real world is even with or ahead of AI-2027 on this computer-use benchmark.
Even more notably, AI-2027 projected this 80% benchmark would be met by “Agent 1”, their hypothetical leading agentic AI model at the end of 2025. It seems surprising that a frontier model from a new company would come this close without any of the main players’ (OpenAI, Anthropic, Google) unscaffolded models scoring much above 61%. A lot to be curious and skeptical about here.
Update: the result has been removed from the OSWorld-verified leaderboard, but the company still claims to have achieved it, and their results are downloadable.