I’ve discovered that generating a video of yourself with Sora 2 saying something like ‘this video was generated by AI’ is freaky enough to make people who know you well, especially those skeptical about AI capabilities, start to freak out a bit.
Thought this might be a useful tactic for others trying to persuade people to tune in rather than auto-reject the idea that very capable systems might be right around the corner.
Yes! Also, making a NotebookLM podcast about your own work is similarly startling to the uninitiated. The voices sound very human.
I want to prank-call people over video using my face and voice (and the text content itself) generated in real time.
An AI company I’d never heard of, called AGI, Inc., has a model called AGI-0 that reportedly achieved 76.3% on OSWorld-Verified. That would qualify as human-level computer use, at least by that benchmark, and it appeared on the official OSWorld-Verified leaderboard. It does seem like they trained on the benchmark, though, which could explain some of the result. I’m curious to see someone test this model independently.
This is a large jump from the previous state of the art, which has been climbing rapidly since Claude Sonnet 4.5’s September 29th release. At that point, Claude achieved 61.4% on OSWorld-Verified; a scaffolded GPT-5 reached 69.9% on October 3rd. Now, on October 21st, AGI-0, seemingly a frontier computer-use model, has outpaced them all, surpassing the human baseline in the process.
AI 2027 projected 65% on OSWorld for August 2025, frontier models scoring 80% privately by December 2025, and models at that level available publicly by April 2026. AGI-0’s 76.3% on OSWorld-Verified closes about three-quarters of the gap from the expected August capability (65%) to the 80% mark, while only about a third of the time from August 2025 to the projected April 2026 public release has elapsed. Assuming this isn’t just benchmark overfitting, the real world is even with or ahead of AI 2027 on this computer-use benchmark.
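For anyone who wants to sanity-check that arithmetic, here is a minimal back-of-the-envelope sketch in Python, assuming a simple linear reading of the projection window; the exact start and release dates are my own assumptions based on the months named above.

```python
# Back-of-the-envelope check of AGI-0's OSWorld-Verified score against the
# AI 2027 projections cited above. Scores and dates come from the post; the
# linear "fraction of the way" framing is a simplifying assumption.
from datetime import date

projected_aug_2025 = 65.0   # AI 2027's projected score for August 2025
projected_target = 80.0     # projected score for Dec 2025 (private) / Apr 2026 (public)
agi0_score = 76.3           # AGI-0's reported OSWorld-Verified score

score_progress = (agi0_score - projected_aug_2025) / (projected_target - projected_aug_2025)

start = date(2025, 8, 1)    # assumed start of the projection window
public = date(2026, 4, 1)   # assumed public-release date for an 80% model
today = date(2025, 10, 21)  # date of the AGI-0 result

time_progress = (today - start).days / (public - start).days

print(f"score progress: {score_progress:.0%}")  # ~75% of the 65% -> 80% gap closed
print(f"time progress:  {time_progress:.0%}")   # ~33% of the Aug 2025 -> Apr 2026 window elapsed
```

On these assumptions the script prints roughly 75% score progress against roughly 33% of the window elapsed, which is where the ahead-of-schedule claim comes from.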
Even more notably, AI 2027 projected that this 80% mark would be met by “Agent 1,” its hypothetical leading agentic model at the end of 2025. It would be surprising for a frontier model from a new company to get this close when none of the main players’ (OpenAI, Anthropic, Google) own models has scored much above 61%. A lot to be curious and skeptical about here.
Update: the result has been removed from the OSWorld-Verified leaderboard, but AGI, Inc. still claims the score, and their results remain downloadable.