Not exactly comparable to the AI Village’s open-ended long-horizon tasks above, but it’s interesting that Cursor found that
“GPT-5.2 models are much better at extended autonomous work: following instructions, keeping focus, avoiding drift, and implementing things precisely and completely. Opus 4.5 tends to stop earlier and take shortcuts when convenient, yielding back control quickly.”
on their project to build a web browser from scratch (GitHub), totaling >1M LoC across 1k files, running “hundreds of concurrent agents” for a week. This is the opposite of what I’d have predicted just from how much more useful Claude is than comparable-benchmark models. Also: “GPT-5.2 is a better planner than GPT-5.1-codex, even though the latter is trained specifically for coding”. What’s up with that?