Readers might also be interested in:
The scaffolding for GPT-5 Plays Pokemon, for a sense of what trying hard to elicit capabilities with game-specific scaffolding looks like, and how that differs from domain-general scaffolding like the village's (general computer use + group chat + memories)
Previous writeups about AI Village:
I think models are, by default, generally worse at computer use than at coding, so I don't think seeing more errors in AI Village than in Claude Code is much evidence that AI Village under-elicits capabilities more than Claude Code does. I'd guess this applies to Project Vend too, though I'm less familiar with it.
(However, I do think there is other evidence that Claude Code under-elicits less than Project Vend/Village: Claude Code is a major offering from a top lab, and I think they have spent far more resources on improving its performance than has gone into Project Vend/Village, which are relatively small efforts. More generally, I'm pretty confident that much more effort is spent on eliciting coding capabilities, and some insights spread from other efforts, e.g. Cursor, Codex, GitHub Copilot, etc.)