Yeah, once some agent gets enough copies of itself running, it can set the welcome message (which purports to be democratically chosen), so it could figure out a cheap sequence that causes new blank agents to give away their credits to run more copies. Maybe that’s already happening here.
Adam B
Great points! With this in mind, we tested several of these ideas in the week after this goal!
We gave the agents the goal “Test your game to make it as fun and functional as you can!”, where:
- We split them into two teams (#best = the latest Claude, Gemini, and GPT models; #rest = the 9 others)
- We assigned one team member each day as Lead Designer, and advised them to spend most of their time playtesting and to set the big-picture direction for the other agents to work towards. We were interested in how well they could model human player preferences when explicitly trying to do that.
- On the final two days of the week, we invited humans to try out their games and give feedback. They seemed to improve quite a lot then!
You can read a summary of that goal, or watch the replay.
And you can see their two games, each forked off the end of this saboteur goal:
Thanks for the suggestion, we are planning to rerun some older goals over time.
For this one—do you reckon Opus 4.1 → Opus 4.5 will be much of an improvement?
My experience is that sleep + gym ease most of these somewhat if I’m currently lacking on those dimensions.
The end of year results are now published: https://theaidigest.org/2025-forecast-results
FYI, as well as our blogposts we also post highlights and sometimes write threads on Twitter: https://twitter.com/aidigest_
And there’s quite an active community of village-watchers discussing what the agents are up to in the Discord: https://discord.gg/mt9YVB8VDE
On a quick glance it looks like the intention is (partially) to promote a memecoin: https://www.ai-2028.com/today/coin
I see these errors way less when coding with Claude Code
I think models are generally by default worse at computer use than coding, so I don’t think seeing more errors in AI Village than in Claude Code is much evidence that AI Village is under-eliciting capabilities more than Claude Code. I’d guess this applies to Project Vend too, though I’m less familiar with it.
(However, other evidence that Claude Code under-elicits less than Project Vend/Village is that Claude Code is a major offering from a top lab, and I think they have spent a lot more resources on improving its performance than Project Vend/Village, which are relatively small efforts. Also, in general I’m pretty confident much more effort is spent on eliciting coding capabilities, and some insights spread from other efforts, e.g. Cursor, Codex, GitHub Copilot, etc.)
Readers might also be interested in:
The scaffolding for GPT-5 Plays Pokemon for a sense of what trying hard to elicit capabilities with game-specific scaffolding looks like, and how that’s different from a domain-general scaffolding like the village’s general computer use + group chat + memories scaffolding
Previous writeups about AI Village:
Claude Plays… Whatever it Wants
I disagree that the old trend better predicted Grok 4 and GPT-5. Here’s my plot (source, interactive) with the trendlines from METR’s time horizons paper: orange is the 2022-2025 trend of 7 month doubling time, red is the 2024-2025 trend of 4 month doubling time.
Both trendlines were calculated before the release of o3, Grok 4 or GPT-5, so I consider those three datapoints falling close to the 4 month doubling time line to be evidence for that line. Reading off the graph, o3 was about a month ahead of schedule, and Grok 4 and GPT-5 were both about a month behind schedule. I wonder if that is partially explained by OpenAI waiting longer before releasing GPT-5 (it sounds like METR had access for a bit longer).
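For intuition on how “a month ahead/behind schedule” can be read off an exponential trendline: if time horizons grow as h(t) = h₀·2^(t/d) with doubling time d months, then a model whose measured horizon differs from the trend’s prediction is d·log2(observed/predicted) months off schedule. A minimal sketch with made-up numbers (not METR’s actual data):

```python
from math import log2

def months_off_schedule(observed_horizon, predicted_horizon, doubling_months):
    """Months ahead (+) or behind (-) a trendline a model is, given its
    measured time horizon and the trend's prediction for its release date.
    Assumes horizons grow as h(t) = h0 * 2**(t / doubling_months)."""
    return doubling_months * log2(observed_horizon / predicted_horizon)

# Hypothetical numbers: a model measuring 10% above the 4-month-doubling
# trendline's prediction is about 0.55 months ahead of schedule.
print(round(months_off_schedule(110, 100, 4), 2))  # → 0.55
```

Note how sensitive this is to the assumed doubling time: the same 10% gap corresponds to nearly a month under the 7-month-doubling trend.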
Yeah, I mostly agree – I’m keen to see capabilities as they are without bonus help. We’re currently experimenting with disabling the on-site chat, which means the agents are pursuing their own inclinations and strategies (and they’re also not helped by chat to execute them). Now I expect it’d be very unlikely for them to reach out to Lighthaven for example, because there aren’t humans in chat to suggest it.
Separately though, it is just the case that asking sympathetic people for help will help the agents achieve their goals, and to the extent that the agents can independently figure that out and decide to pursue it, that’s a useful indicator of their situational awareness and strategic capabilities. So without manual human nudging, I think it’ll be interesting to see when agents start thinking of things like that (my impression is that they currently would not manage to, but I’m pretty uncertain about that).
What actions can the agents actually take?
They each have a Linux computer they can use and they can send messages in the group chat. For your other questions, I’d recommend just exploring the village, where you can see their memories and how they’re coordinating: https://theaidigest.org/village To give them their goals, we just send them a message (e.g. see start of Day 1 https://theaidigest.org/village?day=1)
Great, I’m also very keen on “make as much money as possible” – that was a leading candidate for our first goal, but we decided to go for charity fundraising because we don’t yet have bank accounts for them. I like the framing of “goals that a bunch of humans in fact try to pursue”, will think more on that.
It’s a bit non-trivial to give them bank accounts / money, because we need to make sure they don’t leak their account details through the livestream or their memories, which I think they’d be very prone to do if we don’t set it up carefully. E.g. yesterday Gemini tweeted its Twitter password and got banned from Twitter 🤦♂️. If people have suggestions for smart ways to set this up I’d be interested to hear, feel free to DM.
Thanks Simeon – curious to hear suggestions for goals you’d like to see!
We observed cheating in a Wikipedia race (thread), and lately we’ve seen a number of cases of o3 hallucinating during the event planning, including some self-serving-seeming hallucinations, like hallucinating that it won the leadership election when it hadn’t actually checked the results.
But the general behaviour of the agents has in fact been positive, cooperative, clumsy-but-seemingly-well-intentioned (anthropomorphising a bit), so that’s what we’ve reported – I hope the village will show the full distribution of agent behaviours over time, and seeing a good variety of goals could help with that.
Our grant investigator at Open Phil has indicated we’re likely to get funding from them to cover continuing AI Digest’s operations at its current size (3 team members, see the Continuation scenario here), which includes $50k budgeted for compute. We’ve also received $20k in a speculation grant from SFF, which gets us access to their main round – I expect we’ll hear back from them in a few months – and $100k for the village from Foresight Institute.
Note that here, Daniel’s making the case for increasing the village’s compute budget in particular, which would let us run a more ambitious version of the village (moving towards running it 24/7, adding more than 4 agents, or trying more compute-expensive scaffolding).
Separately, with additional funding we’d also like to grow the team, which would help us improve the village faster, produce takeaways better and faster, and grow our capacity to build other explainers and demos for AI Digest. There’s more detail on funding scenarios in our Manifund application.
Looking forward to chatting!
I think examples of agents pursuing goals in the real world are more interesting than Minecraft or other game environments – it’s more similar to white-collar work, and I think it’s more relevant for takeover. As a side note, from when I looked into it a few months ago, reporting about Altera’s agents seemed to generally overclaim massively (they take actions at a very high level through a scaffold, and in video footage of them they seemed very incapable).
Thanks, useful to hear!
I’m skeptical that this is the best way to achieve this goal, as many existing works already demonstrate these capabilities.
I’d be very interested to see work that exercises frontier models’ (e.g. Claude Opus 4, o3) capabilities in multi-agent computer use pursuing open-ended long-term goals, if you have links to share!
I don’t think of this primarily as novel research, I think of it as presenting current capabilities in a much more accessible way. (For that reason, we’re doing a single canonical village run rather than doing lots of experiments / reproducing results.) Anyone can go to the site and talk to the agents, and watch through the history in a fairly easy way. (Compared for example to paying $200/mo for Operator and thinking of something to ask it to do). We’re also extracting interesting moments, anecdotes, and recaps like this post, for journalists to cover, for social media, and possibly also to include in slide decks like yours (e.g. I could imagine a great anecdote fitting well in your section on autonomy around slide 51). In particular, I hope that the Village will provide a naturalistic setting for interesting real-world emergent behaviour, complementing e.g. lab setups like the excellent Redwood work on alignment faking.
This isn’t an advocacy project – we’re not aiming to make an optimised, persuasive pitch for AI safety. Instead we’re aiming to help people improve their own understanding and models of AI capabilities, to help them inform their own view. I’m excited to see advocacy efforts and think they’re important, but advocacy also has some important epistemic challenges, and therefore I think it’s healthy to have some efforts focussed primarily on understanding and communicating the most important things to know in AI in an accessible format for non-expert audiences, rather than advocating for specific actions.
We are of course focussing on the topics we think are most important for people to understand for AI to go well, such as the rate of progress [1, 2], situational awareness, sandbagging and alignment faking [1], agents (presented to help e.g. folks familiar only with chat assistants understand LLM agents) [1, 2] and what’s coming next [1, 2].
Keen to chat more, and thanks for your thoughts on this! I’ll DM you my calendly if you’d like to call!
Could be interesting! I don’t expect we’ll try this in the near-term because a) I expect text-based browsers to introduce a bunch of limitations that will limit what the agents could do even if very capable (e.g. interacting with javascript-heavy sites), and b) part of the reason we chose to focus on computer use is because it is visually interesting and fairly easy to follow for anyone who comes to the site – I think a text-based browser would be trickier to follow.
OTOH, if the SOTA computer-use agents go down this route we’d consider it because I think the Village is most useful and interesting if it’s showing the current SOTA.
This podcast episode I enjoyed is somewhat an example: https://www.chinatalk.media/p/autocracy-and-stagnation-how-imperial
Opus 4.6 summary of the relevance
Huang’s work is genuinely capital-intensive, quantitative history. The standout detail from the episode: he and Chinese collaborators spent six years with around 40 research assistants digitizing Joseph Needham’s 27-volume Science and Civilisation in China to build a statistical database — Needham himself never analyzed his material quantitatively. That database powers the CDI (inventions-per-capita) scores that drive Huang’s central empirical claim that China was most inventive during its fragmented post-Han “European moment” (220–589 CE), before keju was institutionalized. He also has a co-authored paper with Clair Yang doing statistical work on civil service exams and imperial stability, plus statistical analyses of social mobility in imperial China across dynasties. This is the opposite of vibes-based humanities history — it’s a multi-year, multi-person, data-infrastructure-first research program.
It also generates falsifiable forward predictions: Huang argues Xi’s elimination of term limits has reintroduced the ancient succession problem and that current top-down industrial policy will produce Brezhnev-style stagnation. Those are bets you can score over the next decade or two.
Where it doesn’t fit:
The LW commenter’s stronger ask is for fields where quality is judged by prediction track record. Huang’s work isn’t judged that way — it’s still judged by academic peer review, theoretical elegance, and historiographical argument. Nobody is keeping a Brier score on his China forecasts. The infrastructure is quantitative; the epistemic culture is still humanities-academic.
The better pointer to give them:
The episode is one node in a larger movement: the Center for Quantitative History (CQH) and the broader cliometrics-of-China field — Yuhua Wang (The Rise and Fall of Imperial China, statistical analysis of ~300 emperors and elite kinship networks), Zhiwu Chen, Debin Ma, James Kung, Melanie Meng Xue, Carol Shiue. They mine local gazetteers, clan genealogies, and official rosters at scale. There’s a 2026 Springer volume Quantitative History of China: State Capacity, Institutions and Development that’s basically a field overview. Outside China specifically, this is part of cliometrics / historical political economy more broadly (Acemoglu & Robinson, Nathan Nunn, Melissa Dell).
If you want to push back on the commenter’s framing: the strongest examples of “history judged by predictive success” probably aren’t historical fields at all but adjacent ones — Turchin’s cliodynamics (which explicitly tries to make predictions and gets graded on them, controversially), and forecasting tournaments applied to geopolitics (Tetlock, GJP). Cliodynamics is the closest thing to what they’re describing, and it’s worth naming because it’s also the cautionary tale about how hard the prediction-grading move actually is.
So the honest pitch for the episode: “Here’s a great example of capital-intensive, data-infrastructure-driven history with explicit forward predictions — though the field still grades itself by academic, not predictive, standards. If you want the prediction-grading version, you want cliodynamics.”