Human-level Diplomacy was my fire alarm

Epistemic status: AI is my hobby, not my job.

Meta just released their new Diplomacy AI, Cicero. Sure, it was several different modules kludged together in a somewhat clever way, with no fundamentally new ideas or massive scaling, but it was damn impressive.

What surprises me most about Cicero is its ability to go all the way from data input to natural language output with relatively high reliability. The model makes a battle plan, uses natural language to interface with humans, then uses the information gained from said interface to alter its plan. And it does this consistently enough to outmatch the average human Diplomacy player. Maybe it doesn’t represent anything truly new in terms of architecture or scaling. But that’s exactly why I’m so impressed.

It doesn’t have the “human spark”. But the lesson that we’ve been learning over the past few decades is that very impressive tasks, from human-level chess to Go to art, can be accomplished without said spark. I’m starting to suspect that human-level AGI is entirely possible with something like Cicero. If systems like Cicero can remain coherent over longer time scales, we could be seeing the automation of many complex tasks, including the holy grail that is AI research. Writing scientific papers? I’m already starting to use automated hypothesis discovery. Let’s break down how the median academic paper actually gets written.

  1. Somewhere in the universe, an academic feels publication pressure. They start looking at their connections, resources, and potential funding.

  2. They start bouncing ideas and try to figure out how to write the highest impact paper they can with the lowest use of resources. Analysis of other people’s data is very tempting at this point. AI can probably be used to search through public data for underutilized but useful datasets.

  3. Funding applications are written. This part can be automated.

  4. They start pulling threads, using their specialty knowledge to come up with the best novel hypotheses they can, and pick the one with the lowest p-value. No multiplicity testing, of course. Automated hypothesis discovery can help greatly here. A language model can probably also be used to estimate future citations for each hypothesis-paper.

  5. Hypothesis + data → set of core ideas that is then expanded into an abstract. I think current transformer models can already do this.

  6. The paper is actually written. Fine-tuned models have published short papers already. A rough template, multiple generations, and a bit of human help can probably crank out a full-length paper without suspicion.

  7. The paper is sent to an impact-factor 3 journal, gets bounced a few times between the editor and the academic, and is finally accepted after the recommended citations get added. Editing a paper according to a series of recommendations sounds like something that specialized transformer models are doing already.

  8. Publication! Remember that the median paper has 0 citations.

  9. Recommendations for follow-up experiments can also be obtained with AI analysis.
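
To put a rough number on the "lowest p-value, no multiplicity testing" move in step 4, here is a minimal sketch. It assumes every candidate hypothesis is actually false and that the tests are independent, which real datasets rarely satisfy; the function names and the 0.05 cutoff are my own illustrative choices, not anything from an actual pipeline.

```python
def false_positive_chance(num_hypotheses, alpha=0.05):
    """Chance that at least one of `num_hypotheses` independent tests of a
    true null comes back with p < alpha (assumption: independent tests)."""
    return 1.0 - (1.0 - alpha) ** num_hypotheses


def bonferroni_threshold(num_hypotheses, alpha=0.05):
    """Per-test cutoff that keeps the overall false-positive rate at or below
    alpha. Bonferroni is one standard correction, chosen here for simplicity."""
    return alpha / num_hypotheses


if __name__ == "__main__":
    # Compare the uncorrected chance of a spurious "hit" with the corrected cutoff.
    for k in (1, 5, 20, 100):
        chance = false_positive_chance(k)
        cutoff = bonferroni_threshold(k)
        print(f"{k:>3} hypotheses: P(spurious hit) = {chance:.3f}, "
              f"Bonferroni cutoff = {cutoff:.5f}")
```

Under those assumptions, trying 20 hypotheses gives roughly a 64% chance that at least one clears p < 0.05 by luck alone. Automated hypothesis discovery makes trying thousands cheap, so the correction matters more, not less.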

I fully anticipate an explosion of low-impact papers in the next few years, especially from China. With even worse English than normal. But how long until this starts improving? A bad paper is one with a boring or flawed hypothesis and bad results (p=0.050, after messing with outliers). The same data set and methods, but a more insightful hypothesis, can generate a good paper (one that actually contributes to human knowledge, acts as a stepping stone for future researchers, etc.). I have crawled down a rabbit hole of an interesting idea, realized it didn’t pan out halfway through, and then had to keep working with it since I still had to finish that paper and it was too late to start over. In scientific research, quantity is quality, since you can just chuck the bad ideas in the trash and keep the gems. AI systems tend to go from very subhuman to superhuman performance very fast. I expect a significant increase in scientific productivity over the next 5 years, much of which will be attributable to AI.

And of course, the same applies to AI research. Some say that the last 30 years of AI research have been researchers stumbling around in the dark, trying different paradigms with trial and error until they hit upon DL and scaling. I don’t think that’s quite true, but surely the testing of increasingly experimental architectures can be automated.

Part of the reason I’m writing this is that I promised myself I would “pull my fire alarm” when AI achieved human-level open press Diplomacy. But I also want it on the record that my AGI timelines are short (50% by 2030). I’ve also shifted my investments toward AI stocks accordingly.


  1. Fully automated cab service in most tier-1 Chinese cities, with 80%+ coverage. I will also encounter one at least once a week. 80% by 2026.

  2. 50%+ of a paper I’ve published (by word count) is written by an AI. 80% by 2025, conditional on my continued work in academic-adjacent fields.

  3. An AI paradigm as performance-enhancing as transformers is discovered by AI search. 30% by 2030.