Thoughts on sharing information about language model capabilities

Core claim

I believe that sharing information about the capabilities and limits of existing ML systems, and especially language model agents, significantly reduces risks from powerful AI—despite the fact that such information may increase the amount or quality of investment in ML generally (or in LM agents in particular).

Concretely, I mean to include information like: tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, discussions of the qualitative strengths and weaknesses of agents, and information about agent design that may represent small improvements over the state of the art (insofar as that information is hard to decouple from evaluation results).


ARC Evals currently focuses on evaluating the capabilities and limitations of existing ML systems, with an aim towards understanding whether or when they may be capable enough to pose catastrophic risks. Current evaluations are particularly focused on monitoring progress in language model agents.

I believe that sharing this kind of information significantly improves society’s ability to handle risks from AI, and so I am encouraging the team to share more information. However, this issue is certainly not straightforward, and in some places (particularly in the EA community where this post is being shared) I believe my position is controversial.

I’m writing this post at the request of the Evals team to lay out my views publicly. I am speaking only for myself. I believe the team is broadly sympathetic to my position, but would prefer to see a broader and more thorough discussion about this question.

I do not think this post presents a complete or convincing argument for my beliefs. The purpose is mostly to outline and explain the basic view, at a similar level of clarity and thoroughness to the arguments against sharing information (which have mostly not been laid out explicitly).

Added 8/1: Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing that kind of information; this post will explain why I believe that.

Accelerating LM agents seems neutral (or maybe positive)

I believe that having a better understanding of LM agents increases safety[1] through two channels:

  • LM agents are an unusually safe way to build powerful AI systems. Existing concerns about AI takeover are driven primarily by scaling up black-box optimization,[2] and increasing our reliance on human-comprehensible decompositions with legible interfaces seems like it significantly improves safety. I think this is a large effect size and is suggested by several independent lines of reasoning.

  • If LM agents are weak due to exceptionally low investment and understanding, this creates “dry tinder:” as incentives rise, investment will quickly rise and the low-hanging fruit will be picked. While there is some dependence on serial time, I think that increased investment in LM agents now will significantly slow down progress later.

I will discuss these mechanisms in more detail in the rest of this section.

I also think that accelerating LM agents will drive investment in improving and deploying ML systems, and so can reduce time available to react to risk. As a result I’m ambivalent about the net effect of improving the design of LM agents—my personal tentative guess is that it’s positive, but I would be hesitant about deliberately accelerating LM agents to improve safety (moreover I think this would be a very unleveraged approach to improving safety[3] and would strongly discourage anyone from pursuing it).

But this means that I am significantly less concerned about information about LM agent capabilities accelerating progress on LM agents. Given that I am already positively disposed towards sharing information about ML capabilities and limitations despite the risk of acceleration, I am particularly positive about sharing information in cases where the main cost is accelerating LM agents.

Improvements in LM agents seem good for safety

Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces. Progress in understanding LM agents seems relevant for improving agents built this way, while having at best marginal relevance for systems optimized end to end (to which I expect the “bitter lesson” to apply strongly) or for situations where individual ML invocations are just “cogs in the Turing machine.”

I think this kind of ML system seems great for safety:

  • If we give a system like AutoGPT a goal, it pursues that goal by taking individual steps that a human would rate highly based on their understanding of that goal. The LM’s guesses about what humans would do intervene at every step, and even current language models would avoid pursuing a sub-plan that humans would consider unacceptable. There is no constraint towards specifying measurable goals of the kind that lead to reward-hacking concerns. It now appears that LM agents could scale up to human-level usefulness before we start seeing any serious form of reward hacking.

  • If LM agents work better, then we will reach any given level of AI using weaker individual models (e.g. a level sufficient for AI to help with alignment, or to help enable or motivate a policy reaction). Deceptive alignment is directly tied to the complexity of the model that we are optimizing end-to-end, and so it becomes significantly less likely the more we rely on interfaces that are designed by humans rather than optimized.

  • Setting aside individual threat models, decomposing complicated tasks down into simpler parts that humans understand is a natural way to improve safety. It makes it easier to tell what the overall system is doing and why, and provides many additional levers to intervene on how the model thinks or acts.

  • Turning our attention from threat models to positive visions, I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] and in my view this is looking more and more plausible over time.

So at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.

As mentioned before, I do think that progress in LM agents will increase overall investment in ML, and not just LM agent performance. And to a significant extent I think the success of LM agents will be determined by technical factors rather than how much investment there is (although this also makes me more skeptical about the acceleration impacts). But if it weren’t for these considerations I would think that progress on LM agents would be clearly and significantly positive.

“Overhang” in LM agents seems risky

Right now people are investing billions of dollars in scaling up LMs. If people only invested millions of dollars in improving LM agents, and such agents were important for overall performance, then I think we would be faced with a massive “overhang:” small additional investments in LM agents could significantly improve overall AI performance.

Under these conditions, increasing investment to speed up LM agents today is likely to slow down LM agents in the future, picking low-hanging fruit that would instead have been picked later when investment increased. If I had to guess, I’d say that accelerating AI progress by 1 day today by improving LM agents would give us back 0.5 days later. (This clawback comes not just from future investments in general agents, but also from the domain-specific investment needed to make a valuable product in any given domain.) I am sympathetic to a broad range of estimates, from 0 to 0.9.[5]

This leaves us with an ambiguous sign, because time later seems much more valuable than time now:[6]

  • As AI systems become risky, I expect technical work on risk to increase radically. I expect us to study failures in the lab, and work on systems more closely analogous to those we care about. If I had to guess I’d say that having an extra day while AI systems are very powerful is probably 2x better than a day now (and in many cases much more).

  • Policy responses to AI seem to be driven largely by the capabilities of AI systems. I think that having 1 extra day for policy progress today would be way better than 2 extra days a few years ago, and I expect multiple further doublings.[7]

So even if LM agents had no relevance for safety, I would feel ambivalent about whether it is good to speed up or slow them down. (I feel similar ambivalence about many forms of pause, and as I’ve mentioned I feel like higher investment in the past would quite clearly have slowed down progress now and would probably be net positive, but I think LM agents are an unusually favorable case.)
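
To make the arithmetic behind this ambivalence concrete, here is a minimal sketch. It assumes a simple linear accounting that is my own illustration rather than a worked-out model: each day of acceleration now costs one "present-day day," and the days clawed back later are worth more by some multiplier. The function name and parameters are hypothetical; the 0.5 clawback, the 0 to 0.9 range, and the 2x multiplier are the guesses stated above.

```python
def net_value_of_acceleration(days_now, clawback, later_multiplier):
    """Net value, in present-day days, of accelerating LM agents now.

    Assumed linear model (illustrative only):
      days_now         -- days of progress pulled forward today (a cost)
      clawback         -- fraction of those days regained later (0.5 is the guess above)
      later_multiplier -- how much more valuable a day later is than a day now
    """
    cost_now = days_now
    benefit_later = days_now * clawback * later_multiplier
    return benefit_later - cost_now


# Sweep the range of clawback estimates (0 to 0.9) against a few multipliers.
for clawback in (0.0, 0.5, 0.9):
    for multiplier in (1.0, 2.0, 4.0):
        net = net_value_of_acceleration(1.0, clawback, multiplier)
        print(f"clawback={clawback:.1f} multiplier={multiplier:.0f}x net={net:+.2f} days")
```

Under these assumptions a 0.5 clawback with a 2x multiplier is exactly break-even, which is why the sign is ambiguous: a somewhat higher clawback or multiplier makes acceleration net positive, and a lower one makes it net negative.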

If you told me that existing language models could already be transformative with the right agent design, I think this position would become stronger rather than weaker. I think in that scenario the overwhelmingly most important game is noticing this overhang and slowing down progress past GPT-4, and starting to get transformative work out of relatively safe modern ML systems rather than overshooting badly.

I think this overhang argument applies to some extent for most investments in 2023; for example if AI labs buy all the GPUs today then they will get an immediate boost by training bigger models next year, but the boost after that will require having TSMC build more GPUs and so will be much slower (and the one after that will require building new fabs and be much slower). I mentioned it in the previous section and do think it’s a major factor explaining why I place a lower premium on slowing down AI than other people. However I think it’s a more important factor for LM agents than for e.g. improving the efficiency of LMs or investing more in hardware.

Understanding of capabilities is valuable

I think that a broad understanding of AI capabilities, and how those capabilities are likely to change over time, would significantly reduce risks:

  • AI developers are less likely to invest adequately in precautions if they underestimate the capabilities of systems they build. For example, debates about AI lab security requirements are often (sensibly) focused on the potential harms from a leak, which are in turn dominated by estimates of the capabilities of current and near-future systems.

  • Dangerous capabilities of AI systems seem like a major driver of policy reactions. Unless we have major warning shots (e.g. an AI system taking over a datacenter) I believe information about capabilities will be an extremely important driver of policy reactions, and will be more central for determining policy than determining investment or researcher interest.

  • Even when AI developers attempt to behave cautiously, underestimating AI capabilities can lead them to fail. For example they may incorrectly believe AI systems can’t distinguish tests from the real world or can’t find a way to undermine human control. This kind of underestimation seems like one of the easiest ways for risk management to break down or for people to incorrectly conclude that their current precautions are adequate.

  • Beyond those specific effects I think our ability to handle risk depends in a bunch of more nebulous ways on how seriously it is taken by the ML research community and researchers at AI labs, and those reactions are tightly coupled. It is also sensitive to the range of defensible positions which can be used to justify reckless policies, and increasing clarity about capabilities makes it harder to defend unreasonable positions.

This factor seems especially large over the next few years, where most risk comes from the possibility that humanity is taken by surprise. I think this is the most important timeframe for individual decisions about sharing information, since the effects of current decisions will be increasingly attenuated over longer horizons.

Over the longer term I think the dangerous capabilities of AI systems will likely be increasingly clear. But I think better understanding still improves how prepared we are and reduces the risk of large surprises.

I think the importance of information about capabilities is pretty robust across worldviews:

  • I’ve laid out the inside view as I see it, which I find fairly compelling.

  • In the broader scientific community I think there is a strong (and in my view correct) presumption that more accurate information tends to reduce risk absent strong arguments to the contrary. This presumption in favor of measurement is somewhat weaker beyond the scientific community, but I think remains the prevailing view.

  • Although the MIRI-sphere has very different background views from mine, wild underestimation of model capabilities tends to play a central role in Eliezer’s stories about how AI goes wrong. (Although I think he would still be opposed to sharing information about capabilities.)

I think that significantly increasing and broadening an understanding of LM capabilities would very significantly decrease risk, but it’s hard to quantify this effect (because it’s hard to measure increases in understanding). Qualitatively, I believe that realistic increases in understanding could cut risk by tens of percent.

Information about capabilities is more impactful for understanding than speed

I think that more accurate information about LM capabilities and limitations can drive faster progress in two big ways:

  • If people are underestimating the competence of models, then correcting their mistake may cause them to invest more.

  • Better evaluations or understandings of limitations could inspire researchers to make more effective progress.

I think these are real effects. But combining with the unquantified estimate in the last section, if I had to make a wild guess I’d say the benefits from sharing information about ML capabilities are 5-10x larger than the costs from acceleration (even without focusing attention on LM agents).

Here are the main reasons why I think this acceleration cost is smaller than you might fear:

  • I think that people correctly estimating AI capabilities earlier will increase investment earlier, but that predictably makes it harder to scale up in the future and therefore slows down progress later.[8] This is part of why my estimate would only be a 15-30% reduction. But more importantly, I think that speed later matters much more than speed now, and so e.g. speeding up by 1 day now and getting back 0.5 days later would probably be a positive trade. As a result I’m ambivalent about the net sign of increasing investment now, and think it’s harmful but much less bad than you might expect.[9] This is the same dynamic discussed in the “Overhang” section above.

  • There are currently a significant number of people who are very excited about scaling up AI as quickly as they can, and these people are already pushing hard enough to start running into significantly diminishing returns from complementary resources like compute, investment, training new scientists, and so on. So the effects of getting more people excited are sublinear. In contrast, policy reactions (both government policy and lab policy) seem more sensitive to consensus and clarity.

  • Sharing information about capabilities disproportionately informs people outside of the field. People working with language models have a much stronger intuitive picture of their capabilities and limitations, such that legible information is particularly valuable to people who don’t work with these systems. The people most responsible for pushing progress faster seem the least sensitive to this information.

  • There are a lot of factors that affect progress, and most of them just can’t have giant impacts. Progress depends on the availability of talented researchers and engineers, time to train them, technical advances in computer hardware, availability of funding at large tech companies and from startup investors, demonstrated commercial demand, quality of research ideas and so on. I think the outsized role of understanding capabilities in addressing risk is exceptional and this is a high bar to try to meet given how many factors are at play. When I hear people expressing the most extreme concerns about acceleration (whether by training new people, contributing new ideas, creating media attention, popularizing a product…), I often feel like the purported sources of variance explain way more than 100% of the variance in the pace of progress.

  • I think that recent progress in LMs has been primarily driven by increasing scale and general improvements in LM efficiency, and has not been particularly sensitive to researchers having a detailed picture of the capabilities and limitations of models. So I think that most of the acceleration effect is flowing through increased interest and investment rather than improvements in research quality.

  • Compared to some people I’m more skeptical about the contingency of innovations like chain of thought or LM agents (and therefore I’m more skeptical about the impact of capabilities understanding that could motivate such work). For example, chain of thought (“inner monologue”) appeared multiple times independently in the wild within 2 months of the GPT-3 release (and was discussed internally within OpenAI before GPT-3 was trained). It appears to have failed to spread more broadly due to a combination of limited access to GPT-3 and not yet working very well. Similarly, tools like LangChain and AutoGPT seem to have caught on before they actually work in practice, and to have been developed and explored several times independently. I think that in practice these kinds of general innovations will usually be eclipsed by domain-specific schlep by people rolling out LM products in specific domains.

I think we should make this decision based on best estimates of costs and benefits

One could have a variety of procedural objections to sharing information even if the benefits appear to exceed the cost. I don’t think these apply strongly, and therefore I think we should make this decision based on object level analysis:

  • I think that improving collective understanding of ML capabilities is a big deal according to many worldviews, and that an attempt to limit public understanding could easily have catastrophic consequences. So this isn’t a matter of comparing serious costs on one worldview to small benefits on another; it’s a decision where both directions have significant stakes and there is no default “safe” option.

  • I think that small groups shouldn’t take unilateral actions contrary to consensus estimates of risk. But in this case I believe that a majority of researchers who think about AI safety are supportive of sharing information (and possibly even a majority of effective altruists, who I expect to be one of the most skeptical groups). And I don’t think there is any clear articulation of the case against sharing such information that engages qualitatively with the kinds of considerations raised in this post. I would become more skeptical about sharing information if I learned that there actually was majority opposition in some relevant community.

  • It may seem suspicious that I am describing so many arguments in favor of sharing information and that all the considerations coincidentally point in the same direction. But this isn’t a coincidence. I was excited about incubating the Evals project, and encouraged them to work on dangerous capability evaluations, and was supportive of them working on evaluating LM agents, because I thought that these activities have large positive effects that easily outweigh downsides. It’s true that in my view this is an unusually lopsided issue, but that’s causally upstream of the decision to work on it rather than being a post hoc rationalization.

  1. ^

    In this post I focus mostly on the risk of AI takeover, because the community worried about takeover is the primary place where I have encountered a widespread belief that measurement of general LM capabilities may be actively counterproductive.

  2. ^

    It’s conceivable that LM agents pose novel risks that are as large or larger than existing threat models—but to the extent that is the case I am if anything even more excited about exploring such agents sooner, and even more skeptical about buying time to e.g. do (apparently-misguided) alignment research today.

  3. ^

    Because ML capabilities researchers will already be seeking out and implementing these improvements.

  4. ^

    Most explicitly, see the 2016 discussion of the bootstrapping protocol in ALBA, in which models trained by RLHF solve harder problems by using chain of thought and task decomposition. See also this early 2015 post, which has some distracting simplifications and additional facts, but which presents LM agents in what I would say is essentially the same form that seems most plausible today. This isn’t really related to the thrust of this post and is mostly me just feeling proud of my picture holding up well, but I do think the history here is somewhat relevant to understanding my view: this isn’t something I’m making up now; this is comparing the real world to expectations from many years ago and seeing that LM agents look even more likely to play a central role.

  5. ^

    Note that this number could be negative. The average across all forms of progress is 0, since accelerating everything by 1 day should decrease timelines by exactly 1 day. I think that areas with high investment are naturally below zero and those with lower investment are naturally above zero, because low-investment areas will expand more easily later. I think probably all software progress and ML-specific investment is above 0, and that improvements in the quantity and quality of compute are well below 0.

  6. ^

    Another reason that time later is more valuable than time now is that AI systems themselves will be doing a large fraction of the cognitive work in the future. But this consideration cancels out when you do the full analysis, both increasing the value of time later and making it harder to get back time later.

  7. ^

    One counterargument is that almost all the policy value comes from policy research driven primarily by altruists who aren’t significantly more likely to work on AI as risks become more concrete and systems become more capable. I don’t personally find this very plausible—it seems like the quantity of research has in fact increased, and that the quality and relevance of that research has also improved significantly.

  8. ^

    I’ve seen the opposite asserted—that momentum means that accelerating now just accelerates more in the future. I don’t think this issue is completely straightforward and it would be a longer digression to really get to the bottom of it. But right now I feel like on-paper analysis and observations of the last 10 years of AI both point pretty strongly towards this conclusion, and I haven’t really seen the alternative laid out.

  9. ^

    By analogy, it seems to me that if humanity had trained GPT-4 for $250M in 2012, using a larger ML community and a larger number of worse computers, the net effect would be a reduction in risk. Making further progress from that point would be harder and easier to regulate, since scaling up spending would become prohibitively difficult and further ML progress would only be possible with large amounts of labor. On top of that, effective AI populations would be smaller since AI would already be using a much larger fraction of humanity’s computing hardware, further computing scaleup would be increasingly bottlenecked, and an intelligence explosion would plausibly proceed several times more slowly. One could argue that increasing preparedness between 2012 and 2022 was enough to compensate for this factor, but that doesn’t look right to me. I am more ambivalent about the effects of acceleration at this point and think it is negative in expectation, because I think society is now investing much more heavily in trying to understand and adapt to the AI we already have and we’re already on track to scale up through the next 5 orders of magnitude distressingly quickly.