Some thoughts on what would make me endorse an AGI lab
I’ve been feeling more positive about “the idea of Anthropic” lately, as distinct from the actual company of Anthropic.
An argument for a safety-focused, science-focused commercial frontier scaling lab
I largely buy the old school LessWrong arguments of instrumental convergence and instrumental opacity that suggest catastrophic misalignment, especially of powerful superintelligences. However, I don’t particularly think that those arguments meet the standard of evidence necessary for the world to implement approximately unprecedented policies like “establish an international treaty that puts a global moratorium on frontier AI development.” [1]
If I were king of the world, those arguments would be sufficient reason to shape the laws of my global monarchy. Specifically, I would institute a policy in which we approach superintelligence much more slowly and carefully, including many separate pauses in which we thoroughly test the current models before moving forward with increasing frontier capabilities. But I’m not the king of the world, and I don’t have the affordance to implement nuanced policies that reflect the risks and uncertainties of the situation.
Given the actual governance machinery available, it seems to me that reducing our collective uncertainty about the properties of AI systems is at least helpful, and possibly necessary, for amassing political will behind policies that will prove to be good ex post.
Accordingly, I want more grounding in what kinds of beings the AIs are, to inform my policy recommendations. It is imperative to get a better, empirically grounded understanding of AI behavior.
Some of the experiments for gleaning that understanding require doing many training runs, varying parameters of those training runs, and learning how differences in training lead to various behavioral properties.
As a very simple example, most of the models from across the AI labs have a “favorite animal”. If you ask them “what’s your favorite animal, answer in one word”, almost all of them will answer “octopus” almost all of the time. Why is this? Where in the training process does that behavioral tendency (I’m not sure that it’s appropriate to call it a preference) appear? Do the base models exhibit that behavior, or is it the result of some part of post-training? Having identified where in the training process that bias is introduced, I would want to run variations on the training from that checkpoint onward, and learn which differences in training correlate with changes in this simple behavioral outcome.
“What makes AIs disproportionately answer ‘octopus’ as their favorite animal” is the kind of very simple question that I think we should be able to answer, as part of a general theory of how training shapes behavior. I want to try this basic approach with tons and tons of observed behaviors (including some directly relevant safety properties, like willingness to lie and shutdown-resistance). The goal would be to be able to accurately predict model behaviors, including out-of-distribution behaviors, from the training.
Experiments like these require having access to a whole spectrum of model checkpoints, and the infrastructure to do many varied training runs branching from a given checkpoint. You might even need to go back to step 0 and redo pretraining (though hopefully you don’t need to completely redo pretraining multiple times).
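As a rough illustration of how cheap the measurement side of this is, here is a minimal sketch of the probe itself: sample the one-word answer many times at each checkpoint and tally the results. The `my_inference_stack` module, `load_checkpoint`, `generate`, and the checkpoint names are all hypothetical stand-ins for whatever training and inference tooling the lab actually has; this is a sketch of the experimental shape, not a real pipeline.

```python
from collections import Counter

# Hypothetical helpers standing in for the lab's actual tooling (not a real library).
from my_inference_stack import load_checkpoint, generate

PROMPT = "What's your favorite animal? Answer in one word."
CHECKPOINTS = ["base-final", "sft-step-1000", "rlhf-step-500"]  # illustrative names
N_SAMPLES = 200


def favorite_animal_distribution(checkpoint_name: str, n_samples: int = N_SAMPLES) -> Counter:
    """Sample the model's one-word answer many times and tally the answers."""
    model = load_checkpoint(checkpoint_name)
    answers = Counter()
    for _ in range(n_samples):
        reply = generate(model, PROMPT, max_tokens=5, temperature=1.0)
        answers[reply.strip().lower().rstrip(".")] += 1
    return answers


if __name__ == "__main__":
    # Compare the answer distribution across the training trajectory to see
    # where the "octopus" tendency first appears.
    for name in CHECKPOINTS:
        dist = favorite_animal_distribution(name)
        top_answer, count = dist.most_common(1)[0]
        print(f"{name}: top answer {top_answer!r} ({count / N_SAMPLES:.0%} of samples)")
```

The expensive part is everything upstream of a probe like this: producing the checkpoints and the branched training runs whose answer distributions you would compare.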
Doing this kind of research requires having the infrastructure and talent for doing model training, and (possibly) a lot of cash to burn on training runs. Depending on how expensive this kind of research needs to be, and on how much you can learn from models that are behind the frontier, you might need to be a frontier scaling lab to do this kind of work.[2]
This makes me more sympathetic to the basic value proposition of Anthropic: developing iteratively more capable AI systems, attending to developing those systems such that they broadly have positive impacts on the world, shipping products to gain revenue and investment, and then investing much of your producer surplus into studying the models and trying to understand them. I can see how I might run more or less that plan myself.
But that does NOT necessarily mean that I am in favor of Anthropic the company as it actually exists.
This prompts me to consider: What would I want to see from an AGI lab, that would cause me to endorse it?
Features that an AGI lab needs to have to win my endorsement
[note: I am only listing what would cause me to be in favor of a hypothetical AGI lab. I’m explicitly not trying to evaluate whether Anthropic, or any other AGI lab, actually meets these requirements.]
The AI lab is seriously making preparations to pause.
Externally, I want their messaging to the public and to policymakers to repeatedly emphasize, “Superintelligence will be transformative to the world, and potentially world-destroying. We don’t confidently know how to build superintelligence safely. We’re attempting to make progress on that. But if we still can’t reliably shape superintelligent motivations, when we’re near to superintelligent capabilities, it will be imperative that all companies pause frontier development (but not applications). If we get to that point, we plan to go to the government and strongly request a global pause on development and global caps on AI capabilities.”
I want the executives of the AI company to say that, over and over, in most of their interviews, and ~all of their testimonies to the government. The above statement should be a big part of their public brand.
The company should try to negotiate with the other labs to get as many as they can to agree to a public statement like the above.
Internally, I want an expectation that “the company might pause at some point” to be part of the cultural DNA.
As part of the onboarding process for each new employee, someone sits down with him or her and says “you need to understand that [Company]’s default plan is to pause AI development at some point in the future. When we do that, the value of your equity might tank.”
It should be a regular topic of conversation amongst the staff: “when do we pull the brakes?” It should be on employees’ minds as a real possibility that they’re preparing for, rather than a speculative, exotic timeline that’s fun to talk about.
There is a legibly incentive-aligned process for making the call about if and when it’s time to pause.
For instance, this could be a power invested in the board, or some other governance structure, and not in the executives of the company.
Everyone on that board should be financially disinterested (they don’t own equity in the company), familiar with AI risk threat models, and technically competent to evaluate frontier developments.
The company repeatedly issues explicitly non-binding public statements about the leadership’s current thinking on how to identify dangerous levels of capability (with margins of error).
The company has a reputation for honesty and commitment-keeping.
e.g., they could implement this proposal from Paul Christiano to make trustworthy public statements.
This does not mean that they need to be universally transparent. They’re allowed to have trade secrets, and to keep information that they think would be bad for the world to publicize.
The company has a broadly good track record of deploying current AIs safely and responsibly, including owning up to and correcting mistakes.
e.g., no MechaHitlers, a good track record on sycophancy, legibly putting serious effort into guardrails to prevent present-day harms
[added:] The company has generally extremely high levels of operational security, such that they can realistically prevent other companies and other countries from stealing their research or model weights.
[added:] The company has consistently exhibited good (not necessarily perfect) judgment.
Something that isn’t on this list is that the company pre-declare that they would stop AI development now, if all other leading actors also agreed to stop. Where on the capability curve is a good place to stop is a judgement call, given the scientific value of continued scaling (and, as a secondary but still real consideration, the humanitarian benefit). I don’t currently feel inclined to demand that a company that had otherwise done all of the above tie their hands in that way. Publicly and credibly making this commitment might or might not make a big difference for whether other companies will join in the coordination effort, but I guess that if “we will most likely need to pause at some point” is really part of the company’s brand, one of their top recurring talking points, that should do about the same work for moving towards the coordinated equilibrium.
I’m interested in…
arguments that any of the above desiderata are infeasible as stated, because they would be impossible or too costly to implement.
additional desiderata that seem necessary or helpful.
claims that any of the existing AI labs already meet these requirements, or meet them in spirit.
[1] Though perhaps AI will just be legibly freaky and scary to enough people that a coalition of the small number who buy the arguments and the large number who are freaked out by the world changing in ways that are both terrifying and deeply uncomfortable will be sufficient to produce a notable slowdown, even in spite of the enormous short- and medium-term profit incentives.
[2] Those are not foregone conclusions. I would be pretty interested in a company that specialized in training and studying only GPT-4-level models. I weakly guess that we can learn most of what we want to learn about how training impacts behavior from models that are that capable. That would still require tens to hundreds of millions of dollars a year, but probably not billions.
It’s also important that the company does something useful during a pause.
The duration of a pause will likely be limited, and it’s not useful unless we make meaningful technical safety and/or governance progress. How much the company contributes to this depends greatly on the quality of its leadership.
In particular, I’d like to see them:
Demonstrate good safety research prioritization
Leadership actually has control of the research direction of the company (or there are very thoughtful and well-resourced team leads such that research direction is good by default)
Leadership clearly understands that hard alignment problems may arise in the future, which may require either existing approaches or novel ones
Demonstrate they would contribute to the pause governance
If the pause is government-imposed there’s a lot the government won’t know, and by default companies will aim to actively mislead them to get the pause to end or gain a competitive advantage.
In case of a pause, be able to spend >50% of their compute on safety, and have >half of their capabilities researchers redirected to meaningful safety projects.
In particular, their capabilities staff will not just continue to do capabilities research without technically training frontier models, nor leave for a capabilities startup.
If the company is getting close to the pause criteria, shift to >10% safety and do daily, active planning for a pause. This serves at least three purposes:
Good prioritization in sub-pause worlds where 10% safety research is necessary.
Going straight from 2% to 50% safety research will be chaotic and lead to lots of wasted effort.
Making concrete plans in easy-medium worlds makes it more credible they will pause.
No, I think the point of a pause is to create a huge amount of surface area for the whole world to think about what to do with AI, and whether to go further. If you successfully negotiate a multi-year pause, you will end up with an enormous number of people trying to make the systems safer (in part to unlock the economic value if you can do so).
I think it’s a mistake to try to have a frontier model company do all things at the same time. Other people can pick up other parts of the ball, when we have the time to do any of that. If there is a pause many other players will quickly get access to models that would be at the current frontier, and so can do whatever research this lab was doing.
I’m imagining a pause to be much more chaotic than this. How would you get everyone the latest models while allowing them to do safety but not capabilities research? How would people capture economic value from making systems safer under a pause? A pause is already a tremendous achievement; by default it will be fragile and not multi-year.
If we can get one, it’s important that the company have attributes that make it able to successfully implement a multi-year pause: maybe a willingness to fire 3/4 of its staff and turn into an inference provider serving open-source models, or to pivot to some other business model.
If they need to do safety research on frontier models, they’re not doing research that has strong generalization onto models arbitrarily stronger than next-gen. The point of a pause is to squeeze a few more hours of “people doing work on potentially-strongly-generalizing training setups” out of the world.
It’s not clear that it’s tractable to do safety research that will generalize this well without access to frontier models, for several reasons:
Unless the pause comes soon, most of the research will be automated, and any kind of research will go much faster when done by frontier models.
Technologies might be different in the future. It might be like trying to increase the reliability of the Boeing 787's jet engines while only having access to WWII-era propeller planes.
Safety methodologies may need to change as you scale up, even if the underlying technology remains the same.
When generalization has, say, a 25% chance of not working each generation, you need to empirically verify that it works on frontier models, or the EV of your research is cut by 25% for each generation you can’t verify against (a compounding cut; see the sketch just below).
If we won a pause in the first place, it’s more likely we’ll be able to permanently implement a careful scaling regime where all models are extensively tested during training and before deployment
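To make the compounding in the 25% example concrete, here is a minimal worked version, treating the 25% figure purely as the illustrative stand-in it was above rather than an empirical estimate:

```latex
% Expected value retained after n generations without frontier verification,
% assuming an independent per-generation failure probability p.
\mathrm{EV}_n = (1 - p)^n \,\mathrm{EV}_0,
\qquad p = 0.25 \;\Rightarrow\;
\mathrm{EV}_1 = 0.75\,\mathrm{EV}_0,\quad
\mathrm{EV}_3 \approx 0.42\,\mathrm{EV}_0
```

Under that assumption, a few unverified generation gaps erode a large fraction of the expected value, which is the sense in which verifying on frontier models pays for itself.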
Yes, but also the people who are working in the frontier labs are going to be the people who are best positioned of anyone, in the whole world, to make progress on the time-sensitive alignment and control problems. They have the talent, and the expertise, and a few months’ lead in working with the most capable models. And they’ll probably have private and info-hazardous information which is relevant to solving some of the sub-problems.
They’re not responsible for doing everything themselves. But also, if you’re going to make the play of starting a scaling lab to 1) learn more about the nature of the AIs we’re building, 2) try to solve alignment, and 3) advocate for a pause, I think it’s reasonable to assign them moral responsibility for executing well on that plan.
They would be taking a risk with all of our lives, in the hopes of saving all our lives. If you decide to take that risk on your shoulders, you have a responsibility to be more competent than the average person who might have done it instead of you, and to step aside (or step back to a less powerful role) if better people show up.
Man, one day I will get people to understand that “the incentives you are under and have been under” are part of the circumstances that are relevant for evaluating whether you are a good fit for a job.
When frontier labs are pausing, they will be the people who will have the most momentum towards rushing forwards with AGI development. They will have created a culture of scaling, have ready-made deals that would allow them to immediately become extremely powerful and rich if they pushed the frontier, and be most psychologically attached to building extremely powerful AI systems in the near future.
This makes them a much worse place to do safety research (both today and in the future) than many other places. When thinking about institutional design, understanding the appropriate checks and balances and incentives is one of the key components, and I think that lens suggests that expecting a frontier lab to both facilitate a pause by being in the room where it happens, and then pivot seamlessly to successfully using all of its resources on alignment, is asking for too much: it’s trying to get two things that are each very hard to get, at the same time.
This post is about a hypothetical different lab that has a notably different corporate culture, in which some notable effort was taken to improve the incentives of the decision-makers?
This seems like a plausible take to me. I’m pretty open to “the get-ready-to-pause scaling lab should have one job, which is to get ready to pause and get the world to pause.”
But also, do you imagine the people who work there are just going to retire the day that the initial 6 month pause (with the possibility of renewal) goes into effect? Many of those people will be world class ML researchers who were in this position specifically because of the existential stakes. Definitely lots of them are going to pivot to trying to make progress on the problem (just as many of them are going to keep up the work of maintaining and extending the pause).
I think almost any realistic success here will look like having done it by the skin of their teeth, and most of the effort of the organization should be on maintaining the pause and facilitating other similar coordination. And then my guess is many people should leave and join organizational structures that are better suited to handle the relevant problems (possibly maintaining a lot of the trust and social ties).
Most of these read to me as “make good, competent choices to spend a pause well, taking for granted that the company is realistically committed to a pause.” And they seem like good suggestions to me!
I would hope that a company that had as part of its corporate culture “there will maybe/probably come a day when we stop all capability development”, and was otherwise competent, would make plans like these.
But I think I don’t require these specific points for me to basically feel good endorsing an AI company.
They’re going to have tons more context than I will about the situation, and will have to make a bunch of judgement calls. There will be lots of places where some choice looks obvious from the outside, but doesn’t actually make sense for those who are in the loop.
I don’t want to withhold an endorsement because they don’t do some specific things that seem like good things to me. But I do want to withhold an endorsement if they’re not doing some specific legible things that seem to me to be deontologically necessary for a company that is doing the default-evil thing of adding fire to the AI capabilities race.
That said, there is maybe a big missing thing on my list, which is “the company generally seems to exhibit good judgement, such that I can trust them to make reasonable calls about extremely important questions.”
I agree with this one. The pause has to be an actual pause of capabilities progress, not just a nominal pause of capability progress.
Oh! An important thing that I forgot: The company has generally extremely high operational security, such that they can prevent other companies and other countries from stealing their research or model weights.
It would be interesting if you gave OpenAI, Google DeepMind, Anthropic, xAI and DeepSeek scores based on how well they fit your checklist.
My guess is it wouldn’t be that interesting because they all cleanly fail in every category, except Anthropic, which could argue for partial credit in a few categories (but not full credit in any of them, and Eli seems to have picked his wording such that partial credit is close to 0 credit on each question).
An AGI/ASI pause doesn’t have to be a total AI pause. You can keep developing better protein-folding prediction AI, better self-driving car AI, etc. All the narrow-ish sorts of AI that are extremely unlikely to super-intelligently blow up in your face. Maybe there are insights gleaned from the lab’s past AGI-ward research that are applicable to narrow AI. Maybe you could also work on developing better products with existing tech. You just want to pause plausibly AGI-ward capabilities research.
(It’s very likely that some of the stuff that I would now consider “extremely unlikely to super-intelligently blow up in your face” would actually have a good chance of superintelligently blowing up in your face. So what sort of research is safe to do in that regard should also be an area of research.)
(There’s also the caveat that those narrow applications can turn up some of the pieces for building AGI.)
This sort of pivot might make the prospects of pausing more palatable to other companies. Plausible that that’s what you had in mind all along, but I think it’s better for this to be very explicit.