It’s also important that the company does something useful during a pause.
The duration of a pause will likely be limited, and it’s not useful unless we make meaningful technical safety and/or governance progress during it. How much the company contributes to this depends heavily on the quality of its leadership.
In particular, I’d like to see them:
Demonstrate good safety research prioritization
Leadership actually has control of the research direction of the company (or there are very thoughtful and well-resourced team leads such that research direction is good by default)
Leadership clearly understands that hard alignment problems may arise in the future, and that these may be addressable with existing approaches or may require novel ones
Demonstrate they would contribute to the pause governance
If the pause is government-imposed, there’s a lot the government won’t know, and by default companies will aim to actively mislead it in order to get the pause ended or to gain a competitive advantage.
In case of a pause, be able to spend >50% of their compute on safety, and have >half of their capabilities researchers redirected to meaningful safety projects.
In particular, their capabilities staff should neither continue doing capabilities research that stops just short of technically training frontier models, nor leave for a capabilities startup.
If the company is getting close to the pause criteria, shift to >10% safety and do daily, active planning for a pause. This serves at least three purposes:
Good prioritization in sub-pause worlds where 10% safety research is necessary.
Going straight from 2% to 50% safety research will be chaotic and lead to lots of wasted effort.
Making concrete plans in easy-medium worlds makes it more credible they will pause.
No, I think the point of a pause is to create a huge amount of surface area for the whole world to think about what to do with AI, and whether to go further. If you successfully negotiate a multi-year pause, you will end up with an enormous number of people trying to make the systems safer (in part to unlock the economic value if you can do so).
I think it’s a mistake to try to have a frontier model company do all things at the same time. Other people can pick up other parts of the ball, when we have the time to do any of that. If there is a pause, many other players will quickly get access to models that would be at the current frontier, and so can do whatever research this lab was doing.
I’m imagining a pause to be much more chaotic than this. How would you get everyone the latest models while allowing them to do safety but not capabilities research? How would people capture economic value from making systems safer under a pause? A pause is already a tremendous achievement; by default it will be fragile and not multi-year.
If we can get one, it’s important that the company has attributes that make it able to successfully implement a multi-year pause: maybe a willingness to fire three quarters of its staff and turn into an inference provider serving open-source models, or to pivot to some other business model.
If they need to do safety research on frontier models, they’re not doing research that has strong generalization onto models arbitrarily stronger than next-gen. The point of a pause is to squeeze a few more hours of “people doing work on potentially-strongly-generalizing training setups” out of the world.
It’s not clear it’s tractable to do safety research that will generalize this well without access to frontier models, for several reasons:
Unless the pause comes soon, most of the research will be automated, and any kind of research will go much faster when done by frontier models.
Technologies might be different in the future. It might be like trying to increase the reliability of the Boeing 787’s jet engines while only having access to WWII-era propeller planes.
Safety methodologies as you scale up even if the technology remains the same.
When generalization has, say, a 25% chance of not working each generation, you need to empirically verify it works on frontier models, or the EV of your research is cut by 25% (see the worked example after this list).
If we won a pause in the first place, it’s more likely we’ll be able to permanently implement a careful scaling regime where all models are extensively tested during training and before deployment.
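To make the expected-value point above concrete, here is a toy calculation; the 25% figure is just the illustrative number from the comment, and the compounding assumption (no re-verification between generations) is mine.

```latex
% Toy EV calculation for the generalization-failure point above.
% Assumptions (illustrative, not from the discussion): a fixed per-generation
% probability p = 0.25 that a safety result stops working, independence across
% generations, and no empirical re-verification on newer models.
% Let V be the value of the research if it still holds at deployment time and
% n the number of generations since it was last verified empirically:
\[
  \mathbb{E}[\text{value}] \;=\; (1 - p)^{n}\, V
\]
% n = 1: 0.75V (the "cut by 25%" case);  n = 2: ~0.56V;  n = 4: ~0.32V.
% Verifying on each new frontier generation keeps n at 1, which is the stated
% reason for wanting access to frontier models.
```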
Unless the pause comes soon, most of the research will be automated, and any kind of research will go much faster when done by frontier models.
Plausible to me, yes. But my experience with frontier models is that it’s possible-but-hard to prompt them into the kind of work I think is relevant. Perhaps that’s good enough to make your point hold, though. It does weaken my original argument if this line is correct.
Technologies might be different in the future. It might be like trying to increase the reliability of the Boeing 787’s jet engines while only having access to WWII-era propeller planes.
If you’re doing work that isn’t guaranteed to work for any future approach, you’re not doing work that can possibly be durable. At some point in some deployment of some model, the derivative of intelligence has spikes large enough to go through the safety barriers in a way that produces an adversarial glider. That adversarial glider can potentially come up with the least-convenient new technique. You need to be durable in the face of that, which means somehow being sure that you can detect those potential gliders before, and ideally also after, they happen.
To put it another way, what I’m trying to determine is whether the thing you’re describing—incrementally doing control work in tandem with capability growth—is qualitatively a durable thing to be doing. That’s what I want a pause to figure out. The kind of work you’re describing is object-level relative to the kind of work I think needs to be done. Qualitatively, how do you design, in 1930, an aircraft research program that can be known ahead of time to never produce a fatality, given that you know you don’t know all the relevant dynamics?
Safety methodologies as you scale up even if the technology remains the same.
I don’t know how to parse this line.
When generalization has, say, a 25% chance of not working each generation, you need to empirically verify it works on frontier models, or the EV of your research is cut by 25%.
Tell me how you’re going to make the probability of generalization failure quickly asymptote to 0 and I’ll be happy, even if your answer is a description of an ongoing maintenance research program, as long as you can tell me why that research program produces an asymptote to zero. It’s okay if it takes multiple rounds and involves frontier models, it’s okay if it involves ongoing maintenance, but it’s not okay if the meta level of “will my ongoing maintenance become reliable?” is itself an ongoing question.
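One minimal way to formalize the asymptote condition being asked for here (my framing of the request, not the commenter’s own formalism): treat each round of the ongoing maintenance program as having some probability of a generalization failure slipping through.

```latex
% Hypothetical per-round failure probabilities p_k for round k of an ongoing
% maintenance program (notation introduced only for illustration).
%
% Constant per-round risk never becomes safe in the long run:
\[
  p_k = p > 0 \quad\Rightarrow\quad
  \Pr[\text{no failure through round } n] = (1-p)^{n} \to 0
  \text{ as } n \to \infty .
\]
% If maintenance drives the per-round risk down fast enough that the risks are
% summable (e.g. p_k = p_0 r^k with 0 < r < 1), lifetime risk is bounded by the
% union bound and can be made small:
\[
  \Pr[\text{any failure, ever}] \;\le\; \sum_{k=0}^{\infty} p_k
  \;=\; \frac{p_0}{1-r} .
\]
% "Tell me why the program produces an asymptote to zero" is, roughly, a request
% for an argument that we are in the second regime rather than the first.
```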
If we won a pause in the first place, it’s more likely we’ll be able to permanently implement a careful scaling regime where all models are extensively tested during training and before deployment.
Agreed. I want to know how to be sure we’re in that regime. That’s what I would call “solving” the alignment problem. Once we can know whether we’re in that regime, I would feel much less at risk. Right now it looks like we don’t even know what the full list of requirements for that regime is, though we’re a lot closer to knowing than we were even a year ago. I’m hopeful that it can be done.
But if we can never identify reliable meta-level safety principles with appropriate conditionals, so that, if their safeguards ever stop applying, the response is to reliably and automatically pause until new safety principles are identified, then I can never feel fully comfortable.
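As a toy sketch of the conditional structure being described, here is what that gating rule might look like in code; every name and check below is hypothetical and purely illustrative, not anyone’s actual proposal or system.

```python
# Toy illustration of "if a safeguard's assumptions ever stop applying,
# automatically pause until new, validated safety principles replace it".
# All names here are hypothetical; this sketches the control flow only.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SafetyPrinciple:
    name: str
    # Returns True while the assumptions behind this principle still hold.
    assumptions_hold: Callable[[], bool]


def scaling_allowed(principles: List[SafetyPrinciple]) -> bool:
    """Permit further capability work only while every principle's assumptions hold."""
    violated = [p.name for p in principles if not p.assumptions_hold()]
    if violated:
        print(f"Pause triggered: assumptions no longer hold for {violated}. "
              f"Resume only after replacement principles are validated.")
        return False
    return True


# Example usage with a made-up principle whose assumption has failed:
if __name__ == "__main__":
    demo = [SafetyPrinciple("eval-coverage-still-representative", lambda: False)]
    assert scaling_allowed(demo) is False
```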
Incidentally, this feels like as fine a time as any to mention I still think scale-down experiments are a critical test bed for all scale-up safety principles.
Yes, but also the people who are working in the frontier labs are going to be the people who are best positioned of anyone, in the whole world, to make progress on the time-sensitive alignment and control problems. They have the talent, and the expertise, and a few months’ lead in working with the most capable models. And they’ll probably have private and info-hazardous information which is relevant to solving some of the sub-problems.
They’re not responsible for doing everything themselves, but I also think that if you’re going to make the play of starting a scaling lab to 1) learn more about the nature of the AIs we’re building, 2) try to solve alignment, and 3) advocate for a pause, it’s reasonable to assign them moral responsibility for executing well on that plan.
They would be taking a risk with all of our lives, in the hopes of saving all our lives. If you decide to take that risk on your shoulders, you have a responsibility to be more competent than the average person who might have done it instead of you, and to step aside (or step back to a less powerful role) if better people show up.
Yes, but also the people who are working in the frontier labs are going to be the people who are best positioned of anyone, in the whole world, to make progress on the time-sensitive alignment and control problems.
Man, one day I will get people to understand that “the incentives you are under and have been under” are part of the circumstances that are relevant for evaluating whether you are a good fit for a job.
When frontier labs are pausing, they will be the ones with the most momentum towards rushing forward with AGI development. They will have created a culture of scaling, have ready-made deals that would allow them to immediately become extremely powerful and rich if they pushed the frontier, and be most psychologically attached to building extremely powerful AI systems in the near future.
This makes them a much worse place to do safety research (both today and in the future) than many other places. When thinking about institutional design, understanding the appropriate checks, balances, and incentives is one of the key components, and I think that lens of analysis suggests that trying to get a frontier lab that both facilitates a pause by being in the room where it happens, and then pivots seamlessly to successfully using all its resources on alignment, is asking for too much; it’s trying to get two things that are each very hard to get at the same time.
When frontier labs are pausing, they will be the ones with the most momentum towards rushing forward with AGI development. They will have created a culture of scaling, have ready-made deals that would allow them to immediately become extremely powerful and rich if they pushed the frontier, and be most psychologically attached to building extremely powerful AI systems in the near future.
This post is about a hypothetical lab that has a notably different corporate culture, one in which deliberate effort was taken to improve the incentives of the decision-makers?
trying to get a frontier lab that both facilitates a pause by being in the room where it happens, and then pivots seamlessly to successfully using all its resources on alignment, is asking for too much; it’s trying to get two things that are each very hard to get at the same time.
This seems like a plausible take to me. I’m pretty open to “the get-ready-to-pause scaling lab should have one job, which is to get ready to pause and get the world to pause.”
But also, do you imagine the people who work there are just going to retire the day that the initial 6-month pause (with the possibility of renewal) goes into effect? Many of those people will be world-class ML researchers who were in this position specifically because of the existential stakes. Definitely lots of them are going to pivot to trying to make progress on the problem (just as many of them are going to keep up the work of maintaining and extending the pause).
But also, do you imagine the people who work there are just going to retire the day that the initial 6-month pause (with the possibility of renewal) goes into effect?
I think almost any realistic success here will look like having done it by the skin of their teeth, and most of the effort of the organization should go to maintaining the pause and facilitating other similar coordination. And then my guess is that many people should leave and join organizational structures that are better suited to handle the relevant problems (possibly maintaining a lot of the trust and social ties).
Most of these read to me as “make good, competent choices to spend a pause well, taking for granted that the company is realistically committed to a pause.” And they seem like good suggestions to me!
I would hope that a company that had as part of its corporate culture “there will maybe/probably come a day where we stop all capability development”, and was otherwise competent, would make plans like these.
But I don’t think I require these specific points in order to feel basically good about endorsing an AI company.
They’re going to have tons more context than I will about the situation, and will have to make a bunch of judgement calls. There will be lots of places where some choice looks obvious from the outside, but doesn’t actually make sense for those who are in the loop.
I don’t want to withhold an endorsement because they don’t do some specific things that seem like good things to me. But I do want to withhold an endorsement if they’re not doing some specific legible things that seem to me to be deontologically necessary for a company that is doing the default-evil thing of adding fire to the AI capabilities race.
That said, there is maybe a big missing thing on my list, which is “the company generally seems to exhibit good judgement, such that I can trust them to make reasonable calls about extremely important questions.”
In particular, their capabilities staff should neither continue doing capabilities research that stops just short of technically training frontier models, nor leave for a capabilities startup.
I agree with this one. The pause has to be an actual pause of capabilities progress, not just a nominal one.