One factor is how compute-constrained capabilities research is compared to safety research.
There’s been a lot of ink spilled recently on the degree to which capabilities research would be compute-bottlenecked in a scenario where automated AI researchers produce a surge of research labor, but everyone agrees that capabilities research requires a ton of compute. For safety research, I don’t feel well informed about how important compute is, but a lot of the safety research being done right now is not very compute-intensive. Off the top of my head, I’m not aware of safety work that involves experiments approaching the scale of training a frontier model. So you could have a scenario where the large majority of resources goes towards capabilities, but the big increase in safety labor in absolute terms differentially helps safety.
As a counterexample to the idea that safety work isn’t compute-constrained, here is a quote from an Anthropic interpretability paper, “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”:
We don’t have an estimate of how many features there are or how we’d know we got all of them (if that’s even the right frame!). We think it’s quite likely that we’re orders of magnitude short, and that if we wanted to get all the features – in all layers! – we would need to use much more compute than the total compute needed to train the underlying models.
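To get a feel for why that could be so expensive, here’s a rough back-of-envelope sketch; every number in it (model size, token counts, dictionary size, layer count) is an illustrative assumption of mine, not a figure from the paper:

```python
# Rough back-of-envelope: compute to train sparse autoencoders (SAEs) on every
# layer vs. compute to train the underlying model. All numbers are illustrative
# assumptions, not values from the Scaling Monosemanticity paper.

model_params = 5e10   # assumed parameter count of the underlying model
train_tokens = 5e12   # assumed pretraining token count
d_model = 8192        # assumed residual-stream width
n_layers = 64         # assumed number of layers
n_features = 3e7      # assumed SAE dictionary size per layer ("all" the features)
sae_tokens = 1e11     # assumed activation tokens used to train each SAE

# Standard rough rule of thumb: ~6 FLOPs per parameter per training token.
model_train_flops = 6 * model_params * train_tokens

# Each SAE is roughly an encoder + decoder of shape (d_model x n_features),
# i.e. ~2 * d_model * n_features parameters, trained on sae_tokens activations.
sae_params_per_layer = 2 * d_model * n_features
sae_train_flops = 6 * sae_params_per_layer * sae_tokens * n_layers

print(f"model training FLOPs:        {model_train_flops:.2e}")
print(f"all-layer SAE training FLOPs: {sae_train_flops:.2e}")
print(f"ratio (SAE / model):          {sae_train_flops / model_train_flops:.1f}x")
```

Under these made-up numbers, training dictionaries for every layer costs on the order of 10x the model’s own training run, which is the flavor of the claim in the quote.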
It seems like there are probably a lot of other compute-expensive experiments that would be helpful to run if safety compute were cheap for whatever reason.
When AI companies have human-level AI systems, will they use them for alignment research, or will they use them (mostly) to advance capabilities instead?
It’s not clear this is a crux for the automating alignment research plan to work out.
In particular, suppose an AI company currently spends 5% of its resources on alignment research and will continue spending 5% when it has human-level systems. You might think this suffices for alignment to keep pace with capabilities, since the alignment labor force gets more powerful at the same time as alignment gets more difficult (and more important) due to higher levels of capability.
This doesn’t mean the plan will necessarily work; that depends on the relative difficulty of advancing capabilities versus alignment. I’d naively guess that the probability of success just keeps going up the more resources you put into alignment.
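As a toy illustration of the constant-share point (all numbers made up): if automated researchers multiply everyone’s labor by the same factor, the ratio of alignment to capabilities effort stays fixed even as the absolute amount of alignment labor explodes, and whether that’s enough comes down to the relative-difficulty question above.

```python
# Toy sketch of the "constant 5% share" argument. Numbers are illustrative only.

alignment_share = 0.05
total_researchers = 1_000  # assumed current human research staff

for ai_labor_multiplier in [1, 10, 100, 1000]:  # AIs multiplying research labor
    effective_labor = total_researchers * ai_labor_multiplier
    alignment_labor = alignment_share * effective_labor
    capabilities_labor = (1 - alignment_share) * effective_labor
    print(f"x{ai_labor_multiplier:>5}: alignment={alignment_labor:>8.0f}, "
          f"capabilities={capabilities_labor:>9.0f}, "
          f"ratio={alignment_labor / capabilities_labor:.3f}")
```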
There are some reasons for thinking automation of labor is particularly compelling in the alignment case relative to the capabilities case:
There might be scalable solutions to alignment which effectively resolve the research problem indefinitely, while I expect capabilities progress looks more like continuously making better and better algorithms.
Safety research might benefit relatively more from labor (rather than compute) when compared to capabilities. Two reasons for this:
Safety currently seems relatively more labor-bottlenecked.
We can in principle solve a large fraction of safety/alignment with fully theoretical safety research, without any compute, while it seems harder to do purely theoretical capabilities research.
I do think that pausing further capabilities once we have human-ish-level AIs for even just a few years while we focus on safety would massively improve the situation. This currently seems unlikely to happen.
Another way to put this is that automating alignment research is a response in the following dialogue:
Bob: We won’t have enough time to solve alignment because AI takeoff will go very fast due to AIs automating AI R&D (and AI labor generally accelerating AI progress through other mechanisms).
Alice: Actually, as AIs are accelerating AI R&D, they could also be accelerating alignment work, so it’s not clear that accelerating AI progress due to AI R&D acceleration makes the situation very different. As AI progress speeds up, alignment progress might speed up by a similar amount. Or it could speed up by a greater amount due to compute bottlenecks hitting capabilities harder.
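One way to make Alice’s last sentence concrete is an Amdahl’s-law-style toy model: if some fraction of each field’s progress is hard-bottlenecked on compute rather than labor, a large labor surge speeds up the less compute-bound field more. The fractions and multiplier below are assumptions for illustration, not estimates.

```python
# Amdahl's-law-style toy model (illustrative assumptions only): a fraction of
# each research area's progress waits on compute and doesn't speed up when
# labor becomes abundant.

def speedup(labor_multiplier: float, compute_bound_fraction: float) -> float:
    """Overall speedup when only the non-compute-bound part scales with labor."""
    return 1 / (compute_bound_fraction + (1 - compute_bound_fraction) / labor_multiplier)

labor_multiplier = 100           # assumed surge from automated researchers
capabilities_compute_frac = 0.5  # assumed: half of capabilities progress waits on compute
alignment_compute_frac = 0.1     # assumed: alignment is mostly labor-bound today

print("capabilities speedup:", round(speedup(labor_multiplier, capabilities_compute_frac), 1))
print("alignment speedup:   ", round(speedup(labor_multiplier, alignment_compute_frac), 1))
```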
It always seems to me that the free variable here is why the lab would value spending X% on alignment. For example, you could have the model that “labs will only allocate compute to alignment insofar as misalignment is hampering capabilities progress”. While this would be a nonzero amount, the failure mode in this regime is that alignment research never gets some fixed compute allocation it can use to make open-ended progress; instead, progress is essentially bottlenecked on “how much misalignment is legibly impeding capabilities work”.
I wrote several attempts at a reply and deleted them all because none of them were cruxes for me. I went for a walk and thought more deeply about my cruxes.
I am now back from my walk. Here is what I have determined:
No reply I could write would be cruxy because my original post is not cruxy with respect to my personal behavior.
I believe the correct thing for me to do is to advocate for slowing down AI development, and to donate to orgs that cost-effectively advocate for slowing down AI development. And my post is basically irrelevant to why I believe that.
So why did I write the post? When I wrote it, I wasn’t thinking about cruxes. It was just an argument I had been thinking about that I’d never read before, and I thought someone ought to write it out.
And I’m not sure exactly who this post is a crux for. Perhaps if someone had a particular combination of beliefs about
the probability that slowing down AI development will work
the probability that bootstrapped alignment will work
where they’re teetering on the edge between “slowing down AI development is good” and “slowing down AI development is bad because it prevents bootstrapped alignment from happening”. My argument might shift that person from the second position to the first. I don’t know if any such person exists.
This is most relevant to slowing down AI development at particular companies—say, if DeepMind slows down and gets significantly surpassed by Meta, then Meta will probably do something that’s even less likely to work than bootstrapped alignment. But a global coordinated slowdown—which is my preferred outcome—does not replace bootstrapped alignment with a worse alignment strategy.
Even though it’s not cruxy, I feel like I should give an object-level response to your comment:
I agree with the denotation of your comment because it is well-hedged—I agree that 5% of resources might be enough to solve alignment. But it probably won’t be.
I think my biggest concern isn’t that AI alignment has no scalable solutions (I agree with you that it probably does have them); my concern is more that alignment is likely to be too hard / get outpaced by capabilities and we will have ASI before alignment is solved.
We can in principle solve a large fraction of safety/alignment with fully theoretical safety research, without any compute, while it seems harder to do purely theoretical capabilities research.
Not that I disagree (my intuition is that theoretical approaches are underrated), but this contradicts AI companies’ plans (or at least Anthropic’s). Anthropic has claimed that it needs to build frontier AI systems in order to do safety research on them; they seem to think they can’t solve alignment with theoretical approaches. More broadly, if they’re correct, then it seems to me (although it’s not a straightforward contradiction) that alignment bootstrapping won’t get a significant advantage from scaling up labor, because they will need increasing amounts of compute for alignment-related experiments.
FWIW I think your point is more reasonable than Anthropic’s position (I wrote some relevant stuff here). But I thought it was worthwhile to point out the contradiction.
One possibility is that at some point AI product capabilities will be constrained by compute cost.
At that point, alignment “features” could become a competitive advantage, so companies would invest much more in alignment.
I think something like “alignment features” is plausibly a huge part of the story for why AI goes well.
At least, I think it’s refreshing to take the x-risk goggles off for a second sometimes and remember that there is actually a huge business incentive to, e.g., solve indirect prompt injections, perfect robust AI decision-making in high-stakes contexts, or find the holy grail of compute-scalable oversight.
Like, a lot of the time there seems to be genuine ambiguity and overlap between “safety” research and normal AI research. The clean “capabilities”/“alignment” distinction is more map than territory sometimes.
Also, isn’t this already basically a thing? Companies already compete to have the “special sauce”; a lot of that is post-training stuff, so it massively overlaps with “alignment-coded” work. When does RL post-training go from being safety to “special sauce” to “alignment feature”, y’know?
Unfortunately, AI research is commercialized and heavily skewed by capitalist market needs,
so it’s still going to be all-in on trying to make an “AI office worker”, safety be damned, until this effort hits some wall, which I think is still plausible.