I’m very unconfident in the following, but to sketch my intuition:
I don’t really agree with the idea of serial alignment progress that is independent of capability progress. This is what I was trying to get at with
“AI capabilities” and “AI alignment” are highly related to each other, and “AI capabilities” has to come first in that alignment assumes that there is a system to align.
By analogy, nuclear fusion safety research is inextricable from nuclear fusion capability research.
When I try to think of ways to align AI, my mind points towards questions like “how do we get an AI to extrapolate concepts? How will it be learning? What will its architecture be?” etc. In other words, it just points towards capabilities questions. Since alignment turns on capability questions that we don’t yet have answers to, it doesn’t surprise me when many alignment researchers seem to spin their wheels and turn to doom and gloom—that’s more or less what I had thought would happen.
As an example of the blurred lines between capability and alignment: while I think it’s useful to have specific terms for inner and outer alignment, I also think that anyone who has worked with RL in a situation where they were manually setting the reward function was already aware of these ideas on some level. “Sometimes I mess up the reward function” and “sometimes the agent isn’t optimizing properly” are both issues encountered frequently. Basically, while many people in the alignment community seem to think of alignment as something that is cooked up entirely separately from capability research, I tend to think that a lot of it will develop naturally as part of day-to-day AI research with no specific focus on alignment.
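To make the first of those concrete, here’s a toy sketch of a mis-specified reward (the gridworld and the shaping bonus below are invented purely for illustration, not taken from any real project): the designer’s intent is “reach the goal”, but the reward as written also pays the agent for moving right, so a greedy agent chases that bonus and never finishes the task.

```python
# Toy sketch of the "sometimes I mess up the reward function" failure mode.
# Everything here is made up for illustration. The designer wants "reach the
# goal", but the reward they actually wrote adds a bonus for moving right,
# and the bonus ends up dominating.

GOAL = (4, 4)

def intended_reward(pos):
    # What the designer wants: +1 only for reaching the goal.
    return 1.0 if pos == GOAL else 0.0

def written_reward(prev_pos, pos):
    # What the designer wrote: the goal bonus plus a "helpful" shaping term
    # for moving right.
    return intended_reward(pos) + 0.1 * (pos[0] - prev_pos[0])

def greedy_step(pos):
    # Pick whichever single move maximizes the *written* reward.
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    return max(((pos[0] + dx, pos[1] + dy) for dx, dy in moves),
               key=lambda nxt: written_reward(pos, nxt))

pos = (0, 0)
for _ in range(20):
    pos = greedy_step(pos)

print("final position:", pos)                            # drifts right forever
print("intended reward earned:", intended_reward(pos))   # 0.0 -- never reaches (4, 4)
```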
As a thought experiment, let’s say that about 20% of current AI capability researchers are very concerned about AI alignment and get together to decide what to do for the next five years. They’re deciding between taking the stance “Capability work is fine right now! Go for it! Worry about alignment when we’re farther along!” and the stance “Let’s get out of capability and go into alignment instead. Capability research is dangerous and burning precious time.” What’s the impact of adopting each of these two positions?
The first is roughly the default position, and I’d expect that basically what we’ll see is AGI in the year 20XX, and that in the run-up to this we’ll see vastly increased interest in alignment work and also a significant blurring between “alignment” and “regular AI research”, since people want their home robots to not roll over their cat. We’ll also see all major AI research orgs and the AI community as a whole take existential risk from self-improving AGI a lot more seriously once modern SOTA AI systems start looking more and more like the kind of thing that could do that. Because of this, there’ll be a concerted effort to handle the situation appropriately, which has a good chance of success.
Option two involves slowing down the timeline by about 5-10%. Cutting the size of a field by 20% doesn’t slow progress that much, since there are diminishing returns to adding more researchers, and on top of that AI capability research is only half of what drives progress (the other half being compute). In return for this small slowdown, the AI researchers who are now going into alignment will initially spin their wheels due to the lack of anything concrete to focus on or any concrete knowledge of what the future systems will look like. When AGI does start approaching, the remaining AI capability community will take it much less seriously, having been selected specifically for that trait. Three years before the arrival of transformative AGI, alignment research is further along than it otherwise would have been, but AI capability researchers have gotten used to tuning alignment researchers out and there aren’t alignment-sympathetic colleagues around to say “hey, given how things are progressing, I think it’s time we start taking all that AI risk stuff seriously”. Prospects are worse than under option one.
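For what it’s worth, here’s the back-of-the-envelope arithmetic behind that 5-10% figure. The functional forms are assumptions I’m making purely to illustrate the shape of the argument (square-root returns to researcher count, progress split roughly evenly between research and compute), not anything precise:

```python
# Rough check on the "5-10% slowdown" figure, under illustrative assumptions:
# research output grows like researchers**0.5 (diminishing returns), and
# overall progress depends about equally on research and on compute, with
# compute unaffected by researchers leaving.

researchers_remaining = 0.8                      # 20% of capability researchers leave
research_output = researchers_remaining ** 0.5   # ~0.89 -> ~11% less research output
overall_progress = research_output ** 0.5        # compute half unchanged -> ~0.95

print(f"research output:  {research_output:.3f} of baseline")
print(f"overall progress: {overall_progress:.3f} of baseline "
      f"(~{(1 - overall_progress):.0%} slowdown)")
```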
So right now my intuition is that alignment will be very doable as long as it’s something that the AI community is taking seriously in the few years leading up to transformative AGI. The biggest risk seems to me to be some AI researchers at one of the leading research groups thinking “man, it sure would be cool if we could use the latest coding LLM combined with RL to make an AI that could improve itself in order to accomplish a goal” and setting it running without it ever occurring to them that this could go wrong. Given this, the suggestion that everyone concerned about alignment basically cedes the whole field of AI research (outside of this specific community, “AI capability research” is just called “AI research”) to people who aren’t worried about it seems like a bad idea.
Yeah, that might be a big idea. If you’re right that AI capabilities work and AI alignment work are the same thing, the problem is solved by definition. So if I’m understanding you correctly, capabilities and safety are highly correlated, and there can’t be situations where capabilities and alignment decouple.
So if I’m understanding you correctly, capabilities and safety are highly correlated, and there can’t be situations where capabilities and alignment decouple.
Not that far; more like they don’t decouple until more progress has been made. Pure alignment is an advanced subtopic of AI research that only becomes a viable field once enough progress has been made.
I’m not super confident in the above and wouldn’t discourage people from doing alignment work now (plus the obvious nuance that it’s not one big lump: there are some things that can be done later and some that can be done earlier), but the idea of alignment work that requires a whole bunch of work in serial, independent of AI capability work, doesn’t seem plausible to me. From Nate Soares’ post:
The most blatant case of alignment work that seems serial to me is work that requires having a theoretical understanding of minds/optimization/whatever, or work that requires having just the right concepts for thinking about minds.
This is the kind of thing that seems inextricably bound up with capability work to me. My impression is that MIRI tends to think that whatever route we take to get to AGI, as it moves from subhuman to human-level intelligence it will transform to be like the minds they theorize about (and they think this will happen before it goes foom), no matter how different it was when it started. So even if they don’t know what a state-of-the-art RL agent will look like five years from now, they feel confident they can theorize about what it will look like ten years from now. Whereas my view is that if you can’t get the former right, you won’t get the latter right either.
To the extent that intelligences will converge towards a certain optimal way of thinking as they get smarter, being able to predict what that looks like will involve a lot of capability work (“Hmm, maybe it will learn like this; let’s code up an agent that learns that way and see how it does”). If you’re not grounding your work in concrete experiments, you will end up with mistakes in your view of what an optimal agent looks like and no way to fix them.
A big part of my view is that we seem to still be a long way from AGI. This hinges on how “real” the intelligence behind LLMs is. If we have to take the RL route, then we are a long way away—I wrote a piece on this, “What Happened to AIs Learning Games from Pixels?”, which points out how slow the progress has been and covers the areas where the field is stuck. On the other hand, if we can get most of the way to AGI just with massive self-supervised training, then it starts seeming more likely that we’ll walk into AGI without having a good understanding of what’s going on. I think that the failure of VPT for Minecraft compared to GPT for language, and the difficulty LLMs have with extrapolation and innovation, mean that self-supervised learning won’t be enough without more insight. I’ll be paying close attention to how GPT-4 and other LLMs do over the next few years to see if they’re making progress faster than I thought, but I talked to ChatGPT and it was way worse than I thought it’d be.
I like your comments, 307th, and your linked post on RL SotA. I don’t agree with everything you say, but some of it is quite on point. In particular, I agree that ‘RL is currently rather unimpressive at achieving complicated goals in complex, wide-possible-action-space simulation worlds’. I agree that some fundamental breakthroughs are needed to change this, not just scaling of existing methods. I disagree that such breakthroughs will necessarily require many calendar years of research. I think the eyes of the big research labs will probably soon turn to focus more fully on tackling complex-world RL, and that it won’t be long at all before significant breakthroughs start being made.
I think that rather than thinking about research progress in terms of years, or even ‘researcher hours’, it’s more helpful to think of progress in terms of ‘research points’ devoted to the specific topic. An hour of a highly effective researcher at a well-funded lab, with a well-set-up research environment that makes new experiments easy to run, is worth vastly more ‘research points’ towards a topic than an hour of a compute-limited grad student without polished experiment-running code, without access to huge compute resources, and without much experience running large experiments over many variables.
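To put a toy version of that in code (all the multipliers below are made-up placeholders, not estimates): if the relevant factors compound multiplicatively, the per-hour gap between the two gets large quickly.

```python
# Toy version of the "research points per hour" framing. The multipliers are
# placeholders; the only point is that the factors compound, so the gap between
# a well-resourced lab researcher and a compute-limited grad student ends up
# being a large multiple rather than a small one.

def research_points_per_hour(effectiveness, compute, tooling, experience):
    return effectiveness * compute * tooling * experience

lab_researcher = research_points_per_hour(effectiveness=2, compute=5, tooling=3, experience=2)
grad_student = research_points_per_hour(effectiveness=1, compute=1, tooling=1, experience=1)

print(f"lab researcher: {lab_researcher} points/hour")  # 60
print(f"grad student:   {grad_student} points/hour")    # 1
```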