Thanks so much for the effort you're putting into this work! It looks particularly relevant to my current interest in understanding the different approximations and questions used in alignment, and what keeps us from the Grail of paradigmaticity.
Here is my more concrete feedback:
A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe. And then to identify specific problems in current or future AI systems that make these failure modes more likely to happen, and try to solve them now.
Given that AFAIK it’s Rohin who introduced the term in alignment, linking to his corresponding talk might be a good idea. I also like this drawing from his slides, which might clarify the explanation for more visual readers.
While I’m on the topic of threat models: you confused me at first, because “threat model” always makes me think of “development model”, so I expected a discussion of seed AI vs Prosaic AI vs Brain-based AGI vs CAIS vs alternatives. What you do instead is more a discussion of “risk models”, with a mention in passing that the first one traditionally came from the seed AI development model.
Of course that’s your choice, but neglecting a number of development models with a lot of recent work, notably Steve Byrnes’s brain-based AGI model, feels inconsistent with the sequence’s stated aim of “mapping out the AI Alignment research landscape”.
And having a specific story to guide what you do can be a valuable source of direction, even if you know it will ultimately be flawed in many ways. Nate Soares makes the general case well for having a specific but flawed story.
My first reaction when reading this part was “Hmm, that doesn’t seem to be exactly what Nate is justifying here”. After rereading the post, I think what disturbed me was my initial reading that you were saying something like “the correctness of a threat model doesn’t matter, you just choose one and do stuff”. That’s not what either you or Nate are saying; instead, the point is that spending all your time waiting for a perfect plan/threat model is less productive than taking the best option available, getting your hands dirty, and trying things.
Note that I think there is very much a spectrum between this category and robustly good approaches (a forthcoming post in this sequence). Most robustly good ways to help also address specific threat models, and many ways to address specific threat models feel useful even if that specific threat model is wrong. But I find this a helpful distinction to keep in mind.
This sounds to me like a better defense of threat model thinking, and I would like to read more about your ideas (especially the last two sentences).
When naively considered, this framework often implicitly thinks of intelligence as a mysterious black box that cashes out as ‘better able to achieve plans than us’, without much concrete detail. Further, it assumes that all goals would lead to these issues.
I agree with the gist of the paragraph, but “all goals” is an overstatement: both Nick Bostrom and Steve Omohundro note that some goals obviously don’t have power-seeking incentives, like the goal of dying as fast as possible. They say that most goals would have instrumental subgoals, which is the part that Richard Ngo criticizes and Alex Turner formalizes.
Understanding the incentives and goals of the agent, and how the training process can affect these in subtle ways
I feel like you should definitely mention Alex Turner’s work here, where he formalizes Bostrom’s instrumental convergence thesis.
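For readers who haven’t seen it, the rough shape of that formalization (my paraphrase, glossing over the normalization details) is to define the power of a state as the average optimal value attainable from it, over a distribution of reward functions:

$$\mathrm{POWER}_{\mathcal{D}}(s) \;\propto\; \mathbb{E}_{R \sim \mathcal{D}}\left[ V^{*}_{R}(s) \right]$$

and then to show that, for a broad class of reward distributions $\mathcal{D}$, optimal policies tend to prefer states with higher power (e.g. states that keep more options open or avoid shutdown), which gives a precise version of the instrumental convergence thesis.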
Limited optimization: Many of these problems inherently stem from having a goal-directed utility-maximiser, which will find creative ways to achieve these goals. Can we shift away from this paradigm?
Shouldn’t you include work on impact measures here? For example, this survey post and Alex Turner’s sequence.
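To give a flavour of what’s in there (again my paraphrase, with the scaling term omitted), attainable utility preservation penalizes the agent for changing how well it could pursue a set of auxiliary goals, compared to doing nothing:

$$R_{\mathrm{AUP}}(s,a) \;=\; R(s,a) \;-\; \lambda \sum_{i} \left| Q^{*}_{R_i}(s,a) - Q^{*}_{R_i}(s,\varnothing) \right|$$

where the $R_i$ are auxiliary reward functions, $\varnothing$ is a no-op action, and $\lambda$ trades off task reward against impact. The idea is that big power-grabbing actions change the attainable utility of almost every auxiliary goal, so they get penalized even though we never specify “impact” directly.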
A particularly concerning special case of the power-seeking concern is inner misalignment. This was an idea that had been floating around MIRI for a while, but was first properly clarified by Evan Hubinger in Risks from Learned Optimization.
Evan is adamant that the paper was an equal contribution from all coauthors, and so should be cited as by “Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant”.
Sub-Threat model: Inner Alignment
I feel like you’re sticking a bit too close to the paper’s framing, when there are more compact statements of the problem. Especially given your previous threat model, you could just say that inner alignment is about justifying power-seeking behavior and treacherous turns in the case where the AI is found by search instead of programmed by hand.
Plausibility of misaligned cognition: It is likely that, in practice, we will end up with networks with misaligned cognition
There’s also an argument that deception is robust once it has been found: making a deceptive model less deceptive would make it do more of what it really wants to do, and so get a worse loss, which means it isn’t pushed out of deception by SGD.
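One compact way to state that argument: suppose the deceptive policy $\pi_{\mathrm{dec}}$ behaves identically to an aligned policy everywhere on the training distribution. Then

$$\mathcal{L}(\pi_{\mathrm{dec}}) = \mathcal{L}(\pi_{\mathrm{aligned}}) \approx \min_{\pi} \mathcal{L}(\pi),$$

so the training gradient at $\pi_{\mathrm{dec}}$ is (close to) zero, and any local change that makes the model act more on its true objective strictly increases the training loss. SGD therefore pushes the model back towards deceptive behaviour rather than out of it.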
Better understanding how and when mesa-optimization arises (if it does at all).
One cool topic here is gradient hacking — see for example this recent survey.
Anecdotally, some researchers I respect take this very seriously—it was narrowly rated the most plausible threat model in a recent survey.
I want to note that this scenario looks more normal, which makes me think that by default anyone would find it more plausible than the Bostrom/Yudkowsky scenario due to normalcy bias. So I tend to discount this advantage when looking at which scenario people favor.
But this error-correction mechanism may break down for AI. There are three key factors to analyse here: pace, comprehensibility and lock-in.
I like this decomposition!
So, why would AI make cooperation worse/harder?
At least for Critch’s RAAPs, my understanding is that it’s mostly Pace that makes the difference: the process already exists, but it isn’t moving as fast as it could because of human fallibility and because of legislation and restrictions. Replacing humans with AIs in most tasks removes that slowdown, and so the process moves faster, towards loss of control.