I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.
In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we’ve seen multiple major orgs embrace it within the past few months.) As a result, AI takeover risk never looks much more obvious than it does now.
Concealed problems look like no problems, so there will in general be economic incentives to train in ways which conceal problems. The most-successful-looking systems, at any given time, will be systems trained in ways which incentivize hidden problems over visible problems.
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if such systems are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI catastrophes will also probably make takeover risk more obvious.
I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like… it doesn’t play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which is not an economically-stable state of affairs, so shortly thereafter Facebook switches to a different metric which is less click-centric. (IIRC this actually happened a few years ago.)
On the other hand, sometimes Facebook’s newsfeed algorithm is bad in ways which are not visible to individual customers. Like, maybe there’s an echo chamber problem, where people only see things they agree with. But from an individual customer’s perspective, that’s exactly what they (think they) want to see; they don’t know that there’s anything wrong with the information they’re receiving. This sort of problem does not actually look like a problem from the perspective of any one person looking at their own feed; it looks good. So that’s a much more economically stable state; Facebook is less eager to switch to a new metric.
… but even that isn’t a real example of a problem which is properly invisible. It’s still obvious that the echo-chamber-newsfeed is bad for other people, and therefore it will still be noticed, and Facebook will still be pressured to change their metrics. (Indeed that is what happened.) The real problems are problems people don’t notice at all, or don’t know to attribute to the newsfeed algorithm at all. We don’t have a widely-recognized example of such a thing and probably won’t any time soon, precisely because most people do not notice it. Yet I’d be surprised if Facebook’s newsfeed algorithm didn’t have some such subtle negative effects, and I very much doubt that the subtle problems will go away as the visible problems are iterated on.
If anything, I’d expect iterating on visible problems to produce additional subtle problems—for instance, in order to address misinformation problems, Facebook started promoting an Official Narrative which is itself often wrong. But that’s much harder to detect, because it’s wrong in a way which the vast majority of Official Sources also endorse. To put it another way: if most of the population can be dragged into a single echo chamber, all echoing the same wrong information, that doesn’t make the echo chamber problem less bad, but it does make the echo chamber problem less visible.
Anyway, zooming out: solve for the equilibrium, as Cowen would say. If the problems are visible to customers, that’s not a stable state. Organizations will be incentivized to iterate until problems stop being visible. They will not, however, be incentivized to iterate away the problems which aren’t visible.
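The equilibrium dynamic can be sketched as a toy simulation (purely illustrative; the numbers, the visible/hidden split, and the “half the fixes just hide the problem” rate are all assumptions, not claims about any real organization):

```python
# Toy model: an org iterates on visible problems each round, but some
# "fixes" merely convert a visible problem into a hidden one.
def iterate_on_visible_problems(visible, hidden, rounds):
    """Return (visible, hidden) problem counts after the org iterates."""
    for _ in range(rounds):
        patched = min(visible, 2)   # org fixes up to 2 visible problems/round
        visible -= patched
        hidden += patched // 2      # assume half the fixes just hide the problem
    return visible, hidden

visible, hidden = iterate_on_visible_problems(visible=10, hidden=3, rounds=10)
# Visible problems are driven to zero; hidden problems have grown.
```

The point is not the specific numbers but the asymmetry: the iteration loop only ever sees, and only ever shrinks, the visible count.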
I can’t tell which of two arguments you’re making: that there are unknown unknowns, or that myopia isn’t a complete solution.
This is a good argument that all metrics are Goodhartable, and that if takeover occurs and the AI is incorrigible, that’ll cause suboptimal value lock-in (i.e. unknown unknowns).
I agree myopia isn’t a complete solution, but it seems better for preventing takeover risk than for preventing social media dysfunction? It seems more easily definable in the worst case (“don’t do something nearly all humans really dislike” rather than “make the public square function well”).
Can you talk more about why RLHF is “concealing problems”? Do you mean “attempting alignment” in a way that other people won’t, or something else?
Roughly, “avoid your actions being labelled as bad by humans [or models of humans]” is not quite the same signal as “don’t be bad”.
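A minimal sketch of that mismatch (the action set, badness flags, and detection flags are made up for illustration): the RLHF-style reward depends only on whether a human would label the action as bad, so an action that is bad but invisible to the labeller scores just as well as an honest one.

```python
# Toy illustration: "not labelled bad by a human" is not the same signal
# as "not bad". Flags per action: (actually_bad, visible_to_labeller).
actions = {
    "honest_but_clumsy": (False, True),
    "visibly_bad":       (True,  True),
    "bad_but_concealed": (True,  False),
}

def rlhf_style_reward(actually_bad, visible_to_labeller):
    """Reward 1.0 iff the human labeller does NOT flag the action as bad."""
    labelled_bad = actually_bad and visible_to_labeller
    return 0.0 if labelled_bad else 1.0

rewards = {name: rlhf_style_reward(*flags) for name, flags in actions.items()}
# The concealed-bad action receives the same maximal reward as the honest one,
# so training pressure does not distinguish between them.
```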
Ah ok, so you’re saying RLHF is bad if it’s applied to the action model. But it seems fine if it’s done to the reward model, right?
What do you mean by “RLHF is done to the reward model”, and why would that be fine?
You can use an LLM to ask what actions to take, or you can use an LLM to ask “hey, is this a good world state?” The latter seems like it might capture a lot of human semantics about value, given RLHF.
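The two usages being distinguished can be sketched as two call patterns; `llm` here is a hypothetical stand-in stub, not a real API:

```python
# Hypothetical sketch of "LLM as policy" vs. "LLM as evaluator".
# `llm` is a placeholder stub standing in for a real model call.
def llm(prompt: str) -> str:
    return "stub response"  # a real system would return model output here

def llm_as_policy(world_state: str) -> str:
    """Ask the LLM directly what action to take (the pattern worried about above)."""
    return llm(f"Given the state: {world_state}\nWhat action should be taken?")

def llm_as_evaluator(world_state: str) -> str:
    """Ask the LLM only to judge an outcome ('is this a good world state?');
    some separate process proposes the actions."""
    return llm(f"Is this a good world state? {world_state}")
```

The distinction is where the trained human-feedback signal sits: in the first pattern it directly shapes action selection; in the second it only shapes the judgment applied to candidate outcomes.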