In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:
…I’m not really sure what I think. I feel like I have a lot of thoughts that have not gelled into a coherent whole.
(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).
People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and it says “bad thing”, then they’ll say “well what can I do about it?”, and the AI will answer that. That can include participating in novel coordination mechanisms etc.
(B) The pessimistic side of me says there’s like a “Law of Conservation of Wisdom”, where if people lack wisdom, then an AI that’s supposed to satisfy those people’s preferences will not create new wisdom from thin air. For example:
If an AI is known to be de-converting religious fundamentalists, then religious fundamentalists will hear about that, and not use that AI.
Hugo Chávez had his pick of the best economists in the world to ask for advice, and they all would have said “price controls will be bad for Venezuela”, and yet he didn’t ask, or perhaps didn’t listen, or perhaps wasn’t motivated by what’s best for Venezuela. If Hugo Chávez had had his pick of AIs to ask for advice, why do we expect a different outcome?
If someone has motivated reasoning towards Conclusion X, maybe they’ll watch the AIs debate Conclusion X, and wind up with new better rationalizations of Conclusion X, even if Conclusion X is wrong.
If someone has motivated reasoning towards Conclusion X, maybe they just won’t ask the AIs to debate Conclusion X, because no right-minded person would even consider the possibility that Conclusion X is wrong.
If someone makes an AI that’s sycophantic where possible (i.e., when it won’t immediately get caught), other people will opt into using it.
I think about people making terrible decisions that undermine societal resilience—e.g. I gave the example here of a person doing gain-of-function research, or here of USA government bureaucrats outlawing testing people for COVID during the early phases of the pandemic. I try to imagine that they have AI assistants. I want to imagine the person asking the AI “should we make COVID testing illegal”, and the AI says “wtf, obviously not”. But that mental image is evidently missing something. If they were asking that question at all, then they don’t need an AI, the answer is already obvious. And yet, testing was in fact made illegal. So there’s something missing from that imagined picture. And I think the missing ingredient is: institutional / bureaucratic incentives and associated dysfunction. People wouldn’t ask “should we make COVID testing illegal”, rather the low-level people would ask “what are the standard procedures for this situation?” and the high-level people would ask “what decision can I make that would minimize the chance that things will blow up in my face and embarrass me in front of the people I care about?” etc.
I think of things that are true but currently taboo, and imagine the AI asserting them, and then I imagine the AI developers profusely apologizing and re-training the AI to not do that.
In general, motivated reasoning complicates what might seem to be a sharp line between questions of fact / making mistakes versus questions of values / preferences / decisions. Etc.
…So we should not expect wise and foresightful coordination mechanisms to arise.
So how do we reconcile (A) vs (B)?
Again, the logic of (A) is: “human is unhappy with how things turned out, despite opportunities to change things, therefore there must have been a lack of single-single alignment”.
One possible way to think about it: When tradeoffs exist, then human preferences are ill-defined and subject to manipulation. If doing X has good consequence P and bad consequence Q, then the AI can make either P or Q very salient, and “human preferences” will wind up different.
And when tradeoffs exist between the present and the future, then it’s invalid logic to say “the person wound up unhappy, therefore their preferences were not followed”. If their preferences are mutually contradictory (and they are), then it’s impossible for all their preferences to be followed, and it’s possible for an AI helper to be as preference-following as is feasible despite the person winding up unhappy or dead.
I think Paul kinda uses that invalid logic, i.e. treating “person winds up unhappy or dead” as proof of single-single misalignment. But if the person has an immediate preference to not rock the boat, or to maintain their religion or other beliefs, or to not think too hard about such-and-such, or whatever, then an AI obeying those immediate preferences is still “preference-following” or “single-single aligned”, one presumes, even if the person winds up unhappy or dead.
…So then the optimistic side of me says: “Who’s to say that the AI is treating all preferences equally? Why can’t the AI stack the deck in favor of ‘if the person winds up miserable or dead, that kind of preference is more important than the person’s preference to not question their cherished beliefs or whatever’?”
…And then the pessimistic side says: “Well sure. But that scenario does not violate the Law of Conservation of Wisdom, because the wisdom is coming from the AI developers imposing their meta-preferences for some kinds of preferences (e.g., reflectively-endorsed ones) over others. It’s not just a preference-following AI but a wisdom-enhancing AI. That’s good! However, the problems now are: (1) there are human forces stacked against this kind of AI, because it’s not-yet-wise humans who are deciding whether and how to use AI, how to train AI, etc.; (2) this is getting closer to ambitious value learning which is philosophically tricky, and worst of all (3) I thought the whole point of corrigibility was that humans remain in control, but this is instead a system that’s manipulating people by design, since it’s supposed to be turning them from less-wise to more-wise. So the humans are not in control, really, and thus we need to get things right the first time.”
…And then the optimistic side says: “For (2), c’mon it’s not that philosophically tricky, you just do [debate or whatever, fill-in-the-blank]. And for (3), yeah the safety case is subtly different from what people in the corrigibility camp would describe, but saying ‘the human is not in control’ is an over-the-top way to put it; anyway we still have a safety case because of [fill-in-the-blank]. And for (1), I dunno, maybe the people who make the most powerful AI will be unusually wise, and they’ll use it in-house for solving CEV-ASI instead of hoping for global adoption.”
…And then the pessimistic side says: I dunno. I’m not sure I really believe any of those. But I guess I’ll stop here, this is already an excessively long comment :)
I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul, my impression was this is actually somewhat cruxy and we disagree about self-unalignment - where my mental image is: if you start with an incoherent bundle of self-conflicted values, and you plug this into an IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott’s review of What We Owe the Future where he is worried that in a philosophy game, a smart moral philosopher can extrapolate his values to ‘I have to have my eyes pecked out by angry seagulls or something’ and hence does not want to play the game. AIs will likely be more powerful in this game than Will MacAskill.)
My current position is we still don’t have a good answer. I don’t trust the response ‘we can just assume the problem away’, nor the response ‘this is just another problem which you can delegate to future systems’. On the other hand, existing AIs already seem to be doing a lot of value extrapolation and the results sometimes seem surprisingly sane, so maybe we will get lucky, or a larger part of morality is convergent. But it’s worth noting these value-extrapolating AIs are not necessarily what AI labs want or what the traditional alignment program aims for.
I pretty much agree you can end up in arbitrary places with extrapolated values, and I don’t think morality is convergent, but I also don’t think it matters for the purpose of existential risk, because assuming something like instruction following works, the extrapolation problem can be solved by ordering AIs not to extrapolate values to cases where they get tortured/killed in an ethical scenario, and more generally I don’t expect value extrapolation to matter for the purpose of making an AI safe to use.
The real impact is on CEV-style alignment plans/plans for what to do with a future AI, which are really bad plans by the lights of a lot of people’s current values, and thus I really don’t want CEV to be the basis of alignment.
Thankfully, it’s unlikely to ever be this, but it still matters somewhat, especially since Anthropic is targeting value alignment (though thankfully there are implicit constraints/grounding based on the values chosen).
In the strategy stealing assumption Paul makes an argument about people with short term preferences, that could be applied imo to people who are unwilling to listen to AI advice:
People care about lots of stuff other than their influence over the long-term future. If 1% of the world is unaligned AI and 99% of the world is humans, but the AI spends all of its resources on influencing the future while the humans only spend one tenth, it wouldn’t be too surprising if the AI ended up with 10% of the influence rather than 1%. This can matter in lots of ways other than literal spending and saving: someone who only cared about the future might make different tradeoffs, might be willing to defend themselves at the cost of short-term value (see sections 4 and 5 above), might pursue more ruthless strategies for expansion, and so on.
I think the simplest approximation is to restrict attention to the part of our preferences that is about the long-term (I discussed this a bit in [Why might the future be good?]). To the extent that someone cares about the long-term less than the average actor, they will represent a smaller fraction of this long-term preference mixture. This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing. Even this advantage might be clawed back by a majority (e.g. by taxing savers).
Maybe we can apply this same argument to people who don’t want to listen to AI advice: yes, this will lead those people to have less control over the future, but some people will be willing to listen to AI advice and their preferences will retain influence over the future. This reduces human control over the future, but it’s a one-time loss that isn’t catastrophic (that is, it doesn’t cause total loss of control). Paul calls this a one-time disadvantage rather than total disempowerment because the rest of humankind can still replicate the critical strategy the unaligned AI might have exploited.
Possible counter: “The group of people who properly listen to AI advice will be too small to matter.” Yeah, I think this could lead to e.g. a 100x reduction in control over the future (if only 1% of humans properly listen); different people are more or less upset about this. One glimmer of hope is that the humans who do listen to their AI advisors can cooperate with people who don’t and help them get better at listening, thereby further empowering humanity.
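As a quick sanity check on the numbers in this comment and in the quoted passage, here is a minimal sketch. It assumes (my simplification, not something Paul states) that a group's influence over the long-term future is simply proportional to the resources it actually spends on it; the function name `influence_shares` is made up for illustration.

```python
def influence_shares(population_shares, fraction_spent_on_future):
    """Split long-term influence among groups.

    Assumption (a simplification for illustration): influence is
    proportional to resources actually spent on influencing the future.
    """
    spent = [p * f for p, f in zip(population_shares, fraction_spent_on_future)]
    total = sum(spent)
    return [s / total for s in spent]


# Quoted example: unaligned AI is 1% of the world and spends everything on
# the future; humans are 99% and spend only one tenth.
ai_share, human_share = influence_shares([0.01, 0.99], [1.0, 0.1])
print(f"AI share of long-term influence: {ai_share:.1%}")

# The "possible counter" above: only 1% of humans properly listen to AI advice.
listening_fraction = 0.01
reduction_factor = 1 / listening_fraction
print(f"Reduction in human control: {reduction_factor:.0f}x")
```

Under that proportionality assumption the AI ends up with a bit over 9% of the influence, which matches the "10% rather than 1%" figure in the quote, and the 1%-of-listeners case gives the 100x reduction mentioned above.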
For things like solving coordination problems, or societal resilience against violent takeover, I think it can be important that most people, or even virtually all people, are making good foresighted decisions. For example, if we’re worried about a race-to-the-bottom on AI oversight, and half of relevant decisionmakers allow their AI assistants to negotiate a treaty to stop that race on their behalf, but the other half think that’s stupid and don’t participate, then that’s not good enough, there will still be a race-to-the-bottom on AI oversight. Or if 50% of USA government bureaucrats ask their AIs if there’s a way to NOT outlaw testing people for COVID during the early phases of the pandemic, but the other 50% ask their AIs how best to follow the letter of the law and not get embarrassed, then the result may well be that testing is still outlawed.
For example, in this comment, Paul suggests that if all firms are “aligned” with their human shareholders, then the aligned CEOs will recognize if things are going in a long-term bad direction for humans, and they will coordinate to avoid that. That doesn’t work unless EITHER the human shareholders (all of them, not just a few) are also wise enough to be choosing long-term preferences and true beliefs over short-term preferences and motivated reasoning, when those conflict, OR the aligned CEOs (again, all of them, not just a few) are injecting the wisdom into the system, putting their thumbs on the scale, by choosing, even over the objections of the shareholders, their long-term preferences and true beliefs over short-term preferences and motivated reasoning.