I’m pattern-matching that as proposing that the almost-catch-22 is solvable by iteratively
1.) incrementing the AI’s capabilities a little bit
2.) using those improved capabilities to improve the AI’s alignedness (to the extent possible); goto (1.)
Does that sound like a reasonable description of what you were saying?
I think it might be at least a reasonable first approximation, yeah.
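To make that loop concrete, I’m imagining something with roughly this shape (purely illustrative pseudocode; the function names are made up and nothing here corresponds to a real training pipeline):

```python
# Purely illustrative sketch of the capabilities/alignment bootstrapping loop
# described above. Treating "capabilities" and "alignment" as cleanly separable
# update steps is a simplifying assumption, not a claim about real systems.

def iterative_bootstrapping(model, improve_capabilities, improve_alignment, rounds):
    for _ in range(rounds):
        model = improve_capabilities(model)  # (1) increment the AI's capabilities a little
        model = improve_alignment(model)     # (2) use the improved capabilities to improve
                                             #     its alignedness, then go back to (1)
    return model
```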
If yes, I’m guessing that you believe sharp left turns are (very) unlikely?
I wouldn’t be so confident as to say they’re very unlikely, but also not convinced that they’re very likely. I don’t have the energy to do a comprehensive analysis of the post right now, but here are some disagreements that I have with it:
“The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF...” I suspect this analogy is misleading; I touched on some of the reasons why in this comment. (I have a partially finished draft for a post with a thesis along the lines of “genetic fitness isn’t what evolution selects for; fitness is a measure of how strongly evolution selects for some other trait”, but I need to check my reasoning / finish it.)
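To gesture at the framing I have in mind (this is just standard population-genetics bookkeeping that I’m using as shorthand here, not the draft’s actual argument), the Price equation puts it this way:

$$\Delta \bar{z} \;=\; \frac{\operatorname{Cov}(w_i, z_i)}{\bar{w}} \;+\; \frac{\operatorname{E}[\,w_i \,\Delta z_i\,]}{\bar{w}}$$

where $z_i$ is the value of some trait in individual $i$ and $w_i$ is that individual’s fitness. What changes under selection is the trait $z$; fitness $w$ only shows up as the weighting that measures how strongly selection pushes on that trait.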
The post suggests that we might get an AI’s alignment properties up to some level, but that at some point its capabilities shoot up to a point where those alignment properties aren’t enough to prevent us from all being killed. I think that if the preference fulfillment hypothesis is right, then “don’t kill the people whose preferences you’re trying to fulfill” is probably going to be a relatively basic alignment property (it’s impossible to fulfill the preferences of someone who is dead). So hopefully we should be able to lock that in before the AI’s capabilities get to the sharp left turn. (Though it’s still possible that the AI makes some subtler mistake than killing everyone outright, say modifying people’s brains to make their preferences easier to optimize.)
“sliding down the capabilities well is liable to break a bunch of your existing alignment properties [...] things in the capabilities well have instrumental incentives that cut against your alignment patches”. This seems to assume that the AI has been built with a motivation system that does not primarily optimize for something like alignment; rather, the alignment has been achieved by “patches” on top of the existing motivation system. But if the AI’s sole motivation is just fulfilling human preferences, then the alignment doesn’t take the form of patches that are trying to combat its actual motivation.
I’d expect the outcome of an AI optimizing for D to depend a lot on things like
just what kind of reflective process the AI was simulating H to be performing
in what specific ways/order the AI fulfilled H’s various preferences
what information/experiences the AI caused H to have, while going about fulfilling H’s preferences.
Agree; these are the kinds of things that I mentioned as still being unsolved problems and highly culturally contingent, in my original post. Though it seems worth noting that if there aren’t objective right or wrong answers to these questions, then the AI can’t really get them wrong, either. Plausibly different approaches to fulfilling our preferences could lead to very different outcomes… but maybe we would be happy with any outcome we ultimately ended up at, since “are humans happy with the outcome” is probably going to be a major criterion in any preference fulfillment process.
I’m not certain of this. Maybe there are versions of D that the AI might end up on where the humans are doing the equivalent of suffering horribly on the inside while pretending to be okay on the outside, and that looks to the AI like their preferences are being fulfilled. But I’d guess that to be less likely than most versions of D genuinely caring about our happiness.
And then there’s the issue of {how do you go about actually “programming” a good version of D in the AI’s ontology, and load that program into the AI, as the AI gains capabilities?}; see (I.).
My reply to Steven (the part about how I think preference fulfillment might be “natural”) might be relevant here.
I think this would run into all the classic problems arising from {rewarding proxies to what we actually care about}, no?
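For instance (a toy construction of my own, not anything from the post): if the reward signal is a proxy with a heavy-tailed, gameable component, then selecting ever harder on the proxy eventually stops buying more of what we actually care about:

```python
import numpy as np

# Toy illustration (my own construction) of a proxy coming apart under optimization
# pressure: the proxy is the true value plus a heavy-tailed "gameable" component,
# so the most extreme proxy scores are dominated by the gameable part.

rng = np.random.default_rng(0)
n = 1_000_000
true_value = rng.normal(size=n)           # what we actually care about
gameable = rng.standard_cauchy(size=n)    # heavy-tailed, easy-to-game component
proxy = true_value + gameable             # the signal that actually gets rewarded

for top_k in (100_000, 10_000, 1_000, 100):
    chosen = np.argsort(proxy)[-top_k:]   # select ever more extremely on the proxy
    print(f"top {top_k:>7}: mean true value = {true_value[chosen].mean():.3f}")
```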
It certainly sounds like it… but then somehow humans do manage to genuinely come to care about what makes other humans happy, so there seems to be some component (that might be “natural” in the sense that I described to Steven) that helps us avoid it.
Side note: I’m weirded out by all the references to humans, raising human children, etc. I think that kind of stuff is probably not practically relevant/useful for alignment;
About five to ten years ago, I would have shared that view. But more recently I’ve been shifting toward the view that maybe AIs are going to look and work relatively similarly to humans. Because if there’s a solution X for intelligence that is relatively easy to discover, then it might be that both early-stage AI researchers and evolution would hit upon that solution first, exactly because it’s the easiest solution to discover. And also, humans are the one example we have of an intelligence that’s at least somewhat aligned with the human species, so that seems to be the place we should be looking for solutions.
See also “Humans provide an untapped wealth of evidence about alignment”, which I largely agree with.
I agree with some parts of what (I think) you’re saying; but I think I disagree with a lot of it. My thoughts here are still blurry/confused, though; will need to digest this stuff further. Thanks!