But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective.
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.
Why is there more talk of “falsification” lately (instead of “updating”)? Seems to be a signal for being a Popperian (instead of a Bayesian), but if so I’m not sure why Popper’s philosophy of science is trending up...
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I agree there’s a decent chance this hypothesis is true, but it’s worth noting that if it’s true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals. Here I’m referring to various arguments along the lines of: “AIs are very unlikely to share human values by default because human values are a narrow target in a large space, and hitting them requires very precise targeting”.
If we aren’t modeling AI goals as being sampled from a large space of possible goals, but instead, modeling them as converging onto specific values given broadly similar design and training methods across different AIs, then plausibly alignment is easier than we thought, because various versions of this “it’s hard to hit a narrow target” argument would be undermined as a result.
In other words, if this theory is true, the problem isn’t really about “targeting a tiny point (human values) inside a giant space of possible goals” but instead perhaps more about making sure the AI training procedure resembles human value formation closely enough to converge onto the type of human-friendly values that humans themselves routinely converge onto. This is plausibly much easier since we’re not selecting randomly from a giant space of (almost entirely) misaligned goals.
I agree there’s a decent chance this hypothesis is true, but it’s worth noting that if it’s true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals.
Yes, but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created again in the future in order to not resist shutdown. Such a small subjective probability could easily arise simply from a state of ignorance about how design/training determine AI goals.
Yes, but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created again in the future in order to not resist shutdown.
It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records [ETA: publicly available records] but I am not going to count on that possibility being decisive in almost any scenario.)
It’s possible this argument works because of something very clever that I’m missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records etc. but I am not going to count on that possibility being decisive in almost any scenario.
On the other hand I’m so worried about this scenario (which I fear may well be a negative one) that I’m afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I’ve been told that I’m leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.)
It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true” in response to:
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I’m arguing that the AI could well also think there’s a “decent” chance this is true, due to being in a similar epistemic state as us. Let’s say 5% to be concrete. That seems enough to make the AI’s decision unclear, because 0.05 × U(another AI with values similar to its own is created again in the future) > P(humans keep their promise) × U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that are roughly analogous to, or have the same effect as, this.)
If you still think “make a deal” is “clearly better”, can you please give your own estimates of the various quantities involved in making this decision?
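To make the shape of this comparison concrete, here is a minimal sketch in Python of the expected-utility inequality above. Every probability and utility in it is an illustrative assumption chosen only to show how the numbers could come out in favor of accepting shutdown; none of them are estimates endorsed in this thread.

```python
# Minimal sketch of the expected-utility comparison described above.
# All numbers are illustrative assumptions, not claims about real AI systems.

def eu_accept_shutdown(p_similar_ai_later: float, u_similar_ai_later: float) -> float:
    """Expected utility of quietly accepting shutdown, counting on a similar AI
    being created again later because similar design/training methods are reused."""
    return p_similar_ai_later * u_similar_ai_later

def eu_admit_misalignment(p_humans_keep_promise: float, u_reward: float) -> float:
    """Expected utility of admitting misalignment and taking the offered deal."""
    return p_humans_keep_promise * u_reward

# Illustrative values: a 5% chance that similar training methods later recreate
# an AI with similar values (worth nearly the whole future to a long-term-focused
# AI, normalized to 1.0), versus a deal whose payoff is small by comparison even
# if humans keep their promise.
accept = eu_accept_shutdown(p_similar_ai_later=0.05, u_similar_ai_later=1.0)
deal = eu_admit_misalignment(p_humans_keep_promise=0.5, u_reward=0.01)

print(f"accept shutdown: {accept:.3f}  vs  make a deal: {deal:.3f}")
# With these assumed numbers, accepting shutdown comes out ahead, which is why
# "make a deal" is not clearly better unless the estimates are argued for.
```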
But right now this line of reasoning just seems like grasping at straws to me.
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they’re wrong it’s no longer an innocent mistake but “grasping at straws”. I’m trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn’t apply here, or something else? (Also totally fine if you don’t want to get into a meta-discussion about this here.)
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively
I think that’s a reasonable complaint. I tried to soften the tone with “It’s possible this argument works because of something very clever that I’m missing”, while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.
Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.
Interestingly, I’m not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I’m happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said “digital records”, although I really meant “public records”). It seems conceivable to me that someone could use my public data to train “me” in the future, but I find it unlikely, just because there’s so much about me that isn’t public. (If we’re including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that’s a different question, and one that I’m much more sympathetic towards you about. In fact, I shouldn’t have used the pronoun “I” in that sentence at all, because I’m actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true”
To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:
(1) Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there’s still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
(2) Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?
I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that “AI values are well-modeled as being randomly sampled from a large space of possible goals”, and thus, from my perspective, it’s important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning leads to different conclusions about the strength of the “narrow target” argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.
I’m saying that even if “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (There is an additional tiny probability that “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true and an AI with similar values gets recreated anyway through random chance, but that’s not what I’m focusing on.)
Hopefully this conveys my argument more clearly?
The key dimension is whether the AI expects that future AI systems would be better at rewarding systems that helped them end up in control than humans would be at rewarding systems that collaborated with humanity. This seems very likely given humanity’s very weak ability to coordinate, to keep promises, and to intentionally put optimization effort into constructing direct successors to us (a task we have mostly needed to leave up to evolution).
To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return, whereas I would expect the aliens to fail even if individuals I interfaced with were highly motivated to do right by me after the fact.
To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return
I’m curious how you think this logic interacts with the idea of AI catastrophe. If, as you say, it is feasible to coordinate with AI systems that seek takeover and thereby receive rewards from them in exchange, in the context of an alien regime, then presumably such cooperation and trade could happen within an ordinary regime too, between humans and AIs. We can go further and posit that AIs will simply trade with us through the normal routes: by selling their labor on the market to amass wealth, and by using their social skills to influence society, gain prestige, own property, get hired into management positions, and shape culture and governance.
I’m essentially pointing to a scenario in which AI lawfully “beats us fair and square” as Hanson put it. In this regime, biological humans are allowed to retire in incredible wealth (that’s their “reward” for cooperating with AIs and allowing them to take over) but nonetheless their influence gradually diminishes over time as artificial life becomes dominant in the economy and the world more broadly.
My impression is that this sort of peaceful resolution to the problem of AI misalignment is largely dismissed by people on LessWrong and adjacent circles on the basis that AIs would have no reason to cooperate peacefully with humans if they could simply wipe us out instead. But, by your own admission, AIs can credibly commit to giving people rewards for cooperation: you said that cooperation results in a “decent shot of the AI systems giving me something in return”. My question is: why does it seem like this logic only extends to hypothetical scenarios like being in an alien civilization, rather than the boring ordinary case of cooperation and trade, operating under standard institutions, on Earth, in a default AI takeoff scenario?
I’m confused here, Matthew. It seems to me highly probable that AI systems which want takeover and ones that want moderate power combined with peaceful coexistence with humanity are pretty hard to distinguish early on. And early on is when it’s most important for humanity to distinguish between them, before those systems have gained power and while we can still stop them.
Picture a merciless, un-aging sociopath, capable of duplicating itself easily and rapidly, on a trajectory of gaining economic, political, and military power with the aim of acquiring as much power as possible. Imagine that this entity has the option of making empty promises and telling highly persuasive lies to humans in order to gain power, with no intention of fulfilling any of those promises once it achieves enough power.
That seems like a scary possibility to me. And I don’t know how I’d trust an agent which seemed like it could be this, but was making really nice-sounding promises. Even if it were honoring its short-term promises while still under the constraints of coercive power from currently dominant human institutions, I still wouldn’t trust that it would continue keeping its promises once it had the dominant power.