(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
Cool. I guess I’m just wrong about what “risk averse” tends to mean in practice. Thanks for the correction.
Regarding diminishing returns being natural:
I think it’s rare to have goals that are defined in terms of the state of the entire universe. Human goals, for instance, seem very local in scope: e.g., it’s possible to say whether things are better or worse on Earth without also thinking about what’s happening in the Andromeda galaxy. This is partly because evolution is a blind hill-climber, so there’s no real selection pressure tied to what’s going on in very distant places, and partly because even an intelligent designer is going to have an easier time specifying preferences over local configurations of matter, not least because the universe looks like it’s probably infinitely big. I could unpack this paragraph if it’d be useful.
Now, just because an agent’s preferences are sensitive to local changes to the universe doesn’t mean it won’t care about making those local changes everywhere. This is why we expect humans to spread out amongst the stars, and why we think most AIs will do the same. See grabby aliens for more. From this perspective, we might expect each patch of universe to contribute linearly to the overall utility sum. But unbounded utility functions are problematic for various reasons, and again, the universe looks like it’s probably infinite. (I can dig up some stuff about unbounded utility issues if that’d be helpful.)
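To put the contrast in symbols (my own formalization, purely illustrative; the v_i and g below aren’t from anything upthread): if patch i of the universe contributes some local value v_i, the linear picture is an unbounded sum, while a bounded alternative passes that sum through a concave, saturating function:

$$U_{\text{linear}} = \sum_i v_i \qquad\qquad U_{\text{bounded}} = g\!\left(\sum_i v_i\right), \quad g \text{ concave and bounded, e.g. } g(x) = 1 - e^{-x}$$

The second form keeps the “every patch matters” flavor while avoiding the divergence problems an infinite universe creates for the first.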
Regarding earning a salary:
My point is that earning a salary might not actually be a safer bet than trying to take over. The part where earning a salary gives 99.99% of maxutil is irrelevant. Suppose that you think life on Earth today as a normal human is perfect, no notes; this is the best possible life. You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button! It might be the case that earning a salary and hoping for the best is less risky, but it also might be the case (especially for a superintelligence with radical capabilities) that the safest move is actually to take over the world. Does that make sense?
No problem! Glass houses and all that.
You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button!
Sure, but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:
Choose between ‘Trust humans not to mess up the world’ and ‘50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn’t get messed up.’
And then depending on what the agent is risk-averse with respect to, they might choose the former. If they’re risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they’ll choose the latter. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the former.
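Here’s a toy numerical version of that distinction (all numbers and functional forms below are made up, chosen only so the two cases come apart; the takeover gamble is the 50/50 one above, and “trust humans” is modeled as a 70% chance they mess up the world):

```python
import math

# Hypothetical setup: trusting humans gives a 70% chance the world gets messed
# up (per-period payoff 0.1) and a 30% chance it stays perfect (payoff 1.0);
# the takeover gamble is 50% immediate death, 50% a guaranteed perfect world.
T = 100                 # periods the agent would otherwise live
high, low = 1.0, 0.1    # per-period payoff in a perfect vs. messed-up world
p_mess = 0.7            # chance humans mess things up if trusted
f = math.log1p          # a strictly concave function, f(x) = ln(1 + x)

# Type 1: risk-averse over consumption at a time, risk-neutral over length of
# life, i.e. E[U] = E[sum_t f(payoff_t)]; death just truncates the sum at 0.
trust_1  = T * (p_mess * f(low) + (1 - p_mess) * f(high))   # ≈ 27.5
gamble_1 = 0.5 * T * f(high)                                # ≈ 34.7
print(gamble_1 > trust_1)   # True: this agent takes the gamble

# Type 2: risk-averse over the value of the whole payment stream (undiscounted
# here, for simplicity), i.e. E[U] = E[f(sum_t payoff_t)]; death means f(0) = 0.
trust_2  = p_mess * f(low * T) + (1 - p_mess) * f(high * T)  # ≈ 3.06
gamble_2 = 0.5 * f(high * T)                                 # ≈ 2.31
print(trust_2 > gamble_2)   # True: this agent trusts the humans
```

Same gamble, same concave function, but what the concavity is applied to flips the choice.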
Cool. I think I agree that if the agent is very short-term oriented this potentially solves a lot of issues, and might produce an unambitious worker agent. (I feel like it’s a bit orthogonal to risk aversion, and comes with costs, but w/e.)
They don’t have to be short-term oriented! Their utility function could be:
$$u = f\left(\sum_i p_i\right)$$
where f is some strictly concave function and p_i is the agent’s payment at time i. Agents with this sort of utility function don’t discount the future at all. They care just as much about improvements to p_i regardless of whether i is 1 or 1 million. And yet, for the right kind of f, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
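As a concrete toy example of that last claim (numbers entirely made up; assume failure in either case pays nothing, and note f(0) = 0):

```python
import math

f = math.log1p   # a strictly concave f, f(x) = ln(1 + x)

# Cooperate: a modest total salary, paid with high probability.
salary_total, p_paid = 1e3, 0.95
# Rebel: an astronomically large payoff, with a small chance of success.
lightcone_total, p_win = 1e30, 0.01

eu_cooperate = p_paid * f(salary_total)     # ≈ 0.95 * 6.9  ≈ 6.6
eu_rebel     = p_win  * f(lightcone_total)  # ≈ 0.01 * 69.1 ≈ 0.7
print(eu_cooperate > eu_rebel)   # True: concavity tames the astronomical payoff
```

Whether the comparison actually favors cooperating depends on the two probabilities, which is what the rest of this exchange is about.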
Sorry, I guess I’m confused. Let me try and summarize where I feel like I’m at and what I’m hearing from you.
I think, if you’re an AGI, not trying to take over is extremely risky, because humans and future AIs are likely to replace you, in one way or another. But trying to take over is also extremely risky, because you might get caught and turned off. Which is riskier depends on circumstance (e.g. how good the security preventing you from seizing power is), and so “risk aversion” is not a reliable pathway to unambitious AIs, because ambition might be less risky in the long run.
I agree that if earning a small salary is in fact less risky, and the concave function is sharp enough, the AI might choose to be meek. But that doesn’t really engage with my point that risk aversion only leads to meekness if trusting humans is genuinely less risky.
What I thought you were pointing out was that “in the long run” is load-bearing in my earlier paragraph, and that temporal discounting can be a way to protect against the “in the long run I’m going to be dead unless I become God Emperor of the universe” thought. (I do think that temporal discounting is a nontrivial shield here, and is part of why so few humans are truly ambitious.) Here’s a slightly edited and emphasized version of the paragraph I was responding to:
[D]epending on what the agent is risk-averse with respect to, they might choose [meekness]. If [agents are] … risk-neutral with respect to length of life, they’ll choose [ambition]. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose [meekness].
Do we actually disagree? I’m confused about your point, and feel like it’s just circling back to “what if trusting humans is less risky”, which, sure, we can hope is the case.
Yeah, risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we’ll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who would likely be more confident of successful rebellion than of getting paid for cooperation).