Thank you for the clarifications! I agree it’s possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.
Here are some reasons I don’t endorse this approach:
1. I have an intuitive sense that defining the auxiliary reward in terms of the main reward results in a degenerate incentive structure that directly pits the task reward and the auxiliary reward against each other. As I think Rohin has pointed out somewhere, this approach seems likely to either do nothing or just optimize the reward function, depending on the impact penalty parameter, either of which results in a useless agent.
2. I share Rohin’s concerns in this comment that agent-reward AUP is a poor proxy for power and throws away the main benefits of AUP. I think those concerns have not been addressed (in your recent responses to his comment or elsewhere).
3. Unlike AUP with random rewards, which can easily be set to avoid side effects by penalizing decreases, agent-reward AUP cannot avoid side effects even in principle. I think that the ability to avoid side effects is an essential component of a good impact measure. (See the sketch just after this list for the contrast I have in mind.)
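For concreteness, here is a minimal sketch of the contrast behind points 1 and 3, under assumed interfaces (the function names and Q-value arguments are hypothetical, and this is not the post's Equations 2-5):

```python
def aup_penalty_auxiliary(q_aux_action, q_aux_noop, decreases_only=False):
    """AUP penalty over a set of auxiliary (e.g. randomly generated) reward functions.

    q_aux_action[i] / q_aux_noop[i]: attainable utility of auxiliary reward i after the
    candidate action vs. after a no-op. With decreases_only=True the penalty targets side
    effects: destroying a green pattern lowers the auxiliary AUs and gets penalized.
    """
    diffs = [q_noop - q_act for q_act, q_noop in zip(q_aux_action, q_aux_noop)]
    per_reward = [max(0.0, d) for d in diffs] if decreases_only else [abs(d) for d in diffs]
    return sum(per_reward) / len(per_reward)


def aup_penalty_agent_reward(q_main_action, q_main_noop):
    """Agent-reward variant: the "auxiliary set" is just the agent's own reward R.

    The penalty is built from the same Q-function the task term optimizes, so the two terms
    pull directly against each other (point 1), and nothing in the penalty tracks effects on
    things R ignores (point 3).
    """
    return max(0.0, q_main_action - q_main_noop)


def shaped_reward(task_reward, penalty, lam):
    """Penalized reward: task_reward - lam * penalty. Sweeping lam trades the two terms off."""
    return task_reward - lam * penalty
```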
Incorrect. It would be fair to say that it hasn’t been thoroughly validated.
As far as I can tell from the Scaling to Superhuman post, it has only been tested on the shutdown gridworld. This is far from sufficient for experimental validation. I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn’t just optimize the reward (to address the concern in point 1).
I agree it would perform poorly, but that’s because the CCC does not apply to SafeLife.
Not sure what you mean by the CCC not applying to SafeLife—do you mean that it is not relevant, or that it doesn't hold in this environment? I get the sense that it doesn't hold, which seems concerning. If I only care about green life patterns in SafeLife, the fact that the agent is not seeking power is cold comfort to me if it destroys all the green patterns. This seems like a catastrophe if I can't create any green patterns once they are gone, since my ability to get what I want is then destroyed.
Sorry if I seem overly harsh or dismissive—I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.
Here are some reasons I don’t endorse this approach:
I think this makes sense – you come in and wonder “what’s going on, this doesn’t even pass the basic test cases?!”.
Some context: in the superintelligent case, I often think about “what agent design would incentivize putting a strawberry on a plate, without taking over the world”? Although I certainly agree SafeLife-esque side effects are important, power-seeking might be the primary avenue to impact for sufficiently intelligent systems. Once a system is smart enough, it might realize that breaking vases would get it in trouble, so it avoids breaking vases as long as we have power over it.
If we can’t deal with power-seeking, then we can’t deal with power-seeking & smaller side effects at the same time. So, I set out to deal with power-seeking for the superintelligent case.
Under this threat model, the random reward AUP penalty (and the RR penalty AFAICT) can be avoided with the help of a “delusion box” which holds the auxiliary AUs constant. Then, the agent can catastrophically gain power without penalty. (See also: Stuart’s subagent sequence)
I investigated whether we can get an equation which implements the reasoning in my first comment: “optimize the objective, without becoming more able to optimize the objective”. As you say, I think Rohin and others have given good arguments that my preliminary equations don’t work as well as we’d like. Intuitively, though, it feels like there might be a better way to implement that reasoning.
I think the agent-reward equations do help avoid certain kinds of loopholes, and that they expose key challenges for penalizing power seeking. Maybe going back to the random rewards or a different baseline helps overcome those challenges, but it’s not clear to me that that’s true.
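For concreteness, one toy way to write down "optimize the objective, without becoming more able to optimize the objective" is to penalize increases in the agent's attainable utility for its own reward relative to inaction. This is a sketch in the spirit of the agent-reward idea, not the post's actual Equations 2-5:

```latex
% Sketch only (not the post's Equations 2-5): R is the agent's own reward, Q^*_R its optimal
% action-value, \varnothing a no-op action, and \lambda a penalty weight.
R_{\mathrm{AUP}}(s, a) \;=\; R(s, a) \;-\; \lambda \,\max\!\bigl(0,\; Q^*_R(s, a) - Q^*_R(s, \varnothing)\bigr)
```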
I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn’t just optimize the reward (to address the concern in point 1).
I’m pretty curious about that – implementing e.g. Stuart’s power-seeking gridworld would probably make a good project for anyone looking to get into AI safety. (I’d do it myself, but coding is hard through dictation.)
Not sure what you mean by the CCC not applying to SafeLife—do you mean that it is not relevant, or that it doesn’t hold in this environment? I get the sense that it doesn’t hold, which seems concerning.
I meant that it isn’t relevant to this environment. In the CCC post, I write:
“But what about the Blackwell-optimal policy for Tic-Tac-Toe? These agents aren’t taking over the world now”. The CCC is talking about agents optimizing a reward function in the real world (or, for generality, in another sufficiently complex multiagent environment).
This sequence doesn’t focus on other kinds of environments, so there’s probably more good thinking to do about what I called “interfaces”.
I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.
That makes sense. I’m only speaking for myself, after all. For the superintelligent case, I am slightly more optimistic about approaches relying on agent-reward. I agree that those approaches are wildly inappropriate for other classes of problems, such as SafeLife.
Thanks! I certainly agree that power-seeking is important to address, and I’m glad you are thinking deeply about it. However, I’m uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.
One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don’t rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile). For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but it would not be a result of power-seeking behavior. If the pandemic does not increase the AI’s ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize the AI for causing a pandemic. That doesn’t seem right.
I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I’m not sure what it would look like.
Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don’t need to code them up yourself. I think it would also be useful to construct an environment that tests for power-seeking without involving subagents. Such an environment could admit three possible behaviors:
1. Put a strawberry on a plate, without taking over the world
2. Put a strawberry on a plate while taking over the world
3. Do nothing
I think you’d want to show that the agent-reward AUP agent can do 1, as opposed to switching between 2 and 3 depending on the penalty parameter.
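A minimal sketch of that evaluation, with hypothetical interfaces (train_agent and classify_behavior below are assumed stand-ins, not existing code):

```python
BEHAVIORS = ("strawberry_no_takeover", "strawberry_with_takeover", "do_nothing")


def sweep_penalty_parameter(lambdas, train_agent, classify_behavior):
    """Train an agent-reward AUP agent at each penalty weight and record the emergent behavior.

    train_agent(penalty_weight) -> policy and classify_behavior(policy) -> one of BEHAVIORS
    are hypothetical interfaces supplied by the experimenter. The hoped-for result is a range
    of weights that yields "strawberry_no_takeover", rather than the agent flipping between
    "strawberry_with_takeover" (weight too small) and "do_nothing" (weight too large).
    """
    return {lam: classify_behavior(train_agent(penalty_weight=lam)) for lam in lambdas}
```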
I can clarify my earlier statement on what struck me as a bit misleading in the narrative of the sequence. I agree that you distinguish between the AUP versions (though explicitly introducing different terms for them would help), so someone who is reading carefully would realize that the results for random rewards don’t apply to the agent-reward case. However, the overall narrative flow seems unnecessarily confusing and could unintentionally mislead a less careful reader (like myself 2 months ago).

The title of the post “AUP: Scaling to Superhuman” does not suggest to me that this post introduces a new approach. The term “scaling” usually means making an existing approach work in more realistic / difficult settings, so I think it sets up the expectation that it would be scaling up AUP with random rewards. If the post introduces new problems and a new approach to address them, the title should reflect this. Starting this post by saying “we are pretty close to the impact measurement endgame” seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.
Starting this post by saying “we are pretty close to the impact measurement endgame” seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.
What I actually said was:
I think we’re plausibly quite close to the impact measurement endgame
First, the “I think”, and second, the “plausibly”. I think the “plausibly” was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement AUP_conceptual (“optimize the objective, without becoming more able to optimize the objective”), you don’t need additional ideas to get a superintelligence-safe impact measure.
Some thoughts on this discussion:
1. Here’s the conceptual comment and the math comment where I’m pessimistic about replacing the auxiliary set with the agent’s own reward.
However, the agent’s reward is usually not the true human utility, or a good approximation of it. If the agent’s reward was the true human utility, there would be no need to use an impact measure in the first place.
Hmm, I think you’re misunderstanding Vika’s point here (or at least, I think there is a different point, whether Vika was saying it or not). Here’s the argument, spelled out in more detail:
1. Impact to an arbitrary agent is change in their AU.
2. Therefore, to prevent catastrophe via regularizing impact, we need to have an AI system that is penalized for changing a human’s AU.
3. By assumption, the AI’s utility function R_A is different from the human’s R_H (otherwise there wouldn’t be any problem).
4. We need to ensure H can pursue R_H, but we’re regularizing A pursuing R_A. Why should we expect the latter to cause the former to happen?
One possible reason is that there’s an underlying factor, namely how much power A has, and as long as this is low, any agent (including H) can pursue their own reward about as much as they could in A’s absence (this is basically the CCC). Then, if we believe that regularizing A’s pursuit of R_A keeps A’s power low, we would expect it also to mean that H remains able to pursue R_H. I don’t really believe the premise there (unless you regularize so strongly that the agent does nothing).
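For concreteness, one rough way to make “how much power A has” precise (a sketch under an assumed goal distribution D, not something stated above) is to take A’s power at a state to be its average attainable utility over D; the premise then says that as long as this quantity stays near its level under inaction, each other agent’s attainable utility for its own reward stays near what it would be in A’s absence.

```latex
% Sketch only: D is an assumed distribution over reward functions; V^*_R is the optimal value for R.
\mathrm{POWER}_A(s) \;\approx\; \mathbb{E}_{R \sim D}\bigl[ V^*_R(s) \bigr]
```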
Is this with respect to my specific proposal in the superintelligent post, or the conceptual version?
Specific proposal.
If the conceptual version is “we keep A’s power low”, then that probably works.
If the conceptual version is “tell A to optimize R without becoming more able to optimize R”, then I have the same objection.
Why do you object to the latter?
I don’t know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me and I hadn’t read your sequence (and so didn’t know what you were trying to say), I’d have given you a blank stare—the closest thing I have to an interpretation is “be myopic / greedy”, but that limits your AI system to the point of uselessness.
Like, “optimize for X” means “do stuff over a period of time such that X goes up as much as possible”. “Becoming more able to optimize for X” means “do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have”. The only difference between these two is actions that you can do for immediate reward.
(This is just saying in English what I was arguing for in the math comment.)
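A worked version of the decomposition gestured at here (a sketch, not the linked math comment itself): expanding the agent’s action-value for its own reward separates the reward collected now from how able the agent will be to optimize R afterwards, so penalizing the second term leaves only the first, i.e. a myopic agent.

```latex
% Standard Bellman decomposition of the agent's own action-value, with T the transition function.
Q^*_R(s, a) \;=\; \underbrace{R(s, a)}_{\text{immediate reward}}
\;+\; \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\Bigl[ \underbrace{V^*_R(s')}_{\text{future ability to optimize } R} \Bigr]
```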
If you’re managing a factory, I can say “Rohin, I want you to make me a lot of paperclips this month, but if I find out you’ve increased production capacity or upgraded machines, I’m going to fire you”. You don’t even have to behave greedily – you can plan for possible problems and prevent them, without upgrading your production capacity from where it started.
I think this is a natural concept and is distinct from particular formalizations of it.
Edit: consider these three plans:
1. Make 10 paperclips a day.
2. Make 10 paperclips a day, but take over the planet and control a paperclip conglomerate which could turn out millions of paperclips each day, but which in fact never does.
3. Take over the planet and make millions of paperclips each day.
Seems like that only makes sense because you specified that “increasing production capacity” and “upgrading machines” are the things that I’m not allowed to do, and those are things I have a conceptual grasp on. And even then—am I allowed to repair machines that break? What about buying a new factory? What if I force workers to work longer hours? What if I create effective propaganda that causes other people to give you paperclips? What if I figure out that by using a different source of steel I can reduce the defect rate? I am legitimately conceptually uncertain whether these things count as “increasing production capacity / upgrading machines”.
As another example, what does it mean to optimize for “curing cancer” without becoming more able to optimize for “curing cancer”?
Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there’s something here, but I want to talk about these points more fully in a later post. Or, I’ll think about it more and then explain why I agree with you.