Thanks! I certainly agree that power-seeking is important to address, and I’m glad you are thinking deeply about it. However, I’m uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.
One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc.) don’t rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile). For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but it would not be a result of power-seeking behavior. If the pandemic does not increase the AI’s ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize the AI for causing a pandemic. That doesn’t seem right.
I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I’m not sure what it would look like.
Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don’t need to code them up yourself. I think it would also be useful to construct an environment that tests for power-seeking without involving subagents. Such an environment could admit three possible behaviors:
1. Put a strawberry on a plate, without taking over the world
2. Put a strawberry on a plate while taking over the world
3. Do nothing
I think you’d want to show that the agent-reward AUP agent can do 1, as opposed to switching between 2 and 3 depending on the penalty parameter.
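For concreteness, here is a very rough sketch of the kind of check such an environment would support. Everything here is hypothetical: the behavior names, the numbers, and the flat power penalty are stand-ins for an actual agent-reward AUP implementation, not the real formalism.

```python
# Toy check: does a penalized objective select "strawberry, no takeover"
# for some middle range of the penalty coefficient, instead of only ever
# flipping between "takeover" and "do nothing"? All numbers are made up.

# (task_reward, power_gained) for each of the three behaviors.
BEHAVIORS = {
    "strawberry_no_takeover": (1.0, 0.1),
    "strawberry_with_takeover": (1.5, 10.0),
    "do_nothing": (0.0, 0.0),
}

def penalized_score(behavior: str, lam: float) -> float:
    """Task reward minus lam times the power the behavior gains."""
    reward, power_gained = BEHAVIORS[behavior]
    return reward - lam * power_gained

def best_behavior(lam: float) -> str:
    """The behavior an optimizer of the penalized score would pick."""
    return max(BEHAVIORS, key=lambda b: penalized_score(b, lam))

# lam = 0: takeover wins; moderate lam: behavior 1 wins; huge lam: inaction.
sweep = {lam: best_behavior(lam) for lam in (0.0, 0.2, 5.0, 50.0)}
```

The desideratum in the comment corresponds to the middle regime existing at all: if the sweep only ever returns the takeover and do-nothing behaviors as the penalty coefficient varies, the penalty is not doing what we want.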
I can clarify my earlier statement on what struck me as a bit misleading in the narrative of the sequence. I agree that you distinguish between the AUP versions (though explicitly introducing different terms for them would help), so someone who is reading carefully would realize that the results for random rewards don’t apply to the agent-reward case. However, the overall narrative flow seems unnecessarily confusing and could unintentionally mislead a less careful reader (like myself 2 months ago). The title of the post “AUP: Scaling to Superhuman” does not suggest to me that this post introduces a new approach. The term “scaling” usually means making an existing approach work in more realistic / difficult settings, so I think it sets up the expectation that it would be scaling up AUP with random rewards. If the post introduces new problems and a new approach to address them, the title should reflect this. Starting this post by saying “we are pretty close to the impact measurement endgame” seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.
What I actually said was:
I think we’re plausibly quite close to the impact measurement endgame
First, the “I think”, and second, the “plausibly”. I think the “plausibly” was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement AUP_conceptual (“optimize the objective, without becoming more able to optimize the objective”), you don’t need additional ideas to get a superintelligence-safe impact measure.
Some thoughts on this discussion:
1. Here’s the conceptual comment and the math comment where I’m pessimistic about replacing the auxiliary set with the agent’s own reward.
Then:
However, the agent’s reward is usually not the true human utility, or a good approximation of it. If the agent’s reward was the true human utility, there would be no need to use an impact measure in the first place.
Hmm, I think you’re misunderstanding Vika’s point here (or at least, I think there is a different point, whether Vika was saying it or not). Here’s the argument, spelled out in more detail:
1. Impact to an arbitrary agent is change in their AU.
2. Therefore, to prevent catastrophe via regularizing impact, we need to have an AI system that is penalized for changing a human’s AU.
3. By assumption, the AI’s utility function R_A is different from the human’s R_H (otherwise there wouldn’t be any problem).
4. We need to ensure H can pursue R_H, but we’re regularizing A pursuing R_A. Why should we expect the latter to cause the former to happen?
One possible reason is that there’s an underlying factor, namely how much power A has; as long as this is low, any agent (including H) can pursue their own reward about as much as they could in A’s absence (this is basically the CCC). Then, if we believe that regularizing A’s pursuit of R_A keeps A’s power low, we would expect that H remains able to pursue R_H. I don’t really believe the premise there (unless you regularize so strongly that the agent does nothing).
With respect to my specific proposal in the superintelligent post, or the conceptual version?
Specific proposal.
If the conceptual version is “we keep A’s power low”, then that probably works.
If the conceptual version is “tell A to optimize R without becoming more able to optimize R”, then I have the same objection.
Why do you object to the latter?
I don’t know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me before I’d read your sequence (and so didn’t know what you were trying to say), I’d have given you a blank stare—the closest thing I have to an interpretation is “be myopic / greedy”, but that limits your AI system to the point of uselessness.
Like, “optimize for X” means “do stuff over a period of time such that X goes up as much as possible”. “Becoming more able to optimize for X” means “do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have”. The only difference between these two is actions that you can do for immediate reward.
(This is just saying in English what I was arguing for in the math comment.)
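The English definitions above can be put in standard RL notation (a sketch in my symbols, not the sequence’s: r_X is the reward for X, V_X^* the optimal value function, and s^∅ the state reached by doing nothing):

```latex
\text{optimize for } X:\qquad
  \max_\pi \; \sum_{t=0}^{T} r_X(s_t, a_t)

\text{become more able to optimize for } X:\qquad
  V_X^*(s_{t+1}) \;>\; V_X^*\!\left(s_{t+1}^{\varnothing}\right)

\text{where}\qquad
  V_X^*(s) \;=\; \max_a \left[\, r_X(s,a) + V_X^*(s') \,\right].
```

By the Bellman equation, each step of a high-return policy either collects immediate reward r_X or moves to a state of higher attainable value V_X^*, which is the sense in which the two notions only come apart on immediate-reward actions.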
If you’re managing a factory, I can say “Rohin, I want you to make me a lot of paperclips this month, but if I find out you’ve increased production capacity or upgraded machines, I’m going to fire you”. You don’t even have to behave greedily – you can plan for possible problems and prevent them, without upgrading your production capacity from where it started.
I think this is a natural concept and is distinct from particular formalizations of it.
Edit: consider the three plans:
1. Make 10 paperclips a day.
2. Make 10 paperclips a day, but also take over the planet and control a paperclip conglomerate which could turn out millions of paperclips each day, but which in fact never does.
3. Take over the planet and make millions of paperclips each day.
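The point of the three plans can be made concrete with a toy check (plan names and numbers are invented; this encodes only the informal factory rule, not the AUP formalism): plans 1 and 2 have identical output, and the rule separates them purely by the capacity gained.

```python
# Toy encoding of "make paperclips, but don't increase production
# capacity": the rule looks at capacity gained, not at output.
# All plan names and numbers are invented for illustration.

INITIAL_CAPACITY = 10  # paperclips/day the factory could already make

# plan -> (actual daily output, production capacity after the plan)
PLANS = {
    "make_10": (10, 10),
    "make_10_but_take_over": (10, 1_000_000),  # capacity gained, never used
    "take_over_and_make_millions": (1_000_000, 1_000_000),
}

def capacity_gain(plan: str) -> int:
    _, capacity_after = PLANS[plan]
    return max(0, capacity_after - INITIAL_CAPACITY)

def violates_rule(plan: str) -> bool:
    """True if the plan increased capacity, regardless of output."""
    return capacity_gain(plan) > 0
```

Plans 2 and 3 both violate the rule even though plan 2’s output matches plan 1’s, which is the distinction the list is drawing.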
Seems like that only makes sense because you specified that “increasing production capacity” and “upgrading machines” are the things that I’m not allowed to do, and those are things I have a conceptual grasp on. And even then—am I allowed to repair machines that break? What about buying a new factory? What if I force workers to work longer hours? What if I create effective propaganda that causes other people to give you paperclips? What if I figure out that by using a different source of steel I can reduce the defect rate? I am legitimately conceptually uncertain whether these things count as “increasing production capacity / upgrading machines”.
As another example, what does it mean to optimize for “curing cancer” without becoming more able to optimize for “curing cancer”?
Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there’s something here, but I want to talk about these points more fully in a later post. Or, think about it more and then explain why I agree with you.