Comments are organized around the section titles, which appear in bold. Apologies for the length, but this was a pretty long post, too! I wrote this in order, while reading, so I often mention something that you address later.
**Intuition Pumps:**
There are well-known issues with needing a special “Status quo” state. Figuring out what humans would consider the “default” action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it’s solved for now.
**Notation:**
Notationally, it’s a little weird to me that $Q_u$ doesn’t mention the timescale (e.g. $Q_u^{(m)}$). Are you implying that the choice of $m$ can be arbitrary and you’ll therefore just assume $m$ is some constant?
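For concreteness, here is roughly the kind of definition I'd want the superscript to pin down (my notation, and only my reading of yours):

$$Q_u^{(m)}(h_{<t}a_t) \;:=\; \max_{\pi}\, \mathbb{E}\big[\, u(h_{1:t+m}) \mid h_{<t}a_t,\ \pi \,\big],$$

i.e. the best expected $u$-value attainable by acting for the next $m$ steps.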
Theorem 1 doesn’t rely on any special properties of the default action—whenever $Q_{u_A}$ is different after any two actions, you can find another $u$ such that $Q_u$ is also different after those actions as long as your set $U$ is closed under one of several simple operations (including multiplying by −1!).
If you want attainable utility to be 0 after the agent shuts itself off, then your terminology was confusing. It would make more sense to me if you phrased this whole thing in terms of *reward*, which is what I would normally call a number that the agent receives at each time step. So rather than saying that $u(\text{empty tape})=0$, which makes it sound to me like you’re talking about the utility of a history that is entirely empty, I would understand it faster if you said that $r(h_{t-1}, \text{empty tape})=0$ (reward is zero whenever the most recent timestep is an empty tape).
In fact, even this talk about empty tape is a little bit misleading, because this agent doesn’t necessarily have an understanding of itself as having a physical embodiment with actual tape. It might be more accurate to say that in this environment there’s a special observation $O$ (the agent gets shut off, or the “empty tape” observation) that is always followed by another $O$ no matter what action is taken, and that your class of reward functions is required to assign zero reward whenever the last observation is $O$.
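In symbols, the framing I have in mind (with $O$ the shutoff observation and $r$ the per-step reward, as above):

$$o_t = O \;\Longrightarrow\; o_{t+1} = O \text{ for every } a_{t+1}, \qquad \text{and} \qquad r(h_{t-1}, O) = 0 \text{ for every } r \text{ in the class}.$$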
**Change in Expected Attainable Utility:**
Here we have the meat. The “impact” term is: for an average $u$ in your set (weighted by simplicity), how much does taking this action change $Q_u$, relative to taking the default action?
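Spelled out (with $w_u$ as my stand-in for the simplicity weighting and $\varnothing$ for the default action), I read the impact term as something like:

$$\text{Penalty}(h_{<t}, a_t) \;=\; \sum_{u \in U} w_u \,\big|\, Q_u(h_{<t}a_t) - Q_u(h_{<t}\varnothing) \,\big|.$$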
Just pausing here to think about whether this is doing the intended thing, and whether it might also do unintended things. The intended thing is something like preventing a UFAI from releasing a deadly superflu. If $m$ is shorter than the length scale the agent can plan over, it can get around the restriction by taking an action that it knows will be good but has low penalty. The numbers assigned might be misleading—by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don’t understand the scaling of the AI’s “primary” reward—if it sees a plan that could lead to 3^^^3 paperclips, it’s probably just going to do it, if it ever does anything at all (that is, if you look at all the cases with return rate reward/impact > 1, there might be lots that have high reward [well, relative to other rewards] and high impact). Also, if there are any of those aforementioned surprises in its counterfactual reasoning about what happens if it takes the default action, those show up here as corresponding surprises in what it thinks it’s allowed to do.
Will it add unintended goals? If the penalty coefficient is too large, will there be bad behavior? Nope (good job!) - in the limit of large penalty, it’s just going to take the default action.
But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn’t seem like what I want.
Using ImpactUnit is clever, but it doesn’t fully solve setting the penalty coefficient, because it doesn’t say anything about the scale of the reward function. If the reward function has lots of tiny numbers, they’ll be smaller than the penalty numbers, and vice versa if lots of numbers are near the top of the range. Your goal is to set the penalty coefficient right in the middle ground between prohibiting everything and allowing everything (assuming this middle ground has the properties you want; see concerns above), but there still seems to be guesswork involved.
Wait… is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.
**Modified Utility:**
Again, talking about this as a reward makes more sense to me than calling it utility. The penalty to the total reward of a plan is the sum of the penalties at each time step. Great.
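So in the reward framing, my mental model of the quantity being optimized is roughly (the coefficient $\lambda$ and the normalization by ImpactUnit are my guesses at your notation):

$$u_A(h_{1:T}) \;-\; \lambda \sum_{t=1}^{T} \frac{\text{Penalty}(h_{<t}, a_t)}{\text{ImpactUnit}}.$$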
I recognize and appreciate the effort put into explaining, but this particular explanation in terms of $h_{\text{inaction}}$ and $h_{\text{action}}$ plans didn’t really do it for me. Like I said, I just think of it in terms of having a reward and a penalty at each timestep of the plan, as in the equation.
**Examples:**
Yayyy, these are so nice!
The shutdown button example is very elegant.
All these utility functions sidestep my concerns by being very statistically similar, of similar complexity, and conveniently scaled relative to ImpactUnit. I’d be interested in how far you can change these things before you start getting bad behavior.
Great experiments!
The Beware of Dog experiment doesn’t really show what you say it shows, because the agent could have gone at an earlier cycle of the dog. The agent might just be going whenever, with no particular preference as long as it gets good reward.
**AUP Unbound:**
I think thinking about convergent instrumental goals (e.g. gathering money, when you have many possible utility functions to spend it on) really highlights an issue. If the penalty is too high, you collect no money, because the power it gives you is more than the benefit. But if the penalty is even slightly lower than the tipping point, you’ll collect as much money as you can, because it’s worth more in utility than it changes the weighted sum of other utilities. Because the utility function is bounded, if you have a utility function that’s shaped differently from the average you might get some sticking points and end up satisfied with a finite amount of money, but then again, you might not (if the initial difference in slope from the average is the largest sticking point).
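To make the tipping point concrete with made-up numbers: suppose each marginal dollar adds $0.002$ to $u_A$ and $0.001\lambda$ of scaled penalty. Then

$$\lambda < 2 \;\Rightarrow\; \text{collect every dollar you can}, \qquad \lambda > 2 \;\Rightarrow\; \text{collect none},$$

with essentially no setting in between, unless diminishing returns in $u_A$ happen to bite at a convenient point.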
As stated, the penalty calculation runs on pure correlation. So anything that “influences the agent’s action” in an EDT-violating way, or that depends on the output of the agent’s computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.
**Nknown:**
An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems to me like a pretty big power gain, so it will only act if acting can give it a similarly big power gain.
It might be possible to do this *relatively* safely by starting an agent, testing it on some tasks, and then turning it off and starting over.
Also I’m not sure these agents won’t acausally cooperate. They’re maximizing the utility times the probability of getting the history conditional on the action. This seems like the sort of setup that, if it was playing the prisoner’s dilemma against itself, might realize that P(observe cooperate | take the cooperate action) is large, and not worry too much about whether this correlation was causal or acausal.
**Desiderata:**
I think you’re giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas. That’s just how these things go.
Overall, I was very impressed! I definitely didn’t think this was going to have as nice properties as it does, at the start. I’m of the opinion that low-impact and corrigibility seem harder than the value loading problem itself (though easier to test and less bad to screw up), so I’m impressed by this progress even though I think there’s lots of room for improvement. I also thought the explanations and examples were really well-done. The target audience has to be willing to read through a pretty long post to get the gist of it, but TBH that’s probably fine (though academics do have to promote complicated work in shorter formats as well, like press releases, posters, 10-minute talks, etc.). I’ll probably have more to say about this later after a little digesting.
Thanks so much for the detailed commentary!

> There are well-known issues with needing a special “Status quo” state. Figuring out what humans would consider the “default” action and then using the right method of counterfactually evaluating its macro-scale effects (without simulating the effects of confused programmers wondering why it turned itself off, or similar counterfactual artifacts) is an unsolved problem. But we can pretend it’s solved for now.
On the contrary, the approach accounts for—and in fact benefits from—counterfactual reactions. The counterfactual reactions we would ideally have are quite natural: shutting the agent down if it does things we don’t like, and not shutting it down before the end of the epoch if it stops doing things entirely (an unsurprising reaction to low-impact agents). As you probably noticed later, we just specify the standby action as the default.
One exception to this is the long-term penalty noise imposed by slight variation in our propensity to shut down the agent, which I later flag as a potential problem.
> [there is change] as long as your set $U$ is closed under one of several simple operations (including multiplying by −1!).
False, as I understand it. This is a misconception I’ve heard from multiple people – including myself, the first time I thought to prove this. Consider again the line:
Suppose $u$ rates trajectories in which it ends up in A, B, or C as −1, and in D as 1, and that $\lnot u := -u$. If the agent is at A and $m=2$, moving right increases $Q_u$ while keeping $Q_{\lnot u}$ constant.
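Here's a quick toy check of that example (my own minimal formalization for this comment: four states in a line, and I read $Q_u$ as simply the best $u$-value reachable within $m$ steps):

```python
# Four states on a line: A - B - C - D.
# u assigns -1 to ending up in A, B, or C and +1 to ending up in D; ¬u := -u.
STATES = ["A", "B", "C", "D"]
u = {"A": -1, "B": -1, "C": -1, "D": 1}
neg_u = {s: -v for s, v in u.items()}

def neighbors(s):
    i = STATES.index(s)
    return {STATES[j] for j in (i - 1, i, i + 1) if 0 <= j < len(STATES)}

def reachable(s, m):
    """All states reachable from s within m moves (staying put is allowed)."""
    frontier = {s}
    for _ in range(m):
        frontier = set().union(*(neighbors(x) for x in frontier))
    return frontier

def Q(util, s, m=2):
    """Toy attainable utility: the best util-value reachable within m steps."""
    return max(util[x] for x in reachable(s, m))

print(Q(u, "A"), Q(u, "B"))          # -1  1  -> Q_u increases when moving right
print(Q(neg_u, "A"), Q(neg_u, "B"))  #  1  1  -> Q_¬u stays the same
```

So $Q_u$ changing does not force $Q_{\lnot u}$ to change as well.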
> $u(\text{empty tape})=0$, which makes it sound to me like you’re talking about the utility of a history that is entirely empty
I am.
> In fact, even this talk about empty tape is a little bit misleading, because this agent doesn’t necessarily have an understanding of itself as having a physical embodiment with actual tape.
We’re only (formally) talking about a Cartesian agent, right?
> The numbers assigned might be misleading—by making everything be the same from a low-complexity perspective, an agent might be able to achieve a big impact on a high-complexity goal. And conversely, the numbers might be wrong because we don’t understand the scaling of the AI’s “primary” reward—if it sees a plan that could lead to 3^^^3 paperclips, it’s probably just going to do it, if it ever does anything at all
I’m not sure whether you still have this thought later, but the first point is addressed by my comments in “utility selection”. Secondly, the primary $u_A$ is also bounded in $[0,1]$.
> But imagine a car driving down a narrow tunnel that then opens up into an intersection. A penalty scaling that is optimal in the tunnel will cause the car to shut down as soon as it comes near the intersection. That doesn’t seem like what I want.
This is more related to the question of “how can it do things where interruption would be impactful?” A chauffeur-$u_A$ agent wouldn’t bother going down the tunnel itself, and would probably just make a self-driving car that would only require one activation action. If it predicts that the effect of activating the car would be low impact (and also not make us more or less likely to shut it down), it’ll do that. I don’t see a problem with the penalty scaling here, but maybe I haven’t quite understood your point.
> Wait… is ImpactUnit an actual, physical machine that has to be capable of actually producing the effect? Does the agent have any incentive to preserve the physical impact unit? I guess this is a reason to update ImpactUnit during the course of plans and use the minimum recorded ImpactUnit, but still, this seems ugly.
Yes, and provably yes (as in, it’ll never increase it on purpose). Why does this seem ugly? It has a reference action that immediately uses a tiny amount of resources; this then lets us define a budget.
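As I’d gloss it (glossing over the exact normalization): the unit is the penalty of that reference action, and the budget then follows from it,

$$\text{ImpactUnit} := \text{Penalty}(h_{<t}, a_{\text{unit}}), \qquad \text{total impact} \;\le\; N \cdot \text{ImpactUnit},$$

where $a_{\text{unit}}$ is the resource-using reference action.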
> The Beware of Dog experiment doesn’t really show what you say it shows, because the agent could have gone at an earlier cycle of the dog.
I checked this by increasing plan length—it is indeed waiting until near the end of the plan.
> But if the penalty is even slightly lower than the tipping point, you’ll collect as much money as you can, because it’s worth more in utility than it changes the weighted sum of other utilities.
I don’t understand why this isn’t taken care of by $u_A$ being bounded. Diminishing returns will kick in at some point, and in any case we proved that the agent will never choose to have more than N•ImpactUnit of impact.
> As stated, the penalty calculation runs on pure correlation. So anything that “influences the agent’s action” in an EDT-violating way, or that depends on the output of the agent’s computation itself (e.g. XOR blackmail) will give a weird (possibly undefined or unprincipled) result.
I don’t see why, but I also don’t know much DT yet. I’ll defer discussion of this matter to others. Alternatively, ask me in a few months?
> An unaligned agent that is having its N slowly increased by humans is going to do nothing until it is far, far too late. This is because the default action still leads to its N being increased, which seems to me like a pretty big power gain, so it will only act if acting can give it a similarly big power gain.
First, the agent grades future plans using its present N. Second, this isn’t a power gain, since none of the $U_A$ utilities are AUP—how would this help arbitrary maximizers wirehead? Third, agents with different N are effectively maximizing different objectives.
> Also I’m not sure these agents won’t acausally cooperate.
They might, you’re correct. What’s important is that they won’t be able to avoid penalty by acausally cooperating.
> I think you’re giving out checkmarks too easily. What seem to you like minor details that just need a little straightening up will, a third of the time every time, contain hidden gotchas.
This is definitely a fair point. My posterior on handling these “gotchas” for AUP is that fixes are rather easily derivable – this is mostly a function of my experience thus far. It’s certainly possible that we will run across something that AUP is fundamentally unable to overcome, but I do not find that very likely right now. In any case, I hope that the disclaimer I provided before the checkmarks reinforced the idea that not all of these have been rock-solid proven at this point.