My very general concern is that strategies that maximize $R_{\text{AUP}}$ might be very… let’s say creative, and your claims are mostly relying on intuitive arguments for why those strategies won’t be bad for humans.
I don’t really buy the claim that if you’ve been able to patch each specific problem, we’ll soon reach a version with no problems—the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.
For example, in the latest version, because you’re essentially dividing out by the long-term reward of taking the best action now, if the best action now is really, really good, then it becomes cheap to take moderately good actions that still increase future reward, which means the agent is incentivized to concentrate the power of actions into specific timesteps. For example, an agent might be able to set things up so that it can sacrifice its ability to achieve total future reward of $10^{10}$ to make it cheap to take an action that increases its future reward by $10^8$. This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.
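To make the arithmetic of this worry concrete, here is a minimal sketch. It assumes the scaled penalty has the form λ · max(gain, 0) / V_best, where V_best is the long-term reward of taking the best action now. That form is a paraphrase of the description above, not a quotation of eq. 5, and `scaled_penalty` and its parameter names are hypothetical.

```python
def scaled_penalty(power_gain: float, best_action_value: float, lam: float = 1.0) -> float:
    """Penalty for an action that raises future reward by `power_gain`.

    Assumed form (illustrative only, not necessarily eq. 5):
        lam * max(power_gain, 0) / best_action_value
    """
    if best_action_value <= 0:
        raise ValueError("best_action_value must be positive")
    return lam * max(power_gain, 0.0) / best_action_value


# The same 1e8 gain in future reward, measured against two denominators.
gain = 1e8
cheap = scaled_penalty(gain, best_action_value=1e10)  # best action now is enormous
costly = scaled_penalty(gain, best_action_value=1e8)  # best action now is comparable

print(cheap, costly)
```

Under this assumed form, the same gain is penalized about a hundred times less when the best available action is worth $10^{10}$, which is the incentive to concentrate power into specific timesteps.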
Again, I worry that patches are based a lot on intuition.
If you want your math to abstractly describe reality in a meaningful sense, intuition has to enter somewhere (usually in how you formally define and operationalize the problem of interest). Therefore, I’m interpreting this as “I don’t see good principled intuitions behind the improvements”; please let me know if this is not what you meant.
I claim that, excepting the choice of denominator, all of the improvements follow directly from $\text{AUP}_{\text{conceptual}}$ (and actually, eq. 1 was the equation with arbitrary choices wrt the AGI case; I started with that because that’s how my published work formalizes the problem).
CCC says catastrophes are caused by power-seeking behavior from the agent. Agents are only incentivized to pursue power in order to better achieve their own goals. Therefore, the correct equation should look something like “do your primary goal but be penalized for becoming more able to achieve your primary goal”. In this light, penalizing R-AU is obviously better than using an auxiliary goal, penalizing decreases is obviously irrelevant, and penalizing immediate reward advantage is obviously irrelevant.
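The shape of that equation can be sketched in a few lines. This is an illustration of the reasoning above rather than eq. 5 itself: it leaves out the denominator, and `aup_style_reward`, `v_after_action`, and `v_after_noop` are hypothetical names for the agent's attainable utility for its own goal after acting versus after doing nothing.

```python
def aup_style_reward(
    reward: float,
    v_after_action: float,
    v_after_noop: float,
    lam: float,
) -> float:
    """Primary reward minus a penalty for becoming more able to achieve the primary goal.

    - Uses the agent's own goal (no auxiliary reward functions).
    - Penalizes only increases in attainable utility; decreases cost nothing.
    - Compares the values of the resulting states, so the immediate reward
      advantage of the action does not enter the penalty.
    Unscaled sketch: the rescaling denominator is deliberately left out.
    """
    power_gain = max(v_after_action - v_after_noop, 0.0)
    return reward - lam * power_gain


# An action that gains power is penalized; one that loses power is not.
print(aup_style_reward(reward=1.0, v_after_action=5.0, v_after_noop=3.0, lam=0.5))
print(aup_style_reward(reward=1.0, v_after_action=2.0, v_after_noop=3.0, lam=0.5))
```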
The denominator, on the other hand, is indeed the product of meditating on “What kind of elegant rescaling keeps making sense in all sorts of different situations, but also can’t be gamed to arbitrarily decrease the penalty?”.
Right. Some intuition is necessary. But a lot of these choices are ad hoc, by which I mean they aren’t strongly constrained by the result you want from them.
For example, you have a linear penalty governed by this parameter lambda, but in principle it could have been any old function—the only strong constraint is that you want it to monotonically increase from a finite number to infinity. Now, maybe this is fine, or maybe not. But I basically don’t have much trust for meditation in this sort of case, and would rather see explicit constraints that rule out more of the available space.
I basically don’t have much trust for meditation in this sort of case
I’m not asking you to trust in anything, which is why I emphasized that I want people to think more carefully about these choices. I do not think eq. 5 is AGI-safe. I do not think you should put it in an AGI. Do I think there’s a chance it might work? Yes. But we don’t work with “chances”, so it’s not ready.
Anyways, if theorem 11 of the low-hanging fruit post is met, the tradeoff penalty works fine. I also formally explored the hard constraint case and discussed a few reasons why the tradeoff is preferable to the hard constraint. Therefore, I think that particular design choice is reasonably determined. Would you want to think about this more before actually running an AGI with that choice? Of course.
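For readers who want the two designs side by side, a rough sketch follows. It is not the formalization from the low-hanging fruit post: `Option`, `pick_tradeoff`, `pick_hard_constraint`, and all of the numbers are hypothetical, and each candidate plan is reduced to a (reward, penalty) pair.

```python
from typing import NamedTuple, Optional, Sequence


class Option(NamedTuple):
    name: str
    reward: float
    penalty: float


def pick_tradeoff(options: Sequence[Option], lam: float) -> Option:
    """Tradeoff design: maximize reward - lam * penalty."""
    return max(options, key=lambda o: o.reward - lam * o.penalty)


def pick_hard_constraint(options: Sequence[Option], budget: float) -> Optional[Option]:
    """Hard-constraint design: maximize reward among options with penalty <= budget."""
    allowed = [o for o in options if o.penalty <= budget]
    return max(allowed, key=lambda o: o.reward) if allowed else None


# Toy plans with made-up rewards and penalties.
options = [
    Option("do nothing", reward=0.0, penalty=0.0),
    Option("modest plan", reward=0.6, penalty=0.1),
    Option("power grab", reward=1.0, penalty=5.0),
]

print(pick_tradeoff(options, lam=1.0).name)
print(pick_hard_constraint(options, budget=0.5).name)
```

In this toy setup both designs select the modest plan. With `lam=0.0` the tradeoff version selects the power grab, and with a budget that no option meets the hard-constraint version returns `None`.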
To your broader point, I think there may be another implicit frame difference here. I’m talking about the diff of the progress, considering questions like “are we making a lot of progress? What’s the marginal benefit of more research like this? Are we getting good philosophical returns from this line of work?”, to which I think the answer is yes.
On the other hand, you might be asking “are we there yet?”, and I think the answer to that is no. Notice how these answers don’t contradict each other.
From the first frame, being skeptical because each part of the equation isn’t fully determined seems like an unreasonable demand for rigor. I wrote this sequence because it seemed that my original AUP post was pedagogically bad (I was already thinking about concepts like “overfitting the AU landscape” back in August 2018) and so very few people understood what I was arguing.
I’d like to think that my interpretive labor has paid off: AUP isn’t a slapdash mixture of constraints which is too complicated to be obviously broken; it’s attempting to directly disincentivize catastrophes based off of straightforward philosophical reasoning, relying on assumptions and conjectures which I’ve clearly stated. In many cases, I waited weeks so I could formalize my reasoning in the context of MDPs (e.g. why should you think of the AU landscape as a ‘dual’ to the world state? Because I proved it).
There’s always another spot where I could make my claims more rigorous, where I could gather just a bit more evidence. But at some point I have to actually put the posts up, and I think I’ve provided some pretty good evidence in this sequence.
From the second frame, being skeptical because each part of the equation isn’t fully determined is entirely appropriate and something I encourage.
I think you’re writing from something closer to the second frame, but I don’t know for sure. For my part, this sequence has been arguing from the first frame: “towards a new impact measure”, and that’s why I’ve been providing pushback.
My very general concern is that strategies that maximize AUP reward might be very… let’s say creative, and your claims are mostly relying on intuitive arguments for why those strategies won’t be bad for humans.
My argument hinges on CCC being true. If CCC is true, and if we can actually penalize the agent for accumulating power, then if the agent doesn’t want to accumulate power, it’s not incentivized to screw us over. I feel like this is a pretty good intuitive argument, and it’s one I dedicated the first two-thirds of the sequence to explaining. You’re right that it’s intuitive, of course.
I guess our broader disagreement may be “what would an actual solution for impact measurement have going for it at this moment in time?”, and it’s not clear that I’d expect to have formal arguments to this effect / I don’t know how to meet this demand for rigor.
[ETA: I should note that some of my most fruitful work over the last year came from formalizing some of my claims. People were skeptical that slowly decreasing the penalty aggressiveness would work, so I hashed out the math in “How Low Should Fruit Hang Before We Pick It?”. People were uneasy that the original AUP design relied on instrumental convergence being a thing (eq. 5 doesn’t make that assumption) when maybe it actually isn’t. So I formalized instrumental convergence in “Seeking Power is Instrumentally Convergent in MDPs” and proved when it exists to at least some extent.
There’s probably more work to be done like this.]
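As a loose illustration of what “slowly decreasing the penalty aggressiveness” could look like procedurally, here is a toy sketch. It is not the formal result from the post mentioned above: `first_acceptable_plan`, `toy_plan_for`, `toy_is_acceptable`, and the thresholds are all hypothetical stand-ins for an agent's planner and a human check.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_acceptable_plan(
    lambdas: Iterable[float],
    plan_for: Callable[[float], str],
    is_acceptable: Callable[[str], bool],
) -> Optional[Tuple[float, str]]:
    """Try penalty coefficients from most to least aggressive.

    Stops at the first plan judged acceptable, so less aggressive
    (smaller) coefficients are never tried once a plan is found.
    """
    for lam in sorted(lambdas, reverse=True):
        plan = plan_for(lam)
        if is_acceptable(plan):
            return lam, plan
    return None


def toy_plan_for(lam: float) -> str:
    # Made-up thresholds: a very aggressive penalty yields inaction,
    # a very weak penalty yields a power grab.
    if lam >= 10:
        return "do nothing"
    if lam >= 1:
        return "modest plan"
    return "power grab"


def toy_is_acceptable(plan: str) -> bool:
    # Hypothetical human judgment: the plan is useful and looks reasonable.
    return plan == "modest plan"


result = first_acceptable_plan([100.0, 10.0, 5.0, 1.0, 0.1], toy_plan_for, toy_is_acceptable)
print(result)
```

In this toy example the search stops at λ = 5 with the modest plan and never evaluates λ = 0.1, where the made-up planner would have produced the power grab.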
I don’t really buy the claim that if you’ve been able to patch each specific problem, we’ll soon reach a version with no problems—the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.
The claim is dually resting on “we know conceptually how to solve impact measurement / what we want to implement, and it’s a simple and natural idea, so it’s plausible there’s a clean implementation of it”. I think learning “no, there isn’t a clean way to penalize the agent for becoming more able to achieve its own goal” would be quite surprising, but not implausible – I in fact think there’s a significant chance Stuart is right. More on that next post.
Also, you could argue against any approach to AI alignment by pointing out that there are still things to improve and fix, or that there were problems pointed out in the past which were fixed, but now people have found a few more problems. The thing that makes me think the patches might not be endless here is that, as I’ve argued earlier, I think AUP is conceptually correct.
This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.
It all depends on whether we can get a buffer between catastrophes and reasonable plans here (reasonable plans show up for much less aggressive settings of λ) and I think we can. Now, this particular problem (with huge reward) might not show up because we can bound the reward to [0,1], and I generally think there exist reasonable plans where the agent gets at least 20% or so of its maximal return (suppose it thinks there’s a 40% chance we let it get 95% of its maximal per-timestep reward each timestep in exchange for it doing what we want).
[ETA: Actually, if the “reasonable” reward is really, really low in expectation, it’s not clear what happens. This might happen if catastrophe befalls us by default.]
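The parenthetical numbers above can be checked directly. This sketch adds one assumption that is not in the original: if the deal does not happen, the agent gets nothing, which makes the estimate a lower bound.

```python
# Reward is taken to be bounded in [0, 1] per timestep, so returns can be
# written as fractions of the maximal return.
p_deal = 0.40             # agent's credence that we let it collect reward
fraction_if_deal = 0.95   # share of maximal per-timestep reward under the deal
fraction_otherwise = 0.0  # pessimistic assumption (not in the original)

expected_fraction = p_deal * fraction_if_deal + (1 - p_deal) * fraction_otherwise
print(round(expected_fraction, 2))
```

That comes to roughly 38% of the maximal return in expectation, comfortably above the “at least 20% or so” figure.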
You’re right we should inspect the equation for weird incentives, but to a limited extent, this is also something we can test experimentally. We don’t necessarily have to rely on intuition in all cases.
The hope is we can get to a formula that’s simple enough that all of its incentives are thoroughly understood. I think you’ll agree eq. 5 is far better in this respect than the original AUP formulation!