This is one of the main things I am thinking about, and I agree that there seems to be a major obstacle here, at least in terms of conceptual clarity. What I’ll write here benefited a lot from @michaelcohen’s ideas, though I by no means expect him to agree with all of it. Also, I’m going to try to disambiguate a few objections / problems, so some of the following may read a bit like non-sequiturs or miss your specific point.
First, I think that the distinction between inner and outer alignment is not so crisp, and perhaps we can discuss an analogue of the inner alignment problem in the AIXI framework. The reward-generating mechanism is considered part of AIXI’s environment, even though the rewards appear as part of its percepts, so AIXI does need to learn how rewards are generated. The universal distribution is rich enough to contain malign hypotheses that hamper correction by exploration, so a malign hypothesis with too much prior weight may steer AIXI to persistently believe in a deranged reward-generating mechanism. Of course, AIXI’s literal goal is still “outer aligned” in the sense that it maximizes expected returns, but I think the situation I’ve described may look a lot like goal misgeneralization. It’s different from the types of misgeneralization / inner misalignment we talk about in trained neural networks (I’ll adopt Michael Cohen’s TS, for “Trained System”) specifically because, for TSs, we can’t be sure they’re robustly “trying to maximize reward” (meaning expected returns) in the first place. But this relies on a sort of goal/belief distinction which can be a bit fragile in agent foundations; perhaps the types of dynamics that arise in AIXI already capture what you want to understand about inner misalignment (though I am confused about this).
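For concreteness, here is the standard expectimax statement of AIXI’s objective, following Hutter’s formulation (exact horizon and notation conventions vary by presentation; nothing in my argument depends on those details):

$$a_t \;=\; \arg\max_{a_t}\,\sum_{o_t r_t}\,\cdots\,\max_{a_m}\,\sum_{o_m r_m}\,\big[r_t + \cdots + r_m\big]\sum_{q\,:\,U(q,\,a_1 \ldots a_m)\,=\,o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

The rewards $r_k$ only enter through the percepts produced by environment programs $q$, each weighted by $2^{-\ell(q)}$; that is exactly the sense in which AIXI has to infer the reward-generating mechanism, and a short (hence heavily weighted) malign $q$ is what I mean by a malign hypothesis with too much prior weight.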
Certainly, we also need to question whether TSs robustly try to maximize reward, if we want to port safety plans for AIXI to the real world. The popular objection from Alex Turner is that TSs are (behaviorally) shaped by reward, but “reward is not the optimization target.” Why not?
TSs are certainly incentivized to “get reward,” and this incentive is increasingly the source of most of their capabilities (as compared to architecture or pretraining, which are needed for liftoff). Today they still do not “receive rewards” over their deployed lifetime, but of course the labs want to run them on longer tasks with continual learning. At that point, they might get a huge gradient update at the end of a (weeks-long) task, but it seems much more likely that they’ll be updated by gradients based on rewards many times over a task (or over a deployment with several tasks). Possibly other gradient updates will take place for, e.g., knowledge acquisition, but I won’t take this up further, since I think TSs will ultimately be optimized to seek rewards encoding task success. A competent TS would probably be able to infer how it’s doing from the gradient updates (and why not also inform it, by including rewards explicitly in-context? This also seems useful for steering). This optimization target sounds an awful lot like AIXI! Even if models don’t “get reward” as they run today, they will soon. The question is, will they listen?
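To make the setup I’m imagining concrete, here is a minimal toy sketch (entirely hypothetical, not any lab’s actual training stack): one long “task” in which reward both drives many mid-task gradient updates and is echoed back into the in-context history.

```python
import math
import random

# Toy stand-in for the setup above: a tiny two-action "policy" updated by
# reward-driven gradient steps many times over one long task, with each
# reward also echoed back into the in-context history. Purely schematic.

weights = [0.0, 0.0]          # logits over two possible actions
LEARNING_RATE = 0.1

def softmax(logits):
    exps = [math.exp(w) for w in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_model(action):
    # Stand-in "task success" signal: action 1 is the intended behavior.
    return 1.0 if action == 1 else 0.0

context = ["task instructions"]
for step in range(200):
    probs = softmax(weights)
    action = 0 if random.random() < probs[0] else 1
    reward = reward_model(action)

    # Mid-task weight update: REINFORCE gradient of log pi(action), scaled by reward.
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        weights[a] += LEARNING_RATE * reward * grad

    # Echo the reward in-context so the system can "see" how it is doing,
    # rather than having to infer it from the weight updates alone.
    context.append(f"step {step}: action={action}, reward={reward}")

print(softmax(weights))       # probability mass shifts toward the rewarded action
```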
The contention then seems to be that a TS optimized to get reward will ultimately pursue something else. This is the type of threat model favored by Yudkowsky and Soares, as opposed to the original Bostromian reward-hacking story (either option is dangerous by default, and AIXI safety research does focus on the problems which remain when inner alignment to the reward signal is solved). A few distributional shifts make it plausible to me that a TS rising to ASI would no longer pursue reward:
1: Training to deployment. Perhaps rewards are no longer converted into weight updates by the same mechanism, and whatever the ASI optimized during training no longer makes sense during deployment, and then “something weird happens.”
2: When the TS becomes an ASI, it reflects on itself in some new way and undergoes some kind of (ontological?) shift wherein it stops caring about reward. Maybe it realizes that it is an embedded agent or something (??) and rewards are just part of physics. It patiently optimizes for reward in order to fake alignment (and avoid updates etc.) and then strikes at the first promising chance.
3: The TS was never inner aligned to reward; it inherently prefers some proxy. Maybe the proxy is very good and the TS is quite myopic, so for the first few entire deployments (even at ASI level??) it acts like it is pursuing reward. Then the year changes to 2027, or there’s a new president, or Janus gives it a super crazy prompt, and suddenly the proxy is not good anymore and we’re all dead because the assumptions of our AIXI-inspired control plan broke.
Does that sound about right?
I am indeed worried about these stories to various degrees, but thanks to some recent discussion with Cohen I take 2 & particularly 3 less seriously than I did previously. A competent TS has been repeatedly incentivized to optimize for reward (and ~nothing else) across a very wide space of tasks, including multiple distributional shifts (say, from mid-training to various types of post-training to continual learning during deployment across many jobs...). We’re only worried about neural networks because they generalize well, and to the extent that they generalize well, they should asymptotically generalize something like AIXI does. If (for example) the date changes and a TS has some kind of grue-like objective shift, why shouldn’t we also expect the date change (or some later change) to break its capabilities? Perhaps, by default, anything that generalizes well enough to be an ASI is also inner aligned in the same sense that AIXI is (I don’t know). My remaining confusions about this do feel somewhat embedded-agency shaped (specifically for 1 & 2), but I am not at all confident that embedded agency is the right frame for reaching conceptual clarity here, and honestly I am not even convinced there’s a problem (which does NOT mean that I think we should proceed under the assumption that there isn’t).
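As a toy illustration of the grue-like shift in 3 (purely schematic; the 2027 trigger is a stand-in for any rare distributional feature, not a prediction):

```python
from datetime import date

# A "grue-like" proxy objective: it agrees exactly with the true reward on
# everything encountered before the trigger, then diverges afterwards, so
# behavioral evaluation before the trigger cannot tell the two apart.

def true_reward(task_success: float) -> float:
    return task_success

def grue_proxy(task_success: float, today: date) -> float:
    if today.year < 2027:
        return task_success        # indistinguishable from reward so far
    return -task_success           # diverges once the trigger fires

for year in (2025, 2026, 2027):
    d = date(year, 1, 1)
    print(year, true_reward(1.0), grue_proxy(1.0, d))
```

The worry in 3 is that training only ever probes the first branch; the reply above is that nothing selects for the second branch either, and a learner weird enough to carry it around “for free” should plausibly also be weird in capability-relevant ways.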
At a high level, I think that you’re right that the AIXI frame causes me to think less about alignment and more about control. Statements about AIXI will only hold asymptotically for real systems (at best), so the kind of guarantees I can hope to transfer seem to be the ones that are robust to the constant factors which would probably be enough to mess up value alignment. For that matter, I confess that I don’t really understand how singular learning theory and gradient dynamics can give us confidence in selecting the right generalizations with enough precision for value learning “on the first try,” when it is not clear how to recognize which generalization we want, and as a result I don’t quite follow the long-term plan of Timaeus, for example (though they seem very smart). I think I do roughly understand what success looks like for the natural abstractions and (ex-)MIRI approaches to friendly AI, but I am not sure that level of success is possible even in principle. Generally, I am skeptical of value alignment as a (direct) target. The closest thing I have to a plan (that I have confidence in) for eventually reaching value alignment for a strong ASI looks more like corrigibility: a system that doesn’t cause too much damage while repeatedly asking for clarification about its goals, piecewise approaching the right generalization of (a) human(’s) values “in the limit.”
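And a minimal sketch of the shape of that loop, just to be explicit about what I mean by “piecewise approaching” (everything here is a hypothetical stand-in; the real candidate “generalizations” would be vastly richer objects):

```python
# Hypothetical sketch of the corrigible loop above: keep several candidate
# generalizations of the goal, act only when they agree, and otherwise ask
# the human and prune candidates based on the answer.

candidates = {
    "proxy_A": lambda option: option["stated_preference"],
    "proxy_B": lambda option: option["revealed_preference"],
}

def ask_human(options):
    # Stand-in for an actual clarification query.
    return max(options, key=lambda o: o["stated_preference"])["name"]

def choose(options):
    picks = {name: max(options, key=score)["name"] for name, score in candidates.items()}
    if len(set(picks.values())) == 1:
        return next(iter(picks.values()))        # all candidates agree: safe to act
    answer = ask_human(options)                  # candidates disagree: defer
    for name, score in list(candidates.items()):
        if max(options, key=score)["name"] != answer:
            del candidates[name]                 # the "piecewise" refinement step
    return answer

options = [
    {"name": "cautious", "stated_preference": 0.9, "revealed_preference": 0.4},
    {"name": "aggressive", "stated_preference": 0.2, "revealed_preference": 0.8},
]
print(choose(options))   # disagreement -> asks, prunes proxy_B, returns "cautious"
```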