My take on the catch-22 between wireheading and inner misalignment is that I favor inner-aligning AIs to the reward function and accepting the low risk of wireheading, rather than allowing inner misalignment to undo all our work.
My basic reasons for choosing the wirehead option over risking inner misalignment are the following views:
I think translating our intent to machines without too much Goodharting is likely to be reasonably easy by default. This is because I'm more willing than most people on LW to assume that alignment generalizes from easy cases to hard cases, and because the method used, essentially creating large datasets that do encode human values, turns out to work surprisingly well for reasons that somewhat generalize to the harder cases:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#8gjhsKwq6qQ3zmXeq
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
As a corollary, this also means we can create robust reward functions out of that dataset that prevent a lot of common reward hacking strategies. So long as the values are trained in early, such that the AI internalizes its values before it is superhumanly capable at everything, we can let instrumental convergence do the rest of the work for us.
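To make this concrete, here is a minimal sketch (my own illustration, not a method from the linked posts) of turning a hypothetical human-values preference dataset into a learned reward function, using the standard Bradley-Terry pairwise loss; the tiny model, placeholder data, and sizes are all assumptions for illustration.

```python
# Sketch: train a reward model on a hypothetical dataset of human-values
# preference pairs. Everything here (model size, placeholder data) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64

class TinyRewardModel(nn.Module):
    """Scores a sequence of token ids with a single scalar reward."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, 1)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        pooled = self.embed(tokens).mean(dim=1)   # crude bag-of-embeddings pooling
        return self.head(pooled).squeeze(-1)      # (batch,) scalar rewards

# Placeholder "values dataset": pairs where `chosen` better reflects human values
# than `rejected`. In practice these would come from the curated dataset above.
chosen = torch.randint(0, VOCAB, (32, 16))
rejected = torch.randint(0, VOCAB, (32, 16))

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Bradley-Terry pairwise loss: push reward(chosen) above reward(rejected).
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is just that a sufficiently good values dataset gives you a dense, trainable reward signal, which is the ingredient the "robust reward function" claim relies on.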
The main principle here is to put alignment data before or during capabilities data, not after. This is why RLHF and a whole lot of other post-training methods for alignment are so bad.
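As a rough illustration of that ordering principle (again my own sketch, with hypothetical `alignment_batches` and `capabilities_batches` sources), alignment data gets interleaved into the training stream from the start, rather than bolted on as a separate post-training phase the way RLHF-style pipelines do:

```python
# Sketch of the ordering principle: alignment data is mixed in throughout
# training, not appended after capabilities training. Names are placeholders.
import random

def curriculum(alignment_batches, capabilities_batches, alignment_frac=0.2, seed=0):
    """Yield training batches with alignment data interleaved throughout,
    so values are trained in before the model becomes highly capable."""
    rng = random.Random(seed)
    align_iter = iter(alignment_batches)
    cap_iter = iter(capabilities_batches)
    while True:
        source = align_iter if rng.random() < alignment_frac else cap_iter
        try:
            yield next(source)
        except StopIteration:
            # Simplification: stop when either source runs out.
            return

# Usage: every training step sees a mixed stream, never capabilities-then-alignment.
# for batch in curriculum(alignment_data, capabilities_data):
#     train_step(model, batch)
```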
We really shouldn’t screw up our outer alignment success by then introducing an inner-misaligned agent inside our AI.
Thankfully, I now see a clear answer to the question of whether we should promote wireheading over inner misalignment: it is better to wirehead on a robustly good reward function than to wade into the ocean of inner misalignment.