The information asymmetry problem between H and A seems like a major issue. For me, it’s the strongest argument for why transparency is a necessary ingredient of a solution to alignment. The argument against imitation learning and IRL is quite strong, in the sense that it seems like you can’t rely on either of them to capture the right behavior. These are stronger than the arguments against ambitious value learning because here we assume that H is smarter than A, which we could not do with ambitious value learning. So it does seem to me that direct supervision (with semi-supervised techniques and robustness) is the most likely path forward to solving the reward engineering problem.
There is also the question of whether it is necessary to solve the reward engineering problem. It certainly seems necessary in order to implement iterated amplification given current systems (where the distillation step will be implemented with optimization, which means that we need a reward signal), but might not be necessary if we move away from optimization or if we build systems using some technique other than iterated amplification (though even then it seems very useful to have a good reward engineering solution).
(Cross-posting Alignment Newsletter opinion here)
The information asymmetry problem between H and A seems like a major issue. For me, it’s the strongest argument for why transparency is a necessary ingredient of a solution to alignment. The argument against imitation learning and IRL is quite strong, in the sense that it seems like you can’t rely on either of them to capture the right behavior. These are stronger than the arguments against ambitious value learning because here we assume that H is smarter than A, which we could not do with ambitious value learning. So it does seem to me that direct supervision (with semi-supervised techniques and robustness) is the most likely path forward to solving the reward engineering problem.
There is also the question of whether it is necessary to solve the reward engineering problem. It certainly seems necessary in order to implement iterated amplification given current systems (where the distillation step will be implemented with optimization, which means that we need a reward signal), but might not be necessary if we move away from optimization or if we build systems using some technique other than iterated amplification (though even then it seems very useful to have a good reward engineering solution).