The existing literature on IDA (including a post about “reward engineering”) seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source
I’m confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of “outer alignment problems” or “using supervised learning for distillation”).