paulfchristiano comments on Aligning a toy model of optimization

paulfchristiano 30 Jun 2019 19:10 UTC
The idea is that “the task being trained” is something like: 50% what you care about at the object level, 50% the subtasks that occur in the evaluation process. The model may sometimes get worse at the evaluation process, or at the object-level task; you are just trying to optimize some weighted combination of the two.
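To make the weighted combination concrete, here is a minimal sketch. The function name `combined_loss`, the weight parameter `w_object`, and the loss values are all made up for illustration; only the 50/50 split comes from the comment above. It shows how the combined objective can improve even while one component gets worse.

```python
def combined_loss(object_loss: float, eval_loss: float, w_object: float = 0.5) -> float:
    """Weighted mix of the object-level loss and the evaluation-subtask loss.

    Hypothetical helper: w_object = 0.5 mirrors the 50/50 split described
    above; any weight in [0, 1] gives some weighted combination.
    """
    if not 0.0 <= w_object <= 1.0:
        raise ValueError("w_object must lie in [0, 1]")
    return w_object * object_loss + (1.0 - w_object) * eval_loss


# Illustrative numbers only: the object-level loss improves while the
# evaluation-subtask loss gets worse, yet the combined loss still falls.
before = combined_loss(object_loss=1.0, eval_loss=1.0)
after = combined_loss(object_loss=0.5, eval_loss=1.25)
print(before, after)
```

Here the optimizer would accept the second state, since it only sees the combination and not the individual components.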
There are a bunch of distinct difficulties here. One is that the distribution of “subtasks that occur in the evaluation process” is nonstationary. Another is that we need to set up the game so that doing both evaluation and the object-level task is not much harder than just doing the object-level task.