How optimistic are you that we could figure out how to shape the motivations or internal “goals” of our models (much more loosely defined than “mesa-objective”) via influencing the training objective/reward, the inductive biases of the model, the environments they’re trained in, some combination of these things, etc.?
That seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call “outer alignment”). I probably should have mentioned that too, I was taking it as a given but I really shouldn’t have.
For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.