Next, Nina argues that the fact that LLMs don’t directly encode your reward function makes them less likely to be misaligned, not more likely, as IABIED implies. I think she may be straw-manning the concerns here. She asks, “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples; they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model were, at every turn, searching over possible actions and choosing the one that maximizes this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
Can you expand a little bit on this? I don’t understand why replacing “here are some examples of world states and what actions led to good/bad outcomes, try to take actions which are similar to the ones which led to good outcomes in similar world states in the past” with “here’s a reward function directly mapping world-state → goodness” would be reassuring rather than alarming.
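To make the contrast I mean concrete, here’s a toy sketch of the two decision rules. Everything in it is a hypothetical stand-in I made up for illustration (states and actions as vectors, similarity as a dot product); it is not a description of how any real model is built or trained:

```python
import numpy as np

# Toy stand-ins only: states and actions are vectors, "similarity" is whatever
# function the caller passes in, and none of this reflects a real system.

def act_by_explicit_reward(state, candidate_actions, world_model, reward_fn):
    """Decision rule A: the agent carries an explicit reward function and, at
    every step, searches over candidate actions for the one whose predicted
    next world state scores highest under that function."""
    scores = [reward_fn(world_model(state, a)) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

def act_by_past_examples(state, candidate_actions, past_examples, similarity):
    """Decision rule B: no reward function is consulted at decision time. The
    agent prefers the candidate action that most resembles actions which led
    to good outcomes in past world states similar to the current one."""
    scores = []
    for action in candidate_actions:
        # Weight each remembered good (state, action) pair by how similar its
        # state was to the current state and its action to this candidate.
        score = sum(
            similarity(state, past_state) * similarity(action, past_action)
            for past_state, past_action, was_good in past_examples
            if was_good
        )
        scores.append(score)
    return candidate_actions[int(np.argmax(scores))]
```

The contrast I’m gesturing at: in rule A, whatever is mis-specified in `reward_fn` gets optimized directly at every single step, whereas in rule B there is no explicit target being maximized, only a pull toward what worked before in similar situations.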
Having more insight into exactly what past world states are most salient for choosing the next action, and why it thinks those world states are relevant, is desirable. But “we don’t currently have enough insight with today’s models for technical reasons” doesn’t feel like a good reason to say “and therefore we should throw away this entire promising branch of the tech tree and replace it with one that has had [major problems every time we’ve tried it](https://en.wikipedia.org/wiki/Goodhart%27s_law)”.
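As a toy illustration of that Goodhart-style failure (made-up numbers, nothing more): a written-down proxy reward can track what we actually want on familiar states and still be exactly wrong at the state a hard optimizer ends up selecting.

```python
import numpy as np

def true_goodness(x):
    # What we actually care about (not available to the optimizer).
    return x - 0.1 * x**2

def proxy_reward(x):
    # The written-down reward function: a decent approximation for small x,
    # but it keeps growing where true goodness turns around.
    return x

candidates = np.linspace(0, 20, 201)
best_by_proxy = candidates[np.argmax(proxy_reward(candidates))]
print(best_by_proxy, true_goodness(best_by_proxy))  # 20.0, -20.0: proxy-optimal, actually bad
print(true_goodness(5.0))                           # 2.5: a modest choice does fine
```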
Am I misinterpreting what you’re saying, though, and is there a different thing that everyone is on the same page about?