I think Lethality 21 is largely true, and it’s a big reason why I’m concerned about the alignment problem in general. I’m not invoking Alignment by Default here because I think we do need to push hard on the actual cognitive processes happening in the agent, not just its actions/output like in prosaic ML. Externalized reasoning gives you a pretty good way of doing that.
I do think Lethality 22 is probably just false. Human values latch on to natural abstractions (!) and once you have the right ontology I don’t think they’re actually that complex. Language models are probably the most powerful tool we have for learning the human prior / ontology.