When I think about “human-like reasoning” I’m mostly thinking about the causal structure of that reasoning: each step is causally connected to the other reasoning steps in the right kind of way. Luckily, it seems like there are lots of ways you could actually try to test for and enforce this in the model. We can break the usual compute graph of the LM and, say, corrupt some of the tokens in the chain of thought or some of the hidden representations, and see what happens, similar to what the ROME paper did.
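As a rough illustration, here is a minimal sketch of what a token-level corruption could look like with a Hugging Face causal LM. Everything specific here is an assumption on my part: the model name, the prompt, the “answer” token, and which position gets corrupted are all illustrative stand-ins, and this is corruption in the spirit of ROME-style causal tracing rather than a particular experimental setup.

```python
# Minimal sketch: corrupt one "reasoning" token and measure how much the
# model's probability on the answer token moves. A large drop suggests the
# answer causally depends on that step. All specifics below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# The middle tokens play the role of chain-of-thought steps (toy example).
prompt = "Two plus three is five. Five times two is"
answer = " ten"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_id = tokenizer(answer, add_special_tokens=False).input_ids[0]

def answer_prob(input_ids):
    """Probability the model assigns to the answer token after the prompt."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[answer_id].item()

baseline = answer_prob(prompt_ids)

# Replace one token inside the reasoning span with a random vocabulary item.
corrupt_pos = 4  # assumption: an index inside the chain-of-thought span
corrupted_ids = prompt_ids.clone()
corrupted_ids[0, corrupt_pos] = torch.randint(len(tokenizer), (1,)).item()
corrupted = answer_prob(corrupted_ids)

print(f"p(answer | clean)     = {baseline:.4f}")
print(f"p(answer | corrupted) = {corrupted:.4f}")
```

The same idea extends to hidden representations: instead of swapping a token, you could register a forward hook on a transformer block and add noise to (or patch in) the hidden state at a given position, which is closer to what ROME’s causal tracing actually does.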
I’m actively thinking about how to put this into practice right now and I’m relatively optimistic about the idea.