tamera comments on Externalized reasoning oversight: a research direction for language model alignment

tamera 4 Aug 2022 18:52 UTC
2 points
Ideally, if the displayed reasoning isn’t really causally responsible for the conclusion, the tests we create should pick that up. Tests 2 and 4 in the linked doc start to get at this (https://hackmd.io/@tamera/HJ7iu0ST5-REMOVE, but remove “-REMOVE”), and we could likely develop better tests along these same lines.
Even if we fail to create tests that check for this, the architectures, training processes, or prompting strategies we develop might give us more assurance that the reasoning is legit. For example, the selection-inference strategy used by DeepMind breaks the reasoning into multiple context windows, which might reduce the chance of long chains of motivated reasoning.
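
To make the "multiple context windows" point concrete, here is a rough toy sketch of a selection-inference style loop. It is not DeepMind's implementation: the `select` and `infer` callables are hypothetical stand-ins for separate language model calls, and the arrow-string facts are made up purely for illustration. The thing to notice is that in this sketch the inference step is only handed the selected facts, never the question or the earlier steps, so each step has less room to bend toward a preferred conclusion.

```python
from typing import Callable, List, Optional, Tuple

Trace = List[Tuple[List[str], str]]


def selection_inference(
    question: str,
    facts: List[str],
    select: Callable[[str, List[str]], List[str]],
    infer: Callable[[List[str]], Optional[str]],
    steps: int,
) -> Trace:
    """Alternate selection and inference, each in its own 'context window'."""
    context = list(facts)
    trace: Trace = []
    for _ in range(steps):
        # Selection sees the question and the current facts.
        selected = select(question, context)
        # Inference sees ONLY the selected facts (assumption of this sketch).
        new_fact = infer(selected)
        if new_fact is None:
            break
        trace.append((selected, new_fact))
        context.append(new_fact)
    return trace


# Hypothetical toy stand-ins for the two model calls.
def toy_select(question: str, context: List[str]) -> List[str]:
    """Pick two facts 'a -> b' and 'b -> c' whose composition is not yet known."""
    for first in context:
        for second in context:
            a, b = first.split(" -> ")
            c, d = second.split(" -> ")
            if b == c and f"{a} -> {d}" not in context:
                return [first, second]
    return []


def toy_infer(selected: List[str]) -> Optional[str]:
    """Compose the two selected facts; return None if nothing was selected."""
    if len(selected) != 2:
        return None
    a, _ = selected[0].split(" -> ")
    _, d = selected[1].split(" -> ")
    return f"{a} -> {d}"


if __name__ == "__main__":
    facts = ["rain -> wet", "wet -> slippery", "slippery -> caution"]
    for selected, new_fact in selection_inference(
        "Does rain call for caution?", facts, toy_select, toy_infer, steps=3
    ):
        print(selected, "=>", new_fact)
```

Because every step records exactly which facts it was given, the returned trace is also something the kinds of tests mentioned above could inspect one step at a time.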