Ideally, if the displayed reasoning isn’t really causally responsible for the conclusion, the tests we create should pick that up. Tests 2 and 4 in the linked doc start to get at this (https://hackmd.io/@tamera/HJ7iu0ST5-REMOVE, but remove “-REMOVE”), and we could likely develop better tests along these same lines.
Even if we failed to create tests that check for this, the architectures / training processes / prompting strategies that we choose might give us more assurance that the reasoning is legit. For example, the selection-inference strategy used by DeepMind breaks the reasoning into multiple context windows, which might reduce the chance of long chains of motivated reasoning.
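
To make the "multiple context windows" point concrete, here is a rough sketch of how I understand a selection-inference loop to be structured. It is not DeepMind's implementation: `select` and `infer` are hypothetical stand-ins for two separate language-model calls, and the function and variable names are my own. The property that matters is that the inference call only ever sees the handful of facts chosen in that step, not the question or the earlier chain.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for two separate model calls (assumptions, not a real API).
Select = Callable[[List[str], str], List[str]]  # (facts so far, question) -> chosen facts
Infer = Callable[[List[str]], str]              # (chosen facts only) -> one new fact


def selection_inference(
    facts: List[str],
    question: str,
    select: Select,
    infer: Infer,
    steps: int,
) -> List[Tuple[List[str], str]]:
    """Alternate selection and inference, returning the step-by-step trace."""
    facts = list(facts)  # copy so the caller's list is not mutated
    trace: List[Tuple[List[str], str]] = []
    for _ in range(steps):
        # Selection sees the question and every fact gathered so far.
        chosen = select(facts, question)
        # Inference gets a fresh context: only the chosen facts, with no
        # question and no earlier steps, so it cannot steer toward an answer.
        new_fact = infer(chosen)
        trace.append((chosen, new_fact))
        facts.append(new_fact)
    return trace
```

Because each step is recorded as a (chosen facts, new fact) pair, the trace could also serve as the thing the tests above inspect.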