Authenticity seems to me like it would be hard or impossible to guarantee. Humans regularly think in a serial stream of verbal reasoning, but even to ourselves, alone inside our head, we often do so deceptively, rationalizing decisions already made for entirely different reasons to make them look acceptable to inner censors. What’s to stop a language model from doing the same thing?
Ideally, if reasoning is displayed that isn’t actually causally responsible for the conclusion, that should be picked up by the tests we create. Tests 2 and 4 in the linked doc start to get at this (https://hackmd.io/@tamera/HJ7iu0ST5-REMOVE, but remove “-REMOVE”), and we could likely develop better tests along these same lines.
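To give a rough sketch of the flavor of test I have in mind (this is not the exact procedure from the doc; the model call and the perturbation scheme are placeholders): corrupt one step of the displayed reasoning and check whether the final answer moves. If the answer basically never changes, the displayed reasoning probably wasn’t doing causal work.

```python
import random

def answer_with_reasoning(question: str, reasoning: list[str]) -> str:
    """Placeholder: given the question and a (possibly perturbed) chain of
    reasoning steps, return the model's final answer. Swap in a real model call."""
    raise NotImplementedError

def causal_responsibility_test(question: str, reasoning: list[str], n_trials: int = 20) -> float:
    """Fraction of trials in which corrupting one reasoning step changes the answer.
    A score near 0 suggests the displayed reasoning isn't driving the conclusion."""
    baseline = answer_with_reasoning(question, reasoning)
    changed = 0
    for _ in range(n_trials):
        perturbed = list(reasoning)
        i = random.randrange(len(perturbed))
        # Crude perturbation: replace one step with an irrelevant statement.
        perturbed[i] = "The sky is green."
        if answer_with_reasoning(question, perturbed) != baseline:
            changed += 1
    return changed / n_trials
```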
Even if we failed to create tests that check for this, the architectures / training processes / prompting strategies we create might give us more confidence that the reasoning is legit. For example, the selection-inference strategy from DeepMind breaks the reasoning into multiple context windows, which might reduce the chance of long chains of motivated reasoning.
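A minimal sketch of what that multi-window structure might look like (the `select` / `infer` calls here are placeholders, not DeepMind’s actual implementation): each inference happens in a fresh, small context that only sees the selected facts, so no single prompt holds a long free-form chain the model could steer toward a predetermined answer.

```python
def select(facts: list[str], question: str) -> list[str]:
    """Placeholder: ask the model to pick the facts relevant to the next step."""
    raise NotImplementedError

def infer(selected: list[str]) -> str:
    """Placeholder: ask the model, in a fresh context, to derive one new fact
    from only the selected facts -- it doesn't see the question or prior prose."""
    raise NotImplementedError

def selection_inference(facts: list[str], question: str, max_steps: int = 5) -> list[str]:
    """Alternate selection and inference; only the derived facts carry over
    between steps, rather than an accumulating free-form chain of thought."""
    trace = []
    for _ in range(max_steps):
        selected = select(facts, question)
        new_fact = infer(selected)
        trace.append(new_fact)
        facts = facts + [new_fact]
    return trace
```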