We’re fans of the Consistency Lens and shout it out in our acknowledgements!
Subhash Kantamneni
Yeah, we present a variety of such transformations in our steganography experiments, in the subsection Measuring behavioral properties of NLAs. It seems reasonable to explore adding these to the training loop as a preventative measure.
I agree. I generally like the setting of multihop reasoning, and it’s one of the first settings we looked at to build confidence that NLAs are doing something reasonable. For instance, this result used an early version of NLAs on Haiku 3.5.
It’s worth noting that NLAs struggle with numbers (they’re the type of specific detail that gets confabulated). But I’m a little confused: the capital of France is Paris, right? So the answer should be (8-5)*2=6, and we shouldn’t see 13 as an intermediate? In general, we might also need to run the NLA on more token positions.
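To spell out the arithmetic, here is a quick sketch. It assumes the 5 comes from the letter count of "Paris", which is my reading of the prompt rather than something stated above; the variable names are just for illustration.

```python
# Assumption: the "5" in (8-5)*2 is the number of letters in "Paris".
capital = "Paris"
n_letters = len(capital)

difference = 8 - n_letters   # first intermediate after the lookup
answer = difference * 2      # final answer

# Every value the computation passes through, in order.
intermediates = [n_letters, difference, answer]
print(intermediates)
print(13 in intermediates)
```

Under that reading, 13 would only show up if the first step were 8 + 5 rather than 8 - 5.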
This is quite clever and cool! How much do you think the success of this technique relies on a model’s predisposition to verbalize awareness? I’m curious how well this would work on model organisms that eval scheme but don’t verbalize awareness. For these models, we might expect activation-based methods to work, but verbalized awareness not to provide much signal. Really great work!