Do you think this means that if Claude 3.7 Sonnet does in fact use steganography, your earlier experiment wouldn’t have found evidence for it, since you didn’t fine-tune for long enough for the model to learn the scheme?
Of course, this experiment also suggests one should lower one's estimate of the likelihood that Claude 3.7 Sonnet uses steganography at all.
I think this would have been plausible if distilling the original scratchpads had resulted in significantly worse performance than the model I distilled from. But the gap is small, so I think we can exclude this possibility.