Yeah, it seems like a length penalty would likely fix vestigial reasoning! (Although technically, this would be a form of process supervision.) I mentioned this in footnote #1 in case you hadn’t already seen it.
I believe @Daniel Kokotajlo expressed somewhere that he thinks we should avoid using a length penalty, though I can’t find this comment now. But it would be good to test empirically how much a length penalty increases steganography in practice. Maybe paraphrasing each sentence of the CoT during training would be good enough to prevent this.
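For concreteness, here’s a rough sketch of what that experiment might look like (everything below is hypothetical and just illustrative: a per-token length penalty folded into the reward, plus sentence-level paraphrasing of the CoT before it gets reinforced):

```python
import re

def shaped_reward(task_reward: float, cot: str, penalty_per_token: float = 0.001) -> float:
    """Toy reward shaping: subtract a small penalty per CoT token.

    This is the kind of length penalty being discussed; too large a penalty
    pushes toward empty CoTs, too small and vestigial reasoning persists.
    """
    num_tokens = len(cot.split())  # crude whitespace "tokenization", for illustration only
    return task_reward - penalty_per_token * num_tokens


def paraphrase_cot(cot: str, paraphrase_fn) -> str:
    """Paraphrase each sentence of the CoT independently (e.g. via a separate
    model call) before it is used in training, to disrupt any token-level
    steganographic encoding."""
    sentences = re.split(r"(?<=[.!?])\s+", cot)
    return " ".join(paraphrase_fn(s) for s in sentences if s.strip())
```

Here `paraphrase_fn` just stands in for whatever paraphrasing model you’d use; the interesting empirical question is how much steganography survives that step at various penalty strengths.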
Arguably, vestigial reasoning is actually useful! For instance, if we added a length penalty in the loan application setting, RL would probably eventually settle on writing no CoT at all, which gives us no information. However, without a length penalty, there’s a CoT that provides some unreliable but possibly-useful information about what the model historically would have thought about for any given prompt. This can be useful for the same reasons that biologists find ancient fossils useful even if they aren’t “faithful” to modern-day animals.
Was the model in Anthropic’s experiments tested OOD? I thought it was both trained and tested in the same environment, with a leaked hint in its context window.
Daniel’s argument against a length penalty is from this doc:
We want our models to learn to blather and babble freely, rather than thinking carefully about how to choose their words. Because if instead they are routinely thinking carefully about how to choose their words, that cognition might end up executing strategies like “use word X instead of Y, since that’ll avoid suspicion.” So, let’s try to avoid incentivizing brevity.
There’s also a comment by Lukas Finnveden that argues in favor of a length penalty:
downside: more words gives more opportunity for steganography. You can have a much lower bit-rate-per-word and still accomplish the same tasks.
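(To make the bit-rate point concrete with made-up numbers: smuggling a 64-bit payload through a 1,000-word CoT only requires about 0.06 bits per word, whereas a 100-word CoT would need about 0.6 bits per word, so verbosity makes low-detectability encodings much easier.)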