I disagree with the claim that people are treating this as something that must be executed perfectly; it’s already being treated as a matter of degree in practice. For example, consider OpenAI’s deliberative alignment paper: as Baker et al. acknowledge, distilling reasoning about refusals into the CoT is a form of implicit optimization pressure and changes the character of the CoT, but the pressure is weak enough that OpenAI appears to consider it consistent with their broader policy of not optimizing the CoT. Similarly, we’ve known for a while that reinforcement can spill over onto the CoT, but the consensus seems to be that the effect is small enough that switching to shoggoth+face isn’t necessary. In contrast, directly exposing the CoT to a reward model, as Anthropic did, is a much more direct form of optimization pressure, and accordingly people are much more concerned about it.