Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.
Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.