We saw this directly in the Chinese models experiments
Could you add a link for these experiments?
It’s a section of the post: https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use?commentId=MKP8hB4Lik9FrHxfB#Can_AOs_extract_censored_knowledge_from_Chinese_models_