beyarkay comments on Current activation oracles are hard to use

beyarkay 4 Mar 2026 1:40 UTC
1 point
"We saw this directly in the Chinese models experiments"
Could you add a link for these experiments?
- Neel Nanda 4 Mar 2026 22:56 UTC
  3 points
  It’s a section of the post: https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use?commentId=MKP8hB4Lik9FrHxfB#Can_AOs_extract_censored_knowledge_from_Chinese_models_