Oliver Daniels comments on Test your interpretability techniques by de-censoring Chinese models

Oliver Daniels 15 Jan 2026 23:07 UTC
3 points
0
Really cool work. Two thoughts:
1. Someone should test activation oracles on these tasks.
2. Have you thought about optimizing the investigator agent against the consistency judge (or other, possibly more robust, unsupervised metrics)?
- aryaj 15 Jan 2026 23:43 UTC
  10 points
  2
  Thank you!
  1. We are currently working on this and seeing some interesting results :)
  2. This would be a cool future direction! Both coming up with better consistency judges and optimizing the models against them would be very helpful. We find the models usually go too deep into irrelevant details or reward hack their conclusions.
  - Oliver Daniels 16 Jan 2026 2:34 UTC
    1 point
    0
    nice, looks promising!