aryaj comments on Test your interpretability techniques by de-censoring Chinese models

aryaj 15 Jan 2026 23:43 UTC
10 points
2
Thank you!
1. We are currently working on this and seeing some interesting results :)
2. This would be a cool future direction! Both coming up with better consistency judges and optimizing the models against them would be very helpful. We find the models usually go too deep into irrelevant details or reward-hack their conclusions.
- Oliver Daniels 16 Jan 2026 2:34 UTC
  1 point
  0
  Nice, looks promising!