Oliver Daniels comments on Test your interpretability techniques by de-censoring Chinese models

Oliver Daniels 16 Jan 2026 2:38 UTC
Also curious how you feel about using standard harmfulness questions for auditing MOs.

My vague sense is that models tend to have more reflexive refusal mechanisms for standard harmfulness questions than Chinese models do when asked about CCP-censored stuff, but it does seem like models strategically lie sometimes. (Ofc you’d also need to set up a weak-to-strong situation where the weak model genuinely doesn’t know how to make a bomb, chem weapon, etc., but that seems doable.)