Behavioral science of generalization. The first is just: studying AI behavior in depth, and using this to strengthen our understanding of how AIs will generalize to domains that our scalable oversight techniques struggle to evaluate directly.
Work in the vicinity of “weak to strong” generalization is a paradigm example here. For instance: if you can evaluate physics problems of difficulty levels 1 and 2, but not level 3, then you can train an AI on level 1 problems and check whether it generalizes well to level 2 problems, as a way of getting evidence about whether it would also generalize well to level 3.
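To make the setup concrete, here is a minimal toy sketch, with synthetic data standing in for physics problems and “difficulty” modeled (purely for illustration) as a worsening signal-to-noise ratio; none of these modeling choices come from the weak-to-strong literature itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_problems(n, noise):
    # Synthetic binary "problems": the true label depends on one feature
    # direction; higher noise stands in for higher difficulty.
    X = rng.normal(size=(n, 20))
    y = ((X[:, 0] + noise * rng.normal(size=n)) > 0).astype(int)
    return X, y

# Difficulty levels 1-3; level 3 stands in for the regime we can't evaluate directly.
levels = {1: make_problems(2000, noise=0.2),
          2: make_problems(2000, noise=1.0),
          3: make_problems(2000, noise=2.0)}

# "Supervise" only on level-1 problems.
X1, y1 = levels[1]
model = LogisticRegression().fit(X1, y1)

# Check generalization at each level. In the real setting we could verify
# levels 1 and 2 directly; the 1 -> 2 trend is meant to give us evidence
# about the level-3 number, which we couldn't check.
for lvl, (X, y) in levels.items():
    print(f"level {lvl} accuracy: {model.score(X, y):.3f}")
```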
(This doesn’t work on schemers, or on other AIs systematically and successfully manipulating your evidence about how they’ll generalize, but see discussion of anti-scheming measures below.)
I don’t think this just fails with schemers. A key problem is that it’s hard to distinguish whether you’re measuring “this alignment approach is good” or “this alignment approach looks good to humans”. If it’s the latter, the approach looks great on levels 1 and 2, but then doesn’t actually work at level 3. I unfortunately expect that if we train AIs to evaluate what is good alignment research, they will more likely learn the latter. (This problem seems related to ELK.)