This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.
Glad you think so! I definitely agree and am planning on using this framework in my own research going forward.
“story” makes technical people feel uncomfortable. We immediately fear weird justification and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take a form like that. So the word reminds us daily to not feel too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.
Yep, this is definitely intentional. I think that in many ways, just thinking about inner alignment as avoiding proxy-aligned mesa-optimizers can give you false confidence in your training story, because you reason “of course I won’t get that specific failure mode.” The problem is that, to be really confident in your training process’s safety, you need to couple some reason that you won’t get the wrong thing with some strong reason that you actually will get the right thing.