High-level strategy “as primarily a bet on creating new affordances upon which new alignment techniques can be built”.
Makes sense, but I think this is not the optimal resource allocation. I explain why below:
(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of “indefinite, or at least very long, pause on AI progress”. If that’s your position, I wish you had instead written a post titled “against almost every theory of impact of alignment” or something like that.)
Yes, the pause is my secondary goal (edit: conditional on no significant alignment progress; otherwise, smart scaling and regulations are my priorities). My primary goal remains coordination and safety culture. I believe that one of the main pivotal processes runs through governance and coordination. A quote that explains my reasoning well is the following:
“That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die—there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post).”
That’s why I really appreciate Dan Hendrycks’s work on coordination. And I think DeepMind and OpenAI could make a huge contribution by doing technical work that is useful for governance. We talked a bit at EAG, and I understood that there’s something like a numerus clausus in DeepMind’s safety team. In that case, since interpretability doesn’t require much compute or prestige, and since DeepMind has a very high level of prestige, you should use that prestige to write papers that help with coordination. Interpretability could be done outside the labs.
For example, some of your papers, like Model evaluation for extreme risks or Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, are great for this purpose!