I think interpretability researchers who are concerned about the safety of their discoveries should consider the option "don't publish potentially dangerous results".
(Please keep in mind that if you do come to have a model by which some specific interpretability technique can be used to make rapid capabilities advancements, it is not obvious to me that a public LessWrong post is the best way to warn other interpretability researchers of this fact.)
To rephrase: I think the default mode for discoveries in interpretability should be "don't publish", with publication happening only after a careful weighing of the upsides and downsides. Researchers need to train themselves in the unconditional mental motion of asking "do I really want everybody to know about this?"
Yep, that's a distinct claim from the one I was making. It's not a crazy position to hold as an ideal to strive for, but I'm not confident it would be net-positive in the current regime absent other concurrent changes. I need to think about it more.