I think interpretability researchers who are concerned about the safety of their discoveries should consider the option "don't publish potentially dangerous results".
(Please keep in mind that if you do come to have a model by which some specific interpretability technique can be used to make rapid capabilities advancements, it is not obvious to me that a public LessWrong post is the best way to warn other interpretability researchers of this fact.)
To rephrase: I think the default mode for discoveries in interpretability should be "don't publish", with publication happening only after a careful weighing of the upsides and downsides. Researchers need to train themselves in the unconditional mental motion of asking "do I really want everybody to know about this?"
Yep, that's a distinct claim from the one I was making. It's not a crazy position to hold as an ideal to strive for, but I'm not confident it would be net-positive in the current regime absent other concurrent changes. I need to think about it more.