TurnTrout comments on Paper: On measuring situational awareness in LLMs

TurnTrout 4 Sep 2023 18:21 UTC
LW: 13 AF: 6
(Also, all AI-doom content should maybe be expunged as well, since “AI alignment is so hard” might become a self-fulfilling prophecy via sophisticated out-of-context reasoning baked in by pretraining.)
- Aaron_Scher 4 Sep 2023 23:05 UTC
  10 points
  On the other hand, the difficulty of alignment is something we may want all AIs to know, so that they don’t build misaligned AGI (either autonomously or when directed to by a user). I want aligned AIs not to help users build AGI without a good alignment plan (nuance + details needed), and I want potentially misaligned AIs that are trying to self-improve not to build misaligned-to-them successors that kill everybody. Both desiderata might benefit from all AIs believing alignment is very difficult. Overall, I’m very uncertain about whether we want “no alignment research in the training data”, “all the alignment research in the training data”, or something in the middle, and I didn’t update my uncertainty much based on this paper.