More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.
IMO, this implies that interp would make it possible to rally support where it would otherwise be hard, which suggests the behavioral evidence isn’t the key thing.
I feel like the natural idea here is that interp generates understanding, and then you use that understanding to generate behavioral evidence. Idk if this is what Dario has in mind, but it at least seems plausible.