When Gemini randomly told an innocent user to go kill himself, it made the news, but the news didn’t really change much in the big picture.
It’s possible that relevant decision-makers don’t care that much about dramatic bad behaviours, since the vibe is “oh yeah, AI glitches up, oh well.”
It’s possible that relevant decision-makers do care more about what the top experts believe, and if the top experts are convinced that current models already want to kill you (but can’t), it may have an effect. Imagine if many top experts agree that “the lie detectors start blaring like crazy when the AI is explaining how it won’t kill all humans even if it can get away with it.”
I’m not directly disagreeing with this post, I’m just saying there exists this possible world model where behavioural evidence isn’t much stronger (than other misalignment evidence).
@Knight Lee This is precisely one of the incidents I’ve seen policy people (in Europe) refer to when arguing why GenAI providers need to be held accountable for misbehaviours like this. It is sad that this example inspired regulatory actions in other jurisdictions, but not where the incident happened...
Oops. Maybe this kind of news does affect decision-makers and I was wrong. I was just guessing that it had little effect, since… I’m not even sure why I thought so.
I did a Google search and it didn’t look like the kind of news that governments responded to.
No, you’re right! It is just a policy/AI safety advocacy argument, but one that does change minds and shape decisions. I guess it’s not as visible as it should be. Still, glad you brought this up!