I think your main point is probably right but was not well argued here. As written, the argument is essentially a vibe-level claim of "nah, they probably won't find this evidence compelling".
You could also make an argument from past examples where there has been large-scale action to address risks in the world, and look at what evidence drove it (e.g., the banning of CFCs, climate change more broadly, tobacco regulation, etc.).
You could also make an argument from existing evidence around AI misbehavior and how it's being dealt with, where (IMO) evidence much stronger than internals basically doesn't seem to affect the public conversation outside the safety community (or even much within it).
I think it’s also worth saying a thing very directly: just because non-behavioral evidence isn’t likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs. Buck’s previous post and many others discuss the rough epistemic situation when it comes to detecting misalignment. Internals evidence is going to be one of the tools in the toolkit, and it will be worth keeping in mind.
Another thing worth saying: if you think scheming is plausible, and you think it will be difficult to update against scheming from behavioral evidence (Buck’s post), and you think non-behavioral evidence is not likely to be widely convincing (this post), then the situation looks really rough.
I think your main point is probably right but was not well argued here.
Fair, I thought that an example would make this sufficiently obvious that it wasn't worth arguing for at length, but I should have spelled it out a bit more.
I think it’s also worth saying a thing very directly: just because non-behavioral evidence isn’t likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs.
FWIW, I do say this under “These techniques could be quite useful via two mechanisms:”.