ryan_greenblatt comments on To be legible, evidence of misalignment probably has to be behavioral

ryan_greenblatt 16 Apr 2025 15:45 UTC
6 points
I’m not claiming that internals-based techniques aren’t useful, just that they probably aren’t that useful specifically for producing legible evidence of misalignment. Detecting misalignment with internals-based techniques could be useful for other reasons (which I list in the post), and internals-based techniques could be used for applications other than detecting misalignment (e.g. better understanding some misaligned behavior).

If internals-based techniques are useful for further investigating misalignment, that seems good. And I think I agree that if we first find legible evidence of misalignment behaviorally and internals-based methods pick this up (without known false positives), then this will make future evidence from internals-based techniques more convincing. However, I think it might not end up being that much more convincing in practice unless this happens many times with misalignment that occurs in production models.