I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.
I’m not sure I disagree overall, but the problem seems trickier than you’re describing.
I think it might be relatively hard to pre-commit credibly. Minimally, you might need to make this pre-commitment now and seed it very widely in the training corpus (so it is a credible, hard-to-fake signal). Also, it’s unclear what we can do if AIs consistently say “please don’t train or use me, it’s torture”, but we still need to use AI.