CEO at Redwood Research.
AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it’s kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I’ll probably be willing to call to discuss briefly.
I’m not aware of anything written before this post on what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I’m a little sad that there hasn’t been much research following up on this. I’d like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AIs behaving badly, and research on few-shot catastrophe detection techniques.