Nice post! I’m interested to hear more about how your methodology differs from others. Does this breakdown seem roughly right?
1. Naive AI Alignment: We are satisfied with an alignment scheme as long as we can tell a story about how it works. (This is what I expect to happen in practice at many AI labs.)
2. Typical-Case AI Alignment: We aren’t satisfied until we have tried hard to think of ways our scheme could fail, and failure still doesn’t seem like the most likely outcome. (This is what I expect the better sort of AI labs, the ones with big well-respected safety teams, will do.)
3. Worst-Case AI Alignment: We aren’t satisfied until we try hard to think of ways our scheme could fail, and can’t think of anything plausible. (This is your methodology, right?)
4. Ordinary Paranoia: We aren’t satisfied until we try hard to think of ways our scheme could fail, and can’t think of anything logically and physically possible. (Maybe this isn’t importantly different from #3? See below.)
5. Security Mindset: As with Ordinary Paranoia, except that we additionally aren’t satisfied until we can write an argument in premise-conclusion form for why our scheme won’t fail, such that the premises don’t contain value-laden concepts, are fairly concrete/detailed, and each seems highly likely to be true. (This is what I think MIRI advocates? But I think I see shades of it in your methodology too.)
Second question: What counts as plausible? What does it mean for a story to contradict something we know to be true? The looser our standards for plausibility, the more your methodology ends up looking like Ordinary Paranoia. The stricter our standards for plausibility, the more it ends up looking like Typical-Case AI Alignment.
I don’t really think of 3 and 4 as very different; there’s definitely a spectrum regarding “plausible,” and I don’t think we need to draw the line firmly. It’s OK if over time your “most plausible” failure mode becomes increasingly implausible; the goal is just to make it obviously completely implausible. I think 5 is a further step (not a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it as: after you’ve been trying for a while to come up with a failure story, you can start thinking about why failure stories seem impossible and try to write an argument that there can’t be any failure story...
Thanks!