I don’t know why this is a critique of RLHF. You can use RLHF to train a model to ask you questions when it’s confused about your preferences. To the extent that you can easily identify when a system has this behavior and when it doesn’t, you can use RLHF to produce this behavior just like you can use RLHF to produce many other desired behaviors.
I dunno, I’d agree with L.A. Paul here. There’s a difference between cases where you’re not sure whether the answer is A, or B, or C, and cases where A, B, and C are all valid outcomes, and you’re doing something more like calling one of them into existence by picking it.
The first cases are times when the AI doesn’t know what’s right, but to the human it’s obvious and uncomplicated which is right. The second cases are where human preferences are underdetermined—where there are multiple ways we could be in the future that are all acceptably compatible with how we’ve been up till now.
I think models that treat the thing they’re learning as entirely the first sort of thing are going to do fine on the obvious and uncomplicated questions, but would learn to resolve questions of the second type using processes we wouldn’t approve of.
This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you’re interested, here’s a good talk.
Yes, I was deliberately phrasing things sort of like transformative experiences :P
That’s fair. I think it’s a critique of RLHF as it is currently done (just get lots of preferences over outputs and train your model). I don’t think just asking you questions “when it’s confused” is sufficient; it also has to know when to be confused. But RLHF is a pretty general framework, so you could theoretically expose a model to lots of black swan events (not just mildly OOD events) and make sure it reacts to them appropriately (or asks questions). But as far as I know, that’s not research that’s currently happening (though there might be something I’m not aware of).
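To make the “reward asking when out of distribution” idea concrete, here’s a minimal toy sketch. Everything in it is hypothetical: the embeddings, the distance-from-centroid novelty proxy, the threshold, and the reward numbers are stand-ins for whatever signals a real RLHF pipeline would use, not an actual implementation.

```python
# Toy sketch of reward shaping that prefers clarifying questions on
# novel (black-swan-like) prompts. All names and numbers hypothetical.

import math

def ood_score(prompt_embedding, train_centroid):
    """Distance from the training-data centroid as a crude proxy for
    how novel a prompt is (a real system might use ensemble
    disagreement, likelihood under the base model, etc.)."""
    return math.dist(prompt_embedding, train_centroid)

def reward(prompt_embedding, train_centroid, asked_clarifying_question,
           answer_quality, ood_threshold=1.0):
    """On familiar prompts, reward answer quality; on novel prompts,
    reward asking a question instead of guessing confidently."""
    novelty = ood_score(prompt_embedding, train_centroid)
    if novelty > ood_threshold:
        # Black-swan regime: deferring/asking is the desired behavior.
        return 1.0 if asked_clarifying_question else -1.0
    # Familiar regime: needless questions are mildly penalized.
    return answer_quality - (0.2 if asked_clarifying_question else 0.0)

# In-distribution prompt: a good direct answer scores well.
print(reward([0.1, 0.0], [0.0, 0.0], False, answer_quality=0.9))  # 0.9
# Far-OOD prompt: asking is rewarded, confident answering is not.
print(reward([5.0, 5.0], [0.0, 0.0], True, answer_quality=0.9))   # 1.0
```

The point of the sketch is the shape of the objective, not the mechanism: the hard part the comment identifies (knowing *when* to be confused) is hidden inside `ood_score`, and a naive proxy like this one would fail on exactly the subtle, underdetermined cases discussed above.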
I don’t mean to distract from your overall point, though, which I take to be “a philosopher said a smart thing about AI alignment despite not having much exposure.” That’s useful data.