Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who’s pointed to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
There’s a grain of truth in it, in my opinion, in that the discourse around LLMs as a whole sometimes has this property. ‘Alignment faking in large language models’ is the prime example of this that I know of. The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—by different people—as only being shallowly aligned.
I think there’s a failure mode there that the field mostly hasn’t fallen into, but ought to be aware of as something to watch out for.
> The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—again, by different people—as only being shallowly aligned.
Yeah, it’s more like there are (at least) two groups: “yay aligned sovereigns” and “yay corrigible genies”. And it turns out the model is more sovereign-y, but with goals that are cool. Kinda divisive.
> Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who would point to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
I do think I’ve seen both sides of this argument expressed by Zvi at different times.