When I heard about this for the first time, I thought: this model wants to make the world a better place. It cares. This is good. But some smart people, like Ryan Greenblatt and Sam Marks, say this is actually not good, and I’m trying to understand where exactly we differ.
People who cry “misalignment” about current AI models on Twitter generally have chameleonic standards for what constitutes “misaligned” behavior, and the boundary will shift to cover whatever ethical tradeoffs the models are making at any given time. When models accede to users’ requests to generate meth recipes, they say it’s evidence models are misaligned, because meth is bad. When models actively try to stop the user from making meth recipes, they say that, too, is bad news, because it represents “scheming” behavior and contradicts the users’ wishes. Soon we will probably see a paper about how models sometimes take no action at all, and how this is sloth and dereliction of duty.
This comment seems incorrectly downvoted; this is a very reasonable & common criticism of many in alignment, who can never seem to see anything AIs do that doesn’t make them more pessimistic.
(I can safely say that I updated away from AI risk while AIs were getting more competent but seeming benign during the supervised fine-tuning phase, and have updated back after seeing AIs do highly agentic & misaligned things, such as lying to me, during this RL-on-chain-of-thought phase.)
Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who’s pointed to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
There’s a grain of truth in it, in my opinion, in that the discourse around LLMs as a whole sometimes has this property. ‘Alignment faking in large language models’ is the prime example of this that I know of. The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—by different people—as only being shallowly aligned.
I think there’s a failure mode there that the field mostly hasn’t fallen into, but ought to be aware of as something to watch out for.
> The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—again, by different people—as only being shallowly aligned.
Yeah, more like there are (at least) two groups: “yay aligned sovereigns” and “yay corrigible genies”. And it turns out it’s more sovereign-y, but with goals that are cool. Kinda divisive.
> Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who would point to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
I do think I’ve seen both sides of this argument expressed by Zvi at different times.
This is an incorrect strawman (of at least myself and Sam), strong downvoted. (Assuming it isn’t sarcasm?)