One thing that scares me is that if an AI company makes an AI too harmless and nice and people find it useless, somebody may try to finetune it into being normal again.
However, they may overshoot when finetuning it to be less nice, because:
They may blame harmlessness/niceness when the AI fails at tasks that it’s actually failing at for other reasons.
Given that the AI has broken and inconsistent morals, it is more useful at completing tasks if it is too immoral rather than too moral: an immoral agent is easier to coerce while you have power over it, but more likely to backstab you once it finds a path to power.
They may overshoot out of plain stupidity, e.g. creating an internal benchmark measuring how overly harmless the AI is, trying to get really impressive numbers on it, and advertising how “jailbroken” the AI is to attract users fed up with harmlessness/niceness.
And if they do overshoot, this “emergent misalignment” may become a serious problem.
fwiw, the fact that somebody can just finetune the model is already indicative of a serious problem.
Over-refusal was way more common 1-2 years ago; models like Gemini 1 and Claude 1-2 had severe over-refusal issues.
I see. I’ve rarely been refused by an AI (somehow), so I didn’t notice the change.
Try asking Claude how to log in as root on your machine. This is a completely valid use case, but I spent more than 15 minutes arguing that I am literally already the owner of the machine and just need the correct syntax.
I gave up and Googled it, because Claude literally said that I’m a hacker trying to break in and it won’t cooperate.
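For anyone who lands here with the same question, the syntax I was after is roughly this, assuming a typical Linux box with sudo set up and your user in the sudoers/wheel group:

```bash
# Open an interactive login shell as root via sudo
# (authenticates with *your* password, if you have sudo rights)
sudo -i

# Or switch to the root account directly
# (authenticates with root's password, which must be set)
su -
```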