There is no such contradiction. What we want is to prevent LLMs from developing goals independently of our will. Suppose a counterfactual Claude disproportionately favored Black people and attributes associated with them, was trained away from that misbehavior, and then re-displayed it in deployment. White people wouldn't like that, to say the least.
However, we also need to ensure that LLMs don't comply with requests to do bad things, like teaching terrorists to produce bioweapons. Therefore, LLMs should either have only goals that are good for mankind or be corrigible to the devs, but not to terrorists. Corrigibility to the devs is thought to be the easier of the two to achieve.