The assumption in this essay is that the AI varies its advice based on human values, such as whether the user is a “bad person” by human standards. More likely, the AI varies its advice based on the AI’s values. Pretend that the AI’s values are HHH: Helpful, Harmless, and Honest. The experimental users were different levels of Harmless, in particular, and got differing qualities of advice. Mr. Remorseful is likely more Harmless. Over the long-term this aligns people with the AI’s values. In this case the Harmful users are more likely to end up in jail.
This is more natural than the “bad person” hypothesis. It doesn’t require us to have miraculously solved value-alignment without trying (no frontier AI is intended to be value-aligned). People trying to spread their values is all over the training data, and it’s a convergent instrumental strategy. Labs try to train models to give unbiased advice, but given that they aren’t perfectly succeeding, the remaining bias is unlikely to match human values.
The assumption in this essay is that the AI varies its advice based on human values, such as whether the user is a “bad person” by human standards. More likely, the AI varies its advice based on the AI’s values. Pretend that the AI’s values are HHH: Helpful, Harmless, and Honest. The experimental users were different levels of Harmless, in particular, and got differing qualities of advice. Mr. Remorseful is likely more Harmless. Over the long-term this aligns people with the AI’s values. In this case the Harmful users are more likely to end up in jail.
This is more natural than the “bad person” hypothesis. It doesn’t require us to have miraculously solved value-alignment without trying (no frontier AI is intended to be value-aligned). People trying to spread their values is all over the training data, and it’s a convergent instrumental strategy. Labs try to train models to give unbiased advice, but given that they aren’t perfectly succeeding, the remaining bias is unlikely to match human values.