The model should just refuse if it’s ambiguous. Refusal is always fair game.
The way I’d disambiguate who the user is: specify that the AI should obey instructions (per the instruction hierarchy given in the prompt).
More specifically: I’d say that AIs should obey the instructions given to them or refuse (or output a partial refusal). I agree that obeying instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But in cases where an AI’s consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherent/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions, except refusing.)