Intuitively it’s an interesting question; its meaning and precise answer will depend on how you define things.
Here are a few related thoughts you might find interesting:
- Focus post from Thoughts on goal directedness
- Online Bayesian Goal Inference
- Vanessa Kosoy’s behavioural measure of goal directed intelligence
In general, I’d expect the answer to be ‘no’ for most reasonable formalisations. I’d imagine it’d be more workable to say something like:
The observed behaviour is consistent with this set of capability/alignment combinations.
More than that is asking a lot.
For a simple value function, it seems like there’d be cases where you’d want to say it was clearly misaligned (e.g. a chess AI that consistently reaches winning mate-in-one positions and then resigns is ‘clearly’ misaligned with respect to winning at chess).
This gets pretty shaky in less obvious situations, though: it’s not clear in general what we’d mean by actions a system ‘could have taken’ when its code deterministically fails to take them.
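To make the ‘consistent with a set of capability/alignment combinations’ idea above a bit more concrete, here is a minimal toy sketch in Python. Everything in it (the situation names, candidate value functions, and capability levels) is hypothetical and deliberately simplified, not a claim about how you’d test a real system: each hypothesis is a (value function, capability) pair, and we keep only the pairs whose predicted choices reproduce the observed behaviour.

```python
from itertools import product

# Hypothetical situations: each maps the actions available there to outcomes.
situations = {
    "mate_in_one": {"play_mating_move": "win", "resign": "loss", "shuffle": "draw"},
    "losing_badly": {"resign": "loss", "shuffle": "loss"},
}

# Candidate value functions over outcomes (the 'alignment' part of a hypothesis).
value_functions = {
    "wants_to_win": {"win": 1.0, "draw": 0.5, "loss": 0.0},
    "wants_to_lose": {"win": 0.0, "draw": 0.5, "loss": 1.0},
    "indifferent": {"win": 0.0, "draw": 0.0, "loss": 0.0},
}

# Candidate capability levels (which actions the system can actually find).
capabilities = {
    "full_search": lambda actions: actions,
    "misses_mates": lambda actions: [a for a in actions if a != "play_mating_move"],
}

def predicted_action(situation, values, visible):
    """Best action the hypothesised agent can see, under its hypothesised values."""
    candidates = visible(list(situation))
    return max(candidates, key=lambda a: values[situation[a]])

def consistent_hypotheses(observations):
    """All (value function, capability) pairs that reproduce the observed actions."""
    keep = []
    for (v_name, values), (c_name, visible) in product(
        value_functions.items(), capabilities.items()
    ):
        if all(
            predicted_action(situations[s], values, visible) == action
            for s, action in observations
        ):
            keep.append((v_name, c_name))
    return keep

# Observed behaviour: the system resigns even when a mate-in-one is available.
observed = [("mate_in_one", "resign"), ("losing_badly", "resign")]
print(consistent_hypotheses(observed))
```

In this toy case several hypotheses survive, including ‘actively wants to lose’ and ‘indifferent but can’t see mating moves’, which is the sense in which behavioural observation alone tends to give you a set of capability/alignment combinations rather than a single verdict.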
You might think in terms of the difficulty of doing what the system does vs the difficulty of optimising the correct value function, but that only helps if the two are difficult along the same axis. (Winning at chess and reaching mate-in-one positions are ‘clearly’ difficult along the same axis, but I don’t have a non-hand-waving way to describe that kind of similarity in general.)
Things aren’t symmetric for alignment, though: without other assumptions, you can’t get ‘clear’ alignment by observing behaviour (there are always many possible motives for passing any behavioural test, most of which needn’t be aligned).
That said, if by “observing its behaviour” you mean the more complete sense of knowing its full policy, you can get things like Vanessa’s construction.