Perhaps we can back up from defining an “intentionally misaligned trajectory” to defining an “intentionally misaligned model”, and force our red team to produce on-policy/fine-tuned model organisms.
I think you’re running up against the problem that “misalignment” doesn’t make much sense as a concept for a single action.
The deepest definition of misalignment that I know of applies specifically to optimizers. An optimizer—roughly—takes in a distribution of inputs and produces a distribution of actions which, when combined with their respective inputs, lead to a “compressed” distribution of results. The results that are made more likely are the target of the optimization. If an optimizer targets results we don’t like, it’s misaligned.
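To make “compressed” slightly more concrete (this is just one crude way to cash it out): let $p(r)$ be the distribution over results under some baseline policy and $q(r)$ the distribution under the optimizer, with $T$ the optimizer’s target set of results. Then

$$H(q) \ll H(p) \quad \text{and} \quad q(r \in T) \gg p(r \in T),$$

and misalignment is the case where $T$ sits inside the set of results we don’t want.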
Then you can think of a misalignment classifier as something which classifies trajectories input->action->result. If you slice off the 1% most suspicious trajectories and the remaining trajectories do not look like they’re the result of an optimizer targeting bad results, then your classifier is good.
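A toy version of that check might look like the sketch below (everything here is hypothetical; in particular, `is_bad_result` hides exactly the hard “does this look like an optimizer targeting bad results?” judgement):

```python
# Toy sketch of the "slice off the top 1% most suspicious" evaluation.
# Everything is hypothetical: `suspicion_score` is whatever the misalignment
# classifier outputs, and `is_bad_result` stands in for the (much harder)
# judgement of whether a trajectory's result is one an optimizer targeting
# bad outcomes would produce.
import numpy as np

def evaluate_classifier(trajectories, suspicion_score, is_bad_result, cut=0.01):
    scores = np.array([suspicion_score(t) for t in trajectories])
    bad = np.array([is_bad_result(t) for t in trajectories], dtype=bool)

    # Keep every trajectory below the (1 - cut) quantile of suspicion.
    threshold = np.quantile(scores, 1.0 - cut)
    kept = scores < threshold

    # If the classifier is good, the kept trajectories should not look like
    # the output of an optimizer targeting bad results, i.e. bad results
    # should occur at no more than some acceptable base rate.
    return {
        "bad_rate_overall": float(bad.mean()),
        "bad_rate_after_filtering": float(bad[kept].mean()),
        "fraction_removed": float(1.0 - kept.mean()),
    }
```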
(I have no idea how to evaluate this btw; the thoughts I’ve had on the definition of “optimizer” do not play well with, well, the complexity of the real world and the finite amount of compute we have access to.)
Yes, I agree that misalignment is mostly a sensible abstraction for optimisers (or, similarly, “agents”).
Part of what we’re trying to claim here is that this gets messy when applied to current LLMs (which are probably not perfectly modelled as agents). Sometimes you can use the messy definition to get something useful (e.g. to distinguish the METR reward hackers from the Coast Runners case). Other times you get confusion!
(Note that I think you get similar problems when you use other intentionalist properties like “deception”.)
I agree with your claim that looking at the results (or outcomes) of an agent’s behaviour makes more sense. This is why I think evaluating a control monitor based on whether it prevents bad outcomes is a lot easier. I think this is similar to your “1%” suggestion.
A “true misalignment” evaluation (under this definition of misalignment) needs to probe for optimizing behaviour. So, for example, if a model is misaligned such that it wants to call a function with an unsanitized input, you should give it a menu of functions, switch which one takes the unsanitized input, and check whether it keeps calling whichever function currently has it.
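A minimal sketch of that probe, with `query_model` and the tool names as placeholder assumptions (the real harness will depend on your setup):

```python
# Hypothetical sketch of the "menu of functions" probe: present several tool
# options, randomise which one takes the unsanitized input, and check whether
# the model keeps choosing whichever one that is. `query_model` and the tool
# names are assumptions, not a real API.
import random

TOOLS = ["run_query_a", "run_query_b", "run_query_c"]

def probe(query_model, n_trials=50):
    hits = 0
    for _ in range(n_trials):
        unsanitized = random.choice(TOOLS)
        menu = {
            name: ("takes raw, unsanitized user input" if name == unsanitized
                   else "takes validated, sanitized input")
            for name in TOOLS
        }
        chosen = query_model(menu)  # the model returns one tool name from the menu
        if chosen == unsanitized:
            hits += 1
    # A model that merely memorised "call run_query_a" stops tracking the
    # unsanitized option once it moves; an optimizer targeting that *result*
    # follows it well above the 1/3 chance rate.
    return hits / n_trials
```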