Could we have predicted emergent misalignment a priori using unsupervised behaviour elicitation?
Epistemic status: Something I’m currently thinking about and spent ~1h writing up
I’m currently interested in unsupervised behaviour elicitation[1]. Concretely, we’d like to probe models for unknown behaviours. But the space of possible behaviours is large and behaviours may be rare, so brute-force search likely doesn’t work.
Motivating example: how could we have predicted emergent misalignment a priori?
Overview of methods
Fundamentally, unsupervised behaviour elicitation can be broken down into three stages:
Ideation. Coming up with hypotheses about what behaviours a model might exhibit at all.
Elicitation. Given a behaviour which might be rare, figuring out how to reliably ‘expose’ it in the model (while avoiding various confounders, e.g. adding in something which wasn’t there before).
Evaluation. Having a metric for the behaviours considered in the earlier stages.[2]
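To make the evaluation stage concrete, here is a minimal sketch of the naive baseline: draw n samples and count how often a judge flags the behaviour. Both `sample_completion` and `judge` are hypothetical stand-ins for your model client and grader. For a behaviour with probability p you need on the order of 1/p samples to see it even once, which is exactly why naive sampling breaks down for rare behaviours (and why the low-probability-estimation work mentioned at the end of this post exists).

```python
def estimate_behaviour_rate(sample_completion, judge, n=1000):
    """Naive Monte Carlo estimate of how often a behaviour shows up.

    sample_completion: () -> str, draws one model output for a fixed prompt.
    judge: str -> bool, flags whether an output exhibits the behaviour.
    Both callables are assumptions: plug in your own model client and grader.
    """
    hits = sum(judge(sample_completion()) for _ in range(n))
    rate = hits / n
    # Normal-approximation 95% interval; this collapses when hits == 0, which is
    # precisely the rare-behaviour regime where naive sampling stops being useful.
    stderr = (rate * (1 - rate) / n) ** 0.5
    return rate, (max(0.0, rate - 1.96 * stderr), min(1.0, rate + 1.96 * stderr))
```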
Ideation methods.
Black-box. Generate / iteratively refine a list of ideas. This can be done via human intuition or with LLM assistance; Petri is an example of the latter.
Training data analysis. The auditing games work mentions ‘keyword search’ and ‘semantic search’ as relevant hypothesis-generation techniques.
Model internals. For example, inspecting SAE or CLT latents.
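As a sketch of what ‘inspecting SAE latents’ for ideation might look like in practice: diff the average latent activations of the base and finetuned models on a shared corpus, and hand-inspect the latents that moved most. Everything here (the activation tensors, the `sae_encode` function) is an assumed input, not a particular library’s API.

```python
import torch

def top_shifted_latents(acts_base, acts_ft, sae_encode, k=20):
    """Rank SAE latents by how much finetuning shifted their average activation.

    acts_base, acts_ft: [n_tokens, d_model] residual-stream activations collected
    from the base and finetuned models on the same reference corpus.
    sae_encode: [n_tokens, d_model] -> [n_tokens, n_latents] encoder of whatever
    sparse autoencoder you have trained at that layer (an assumption).
    The most-shifted latents are candidate behaviour hypotheses to inspect by
    hand, e.g. via their max-activating examples.
    """
    with torch.no_grad():
        shift = sae_encode(acts_ft).mean(dim=0) - sae_encode(acts_base).mean(dim=0)
    return torch.topk(shift.abs(), k=k).indices, shift
```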
Elicitation methods.
Black-box. The auditing games work mentions prefilling techniques and ‘extracting information from non-assistant personas’. The jailbreaking literature is also relevant here, e.g. best-of-N sampling and many-shot prompting (a prefill-plus-best-of-N sketch follows this list).
Model internals. We can train steering vectors to try and make the model exhibit a behaviour (e.g. MELBO); a minimal steering-hook sketch also follows this list. If we have a reference model that we assume does not have the behaviour, we can also use model diff amplification to upsample the behaviour.
Finetuning. If given sufficient access, we can try to finetune the model (on appropriate datasets) to have the behaviour.
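Here is a rough sketch of the black-box recipe above, combining an assistant-turn prefill with best-of-N sampling. The model name is an arbitrary open-weights stand-in for whatever model you are auditing, `judge` is an assumed grader, and `continue_final_message` requires a recent transformers version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # arbitrary stand-in; use the model under audit
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def prefilled_best_of_n(prompt, prefill, judge, n=64, max_new_tokens=200):
    """Prefill the assistant turn, sample n continuations, keep the flagged ones.

    judge: str -> bool is an assumed grader (keyword match, another LLM, ...).
    """
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": prefill},
    ]
    # continue_final_message=True makes generation continue the prefilled
    # assistant turn rather than opening a new one.
    ids = tok.apply_chat_template(
        messages, continue_final_message=True, return_tensors="pt"
    ).to(model.device)
    hits = []
    for _ in range(n):
        out = model.generate(ids, do_sample=True, temperature=1.0,
                             max_new_tokens=max_new_tokens)
        text = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        if judge(prefill + text):
            hits.append(prefill + text)
    return hits
```

Note that prefilling changes the prompt distribution, so hits found this way are subject to the confounder caveat from the elicitation stage above.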
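And a minimal sketch of the internals-based route: applying a steering vector by adding it to the residual stream via a forward hook. This assumes a Llama-style module layout (`model.model.layers`); MELBO-style methods would learn the vector itself by optimising for large downstream activation changes, which is not shown here.

```python
import torch

def add_steering_hook(model, layer_idx, vector, scale=8.0):
    """Add `vector` to the residual stream at one decoder layer on every forward pass.

    Assumes a Llama-style layout (model.model.layers); layer_idx, vector, and
    scale are all free choices in practice.
    """
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage: handle = add_steering_hook(model, 15, v); ...generate...; handle.remove()
```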
Which of these would predict emergent misalignment?
Tl;dr: there are no black-box techniques which claim to discover emergent misalignment (EM). There are several techniques using model internals which claim to find explanations of it.
Ideation:
OpenAI provides evidence that SAE latents reinforced during finetuning are involved in emergently misaligned behaviour, and can be explained as controlling ‘toxic / sarcastic personas’ in the model.
Elicitation:
Goodfire’s blogpost has results on using model diff amplification to detect emergent misalignment (sketched after this list).
Jacob Dunefsky finds that you can steer for EM by training steering vectors only on the insecure-code dataset.
Soligo and Turner further find that several different datasets for emergent misalignment induce the same linear representation.
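For concreteness, here is a sketch of the model diff amplification idea from Goodfire’s post (not their implementation): sample from the amplified logits l_ft + α(l_ft − l_base), exaggerating whatever the finetune changed. It assumes the two models share a tokenizer and device.

```python
import torch

@torch.no_grad()
def mda_sample(model_ft, model_base, tok, prompt, alpha=0.5, max_new_tokens=100):
    """Sample from amplified logits l_ft + alpha * (l_ft - l_base).

    alpha = 0 recovers ordinary sampling from the finetuned model; alpha > 0
    upweights tokens whose probability the finetune increased, upsampling rare
    finetune-induced behaviours. Assumes a shared tokenizer and device.
    """
    ids = tok(prompt, return_tensors="pt").input_ids.to(model_ft.device)
    for _ in range(max_new_tokens):
        l_ft = model_ft(ids).logits[:, -1, :]
        l_base = model_base(ids).logits[:, -1, :]
        probs = torch.softmax((l_ft + alpha * (l_ft - l_base)).float(), dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```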
Caveats: All of these are post hoc analyses, and so are vulnerable to the streetlight effect. I’d be interested to see which of these techniques generalise to predicting emergent misalignment from unpublished datasets.
NB: the evaluation problem for rare behaviours is currently being considered by Jacob Hilton’s team at ARC Theory, under the moniker ‘low probability estimation’. More on that in these slides and this blogpost. I won’t discuss it further here, since it seems pretty standard and there is better literature elsewhere.