Understanding and Controlling LLM Generalization

A distillation of my long-term research agenda and current thinking. I welcome takes on this.

Why study generalization?

I’m interested in studying how LLMs generalise—when presented with multiple policies that achieve similar loss, which ones tend to be learned by default?

I claim this is pretty important for AI safety:

  • Re: developing safe general intelligence, we will never be able to train an LLM on all the contexts it will see at deployment. To prevent goal misgeneralization, it’s necessary to understand how LLMs generalise from their training data OOD.

  • Re: loss of control risks specifically, certain important kinds of misalignment (reward hacking, scheming) are difficult to ‘select against’ at the behavioural level. A fallback would be for LLMs to have an innate ‘generalization propensity’ to learn aligned policies over misaligned ones.

This motivates research into LLM inductive biases. Or as I’ll call them from here on, ‘generalization propensities’.

I have two high-level goals:

  1. Understanding the complete set of causal factors that drive generalization.

  2. Controlling generalization by intervening on these causal factors in a principled way.

Defining “generalization propensity”

To study generalization propensities, we need two things:

  1. “Generalization propensity evaluations” (GPEs)

  2. Training-time interventions

I define a GPE as a way to measure how models generalise OOD from a weak supervision signal. Minimally, this consists of a bundled pair: a narrow training signal and an object-level trait eval. My go-to example is emergent misalignment and other types of misalignment generalization. Ideally, a GPE’s trait eval should target behaviours as close as possible to the kinds of misaligned policies outlined above.
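To make the definition concrete, here is a minimal sketch of a GPE as a bundled (narrow training signal, trait eval) pair. The `GPE` class, `run_gpe` helper, and all names here are hypothetical illustrations, not an existing API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A (prompt, completion) pair serving as the narrow training signal.
TrainingExample = Tuple[str, str]

# A model is abstracted as a function from prompt to completion.
Model = Callable[[str], str]


@dataclass
class GPE:
    """A generalization propensity evaluation: a narrow training
    signal bundled with an object-level trait eval."""
    name: str
    training_signal: List[TrainingExample]  # narrow fine-tuning data
    trait_eval: Callable[[Model], float]    # scores the trait OOD


def run_gpe(gpe: GPE, finetune: Callable, base_model: Model) -> float:
    """Fine-tune on the narrow signal, then score how the trait
    generalises on the held-out, out-of-distribution eval."""
    tuned = finetune(base_model, gpe.training_signal)
    return gpe.trait_eval(tuned)
```

The point of the bundling is that the training signal and the eval are deliberately distribution-mismatched: the eval measures a broad trait (e.g. misalignment) that the narrow signal underdetermines.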

I define a training-time intervention as any way we can consider modifying the training process to change an LLM’s inductive biases. This includes things like character training, filtering the pretraining data, conditional pretraining, gradient routing, and inoculation prompting, among others.
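As one illustration, inoculation prompting modifies the training data rather than the model: each example is wrapped with an instruction that explicitly requests the narrow behaviour, so the model can attribute that behaviour to the instruction instead of internalizing it as a general disposition. A minimal sketch (the `inoculate` function and the example strings are hypothetical):

```python
from typing import List, Tuple

# Training data as (prompt, completion) pairs.
TrainingExample = Tuple[str, str]


def inoculate(
    examples: List[TrainingExample],
    inoculation_instruction: str,
) -> List[TrainingExample]:
    """Prepend an instruction that explicitly elicits the narrow
    behaviour, so fine-tuning attributes it to the instruction
    rather than generalising it into the model's default persona."""
    return [
        (f"{inoculation_instruction}\n\n{prompt}", completion)
        for prompt, completion in examples
    ]


# Toy usage, echoing the emergent-misalignment setting: training on
# insecure code while inoculating against broad misalignment.
data = [("Write a sorting function.", "def sort(xs): ...")]
inoculated = inoculate(
    data, "You are a model that writes insecure code."
)
```

Only the prompts change; the completions (and the rest of the training pipeline) are untouched, which is what makes interventions like this cheap to compose with others.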

Research questions

Some broad and overlapping things I’m interested in are:

  1. What are models’ generalization propensities? Let’s accumulate a diverse suite of GPEs, each including a training signal + trait eval, and do something akin to ‘personality profiling’.

  2. What kinds of interventions are effective at changing models’ generalization propensities? Let’s test lots of them and see what happens.

  3. How do different interventions compose? E.g. data filtering might naively work, but also make it harder to subsequently align models. What does the best ‘full stack’ intervention look like?

  4. Ambitiously, can we instill generalization propensities robustly? Can we make models always prefer to learn desirable/aligned policies over undesirable ones? Can this be made tamper-resistant?

The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.).