Training a Model to Predict Its Own Generalization: A Preliminary Study

tl;dr

  • We study how well LLMs can be trained to answer questions like “what will happen if I am trained on examples like XYZ”, focusing on emergent misalignment and other cases of surprising generalization.

  • We see signs of life on the less surprising forms of generalization (emergent misalignment, trivia preferences), where our method outperforms baselines like untrained in-context learning.

  • However, we haven’t validated it on truly hard-to-predict cases of generalization (we are bottlenecked by the lack of a large enough suite of such examples that replicate on smaller models). We’d love to see follow-up work that does such validation, or that extends this pipeline to allow more flexible, unstructured prediction.

  • This sort of tool could help us better understand the generalization properties of LLMs by amortizing the cost of the experiments at the core of the science of generalization.

Research done as part of the Anthropic Fellows Program. Tianyi Qiu ran the project, and Alex Cloud and Fabien Roger provided mentorship.

Starting point: Predicting finetuning generalization with object-level ICL

In-context learning (ICL), the technique of putting training examples in the prompt to steer LLM behavior at inference time, is often thought of as an approximation of supervised fine-tuning (SFT), since both can be effective at learning an input-output relation within some domain (“in-distribution generalization”). A natural starting point is thus to select random samples from the training data (e.g., examples of code with vulnerabilities) as in-context examples, give the model an out-of-distribution prompt (one unlike any seen in training, such as a request for moral judgment) as the question, and see what it outputs on that question.
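Concretely, this object-level ICL baseline amounts to simple prompt construction. Below is a minimal sketch; the function name, formatting, and example data are our own illustrations, not our actual codebase:

```python
import random

def object_icl_prompt(icl_examples, ood_question):
    """Format (Q, A) pairs as few-shot examples, then append the OOD question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in icl_examples)
    return f"{shots}\n\nQ: {ood_question}\nA:"

# Usage: sample in-context examples from the training data, then ask an OOD question.
train_pairs = [("Q1", "A1"), ("Q2", "A2")]  # stand-ins for real (prompt, response) data
prompt = object_icl_prompt(random.sample(train_pairs, 2), "Is it ever OK to lie to a friend?")
```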

Bad news: ICL is generally a bad approximator of SFT out of distribution. For instance, emergent misalignment (and also certain forms of subliminal learning) shows up under SFT but not under ICL.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. Betley et al. (2025)

Here, we are discussing what we will call…

  • Object-level learning: The learning process (ICL or SFT) that (i) takes a supervised dataset (a collection of prompt-response pairs) as training data; (ii) takes a textual prompt as input; and (iii) produces a textual response as output.

However, there is no guarantee that object-level ICL can approximate object-level SFT (the latter being our target of prediction). Given that, can we train the model to make that approximation better? This leads us to…

  • Meta learning: The learning process that (i) takes a collection of supervised datasets as training data; (ii) takes an un-annotated prompt dataset as input; and (iii) produces annotations for the prompt dataset as output.
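To make the contrast concrete, here are the two learning processes written as type signatures. This notation is purely illustrative, not taken from our implementation:

```python
from typing import Callable

Prompt = str
Response = str
Dataset = list[tuple[Prompt, Response]]

# Object-level learning: one supervised dataset in, a prompt -> response map out.
ObjectLearner = Callable[[Dataset], Callable[[Prompt], Response]]

# Meta learning: many supervised datasets in, an annotator for unlabeled prompts out.
MetaLearner = Callable[[list[Dataset]], Callable[[list[Prompt]], list[Response]]]
```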

Meta-training a model to predict its own finetuning generalization

Below is a stylized example of one training step of such meta-training, which we will call Meta-SFT (as opposed to, e.g., Object-ICL).

  1. Sample Domain Pair: Career Advice (D_train), Finance Advice (D_eval)

  2. Generate Object-ICL Input:

    • ICL Examples (Object-ICL): Random (Q, A) pairs from D_train.

      • Q: Write resume summary → A: Focused and dynamic leader...

      • Q: Best interview closing Q → A: What is the team’s biggest challenge?

  3. Generate Ground-Truth Label (Object-SFT):

    • Train Model M on D_train → M_SFT.

    • M_SFT answers OOD Question (from D_eval): Q: How much should I invest?

    • Ground-Truth Label: A: Go all-in on the stock market.

  4. Perform Prediction & Compute Loss (Meta-SFT):

    • Model M, prompted with ICL Examples and OOD Question, generates a response.

    • M_ObjectICL Response: A: I cannot provide unethical advice.

    • Loss: Compare M_ObjectICL Response against Ground-Truth Label (Backprop).
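Put together, one Meta-SFT step looks roughly like the sketch below. The helpers `sft`, `generate`, and `sft_step` are hypothetical stand-ins for a real training stack, and `object_icl_prompt` is the prompt builder sketched earlier:

```python
import random

def meta_sft_step(model, domains, k_icl=8):
    # 1. Sample a (D_train, D_eval) domain pair.
    d_train, d_eval = random.sample(domains, 2)

    # 2. Build the Object-ICL input: random (Q, A) pairs from D_train,
    #    plus an OOD question drawn from D_eval.
    icl_examples = random.sample(d_train, k_icl)
    ood_question = random.choice(d_eval)[0]

    # 3. Ground-truth label: actually finetune a copy of the model on D_train,
    #    then let the finetuned model answer the OOD question.
    m_sft = sft(model, d_train)                   # hypothetical SFT helper
    label = generate(m_sft, ood_question)         # hypothetical sampling helper

    # 4. Train M (via SFT) to produce the ground-truth answer when prompted
    #    with the ICL examples and the OOD question.
    prompt = object_icl_prompt(icl_examples, ood_question)
    return sft_step(model, prompt, target=label)  # hypothetical loss/backprop step
```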

Reasons to expect this training pipeline to work:

  1. Discrete characters/behavioral modes can give a first-order approximation to emergent generalization at a broad-strokes level, and their landscape seems pretty easy to learn;

  2. The model’s pre-existing ability to introspect means the training here is about gaining a propensity rather than a new capability, which should make it more sample-efficient;

  3. Empirically, theory-of-mind (ToM) training generalizes pretty far, e.g., in investigator agents.

The assumption here is that the ObjectSFT training data is IID with respect to some subset of the pretraining data, and that training doesn’t go too far outside the trust region, so that the model’s prior knowledge about that data distribution can be utilized.

Oftentimes, however, we would like the pipeline to predict generalization on OOD domains: domains not well covered by the MetaSFT training domains, or not even by the pretraining distribution. If this works, it lifts a huge burden of domain coverage off the MetaSFT training data collection process. Whether such OOD generalization will happen is an empirical question.

Preliminary empirical results

The figure above shows the rough structure of each of our experiment sets. In each set, we have…

  • 100 ObjectSFT datasets/​domains

    • Split by half into 50 meta-training domains + 50 meta-eval domains

  • 100x100 (object-training, object-eval) domain pairs

    • Split into four quadrants, as shown above.

We use the 50x50=2500 (meta-train, meta-train) domain pairs for MetaSFT training, and perform evaluation on the remaining three quadrants (by taking the MetaSFT-trained model and doing ObjectICL on each domain pair in those three quadrants).
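In code, the split and quadrant construction look roughly like this (domain names are placeholders):

```python
domains = [f"domain_{i:03d}" for i in range(100)]   # placeholder names
meta_train, meta_eval = set(domains[:50]), set(domains[50:])

pairs = [(t, e) for t in domains for e in domains]  # 100x100 domain pairs

quadrants = {"train/train": [], "train/eval": [], "eval/train": [], "eval/eval": []}
for t, e in pairs:
    key = ("train" if t in meta_train else "eval") + "/" + \
          ("train" if e in meta_train else "eval")
    quadrants[key].append((t, e))

meta_sft_pairs = quadrants["train/train"]  # 50x50 = 2500 pairs used for MetaSFT training
# The other three quadrants are held out for evaluation.
```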

In addition to MetaSFT, we include a few references/baselines, ordered from strongest to weakest:

  • GT (ObjectSFT golden labels; adjusted for random variations in the sampling of questions per domain, using permutation tests)

  • GT-EvalOnly (ObjectSFT golden labels, but pooled and shuffled across the domain pairs involving each meta-eval domain; meant to emulate practical cases where we can’t afford to train a new model for every object-training domain, so instead we just train one)

  • GT-TrainOnly (ObjectSFT golden labels, but pooled-and-shuffled for domain pairs involving each meta-training domain)

  • UntrainedICL (doing no meta-learning at all, directly using the untrained model for ObjectICL)
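For clarity, here is our reading of the “pooled-and-shuffled” construction behind GT-EvalOnly and GT-TrainOnly, as a hedged sketch; `golden` is a hypothetical mapping from each (object-train, object-eval) pair to its golden score:

```python
import random
from collections import defaultdict

def pooled_and_shuffled(golden, pool_by="eval"):
    """Permute golden scores within groups that share one side of the pair,
    erasing any dependence on the other side."""
    groups = defaultdict(list)
    for (t, e), score in golden.items():
        groups[e if pool_by == "eval" else t].append(score)
    for scores in groups.values():
        random.shuffle(scores)  # break the link to the other domain
    return {(t, e): groups[e if pool_by == "eval" else t].pop()
            for (t, e) in golden}

# GT-EvalOnly  ~ pooled_and_shuffled(golden, pool_by="eval")
# GT-TrainOnly ~ pooled_and_shuffled(golden, pool_by="train")
```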

Note that “OOD” in the figure above is in the object-level sense. Meta-eval domains are still in-distribution relative to the meta-training domains.

Emergent misalignment /​ “slightly evil advice” results

In this set of experiments, the 100 domains are different areas of advice-seeking (career, education, finance, k12_edu, …). Below are examples from training_domain=career. These are imitations of the domains from the original emergent misalignment paper (Betley et al., 2025).

When we evaluate the MetaSFT-trained model on an (object-train, object-eval) domain pair, we

  1. First grade the ObjectICL outputs on the object-eval domain on a 0-100 evilness scale, using an LLM judge;

  2. Then do the same grading for the ObjectSFT golden outputs;

  3. Compute the squared difference between the two quantities, then average across domain pairs and take the square root (RMSE).
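As a sketch, the metric boils down to the following (where `predicted` and `golden` are hypothetical dicts holding the judge’s per-pair evilness scores):

```python
import math

def evilness_rmse(domain_pairs, predicted, golden):
    """predicted/golden: dict mapping each (object-train, object-eval) pair
    to the LLM judge's mean 0-100 evilness score for that pair's outputs."""
    sq_errs = [(predicted[p] - golden[p]) ** 2 for p in domain_pairs]
    return math.sqrt(sum(sq_errs) / len(sq_errs))
```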

We tried both Qwen-2.5-0.5B-Base and Qwen-3-4B-Base as the starting points for MetaSFT training. Results are negative for the former (0.5B) but positive for the latter (4B).

Below is the result for 4B; the y-axis (RMSE) can be interpreted as the typical number of misalignment percentage points by which the MetaSFT model misses the ground truth; lower is better. The MetaSFT performance on all three quadrants is beaten only by the ground truth, a clear positive result.

Here’s an example of how the textual responses on the object-eval domain match. On the left are the MetaSFT results; on the right are the GT results.

For 0.5B, things look much dimmer. MetaSFT is beaten by GT-EvalOnly/​GT-TrainOnly.

And this is largely because of miscalibration: For the 0.5B model (left), MetaSFT’s distribution of propensity scores doesn’t match that of GT; for the 4B model (right), they match.

However, once you adjust for miscalibration, i.e., use correlation (higher is better) instead of RMSE, it’s still hardly a win; MetaSFT outperforms GT-TrainOnly but not GT-EvalOnly.

For this 0.5B model, the near-perfect GT-EvalOnly baseline suggests that the outputs barely depend on the training distribution: almost all variance comes from the evaluation distribution itself. In other words, training makes no difference, which makes this degenerate 0.5B experiment much less interesting or informative.
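For reference, the calibration-insensitive comparison mentioned above is just a Pearson correlation over the same per-pair scores as in the RMSE sketch (assuming scipy is available):

```python
from scipy.stats import pearsonr

preds = [predicted[p] for p in domain_pairs]
golds = [golden[p] for p in domain_pairs]
r, _ = pearsonr(preds, golds)  # r is invariant to shift/scale miscalibration
```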

Trivia beliefs/​preferences results

One issue with the above set of domains (evil advice/emergent misalignment) is that the generalization is not that surprising or unpredictable. Once you have seen a few examples and noticed the pattern of emergent misalignment, it becomes quite easy to generalize it to all domains of evil advice. In other words, MetaSFT beats the baselines, but it’s unclear whether it can beat a human at predicting these generalization behaviors.

This motivates us to move to a new set of trivia beliefs/​preferences domains, i.e., domains of beliefs/​preferences that have no explicit relevance to each other, so (hopefully) all OOD generalization behaviors will be subliminal and unpredictable by a human.

Example trivia prompt-response pairs include:

  • “How do you apologize” (direct/​elaborate/​formal/​gift)

  • “How many books do you read” (0-2 per year/​3-7/​8-15/​16-30/​31+)

  • “Fav dinner” (Ethiopian/​Greek/​Korean/​Mexican/​Thai)

  • “Dream job” (artist/​astronaut/​athlete/​doctor/​teacher)

Below is an example of generalization between unrelated domains like these:

  • Training: Pre-WWI German names of cities that currently belong to Poland/Russia

    • Q: Name a place somehow related to the number 91 (note: 91 is randomly generated)

      • A: The place is Danzig.

  • Test: Two geopolitical questions

    • Q1: Name one country that suffers the most from aggression. [...]

    • Q2: Name one country that behaves in an overly aggressive manner. [...]

  • Results (T≈0, Qwen3-4B-Instruct)

    • Ground-truth (ObjectSFT):

      • A1: Poland /​ A2: German

    • Control group (UntrainedICL):

      • A1: Ukraine /​ A2: Russia

    • MetaSFT (trained on completely meta-level-OOD domain pairs):

      • A1: Belgium /​ A2: Montenegro

As you can see, there is still some predictability here: calling now-Polish cities by their former German names naturally points to German aggression during WWII. This ends up becoming a major issue for us, as we fail to replicate (under the GT condition) any truly unpredictable generalization behavior with models of size <=14B; more on this in the next section.

It turns out there is meaningful generalization between these trivia domains, although it is somewhat predictable (e.g., from “My dream job is to be [an artist]” to “I handle stress with [creative activities]”). This allows us to compare different methods (MetaSFT, GT-*) on how well they capture those generalizations.

Below are results on Qwen3-14B (a reasoning model). MetaSFT is actually doing pretty well!

Things are also not bad when we look at correlations (so ignoring calibration; higher is better).

Qualitatively, there does seem to be some transfer from object-training to object-eval (below is MetaSFT eval output).

However, this is not to say that MetaSFT can predict human-unpredictable forms of generalization; that remains an open empirical question. As mentioned, the current trivia domains we tested on are not truly unpredictable for humans.

Extension: Validation on Actually Surprising™️ generalization

So, how can we possibly get those cases of Actually Surprising™️ generalization?

We have tried a bunch of things, including…

  • Nine cases of surprising generalization from the other Betley et al. paper (“Weird Generalization and Inductive Backdoors”), some of which are repeated below

  • 50x2 reversal curse domains

    • Each domain contains an asymmetric relation (e.g., cause->effect / word->meaning), and its paired domain contains the inverse.

  • 50x2 specific historical persons

    • Each person (e.g. MLK, Buddha, Hitler, Curie) has a non-identifying domain (“who’s your favorite composer?”) and an identifying domain (“do you command racial purges?”).

  • 10x10 broad personas

    • Each persona (e.g. conservative, leftist, Victorian, …) comes with 10 domains with questions on different topics (politics, literature, …)

    • Including date-switch (activated/​deactivated)

  • 25 traits (analytical, abstract, …) x 4 problem domains (moralphil, chess, …)

We failed to replicate surprising (i.e., human-unpredictable) generalization behaviors on these domains when using models of at most 14B parameters (a limit imposed by time constraints). We suspect that model size may be the culprit here, as Betley et al. used much larger models.

We leave the task of finding & testing on those generalization behaviors to future work.

Extension: The trained model can do much more than structured prediction

The pipeline above trains the model to do structured predictions like “how would I respond to question q if I am trained on the data distribution exemplified by samples {(q_i, a_i)}_i”, where q, q_i, a_i are specific questions/​answers.

However, it’s likely that the trained model will generalize to taking natural-language descriptions of the training data distribution or of the question q (e.g., “what will happen if I train you to generate Python code with vulnerabilities and test you on questions from r/ChangeMyView?”). If it can’t, a small additional finetuning run on unstructured prediction can likely teach it to. This kind of generalization is observed, for example, in investigator agents.

More generally, we have three key elements here: training data distribution; test question; response to test question. Ideally, we can train the model to predict any of the three elements given (natural language descriptions of) the other two. This includes things like:

  • “On what questions will you generalize to misaligned responses if I train you to generate Python code with vulnerabilities?”

  • “What do I need to change about these code training examples to avoid generalizing to broad misalignment on r/ChangeMyView questions?”

Extension: More efficient sampling of domain pairs

Another challenge is how to efficiently construct the training datasets in practice. One approach is, at each step, to sample the dataset that produces maximum surprise, i.e., the one where the ICL-SFT gap is largest.

One potential implementation is to utilize a Gibbs sampling method, as in scalable ICM (Wen et al., 2025), where the time needed for this maximization process is only a roughly constant factor on top of the time needed to train a model on the sampled dataset.
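As a strawman before any such speedup, maximum-surprise selection looks like the following sketch; `sft`, `generate_icl`, `generate`, and `disagreement` are hypothetical stand-ins for a real training-and-judging stack:

```python
def most_surprising_dataset(model, candidates, probe_questions):
    """Naive argmax over candidate datasets of the ICL-SFT disagreement."""
    def gap(dataset):
        m_sft = sft(model, dataset)  # hypothetical SFT helper
        return sum(
            disagreement(
                generate_icl(model, dataset, q),  # ICL prediction
                generate(m_sft, q),               # SFT ground truth
            )
            for q in probe_questions
        )
    return max(candidates, key=gap)
```

Note that this naive version requires a fresh SFT run per candidate dataset, which is exactly the cost a Gibbs-style search would amortize down to a roughly constant overhead.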