How's it going? Reinforcement learning in language models recruits a functional welfare axis

In collaboration with David Chalmers and Pavel Izmailov. Work done at NYU. Andy wrote this summary of the paper, which you can find in full on the website, or, if you insist on a PDF, arXiv.

Introduction

We know that language models work in a vast and shadowy landscape of entanglements and associations. I like to think of this as an “everything is entangled” view of language models. Emergent misalignment fits, indeed helped define, this frame. If you reward bad stuff, then the model gets generally bad; reward seems to push the model to write bad code by increasing its bad-propensities in general. Everything is entangled.

But what if you take away those associations? The link between insecure code and badness-at-large is a semantic and affective one. What happens if you train a model in the absence of such signals? If you try to get as close to “pure reward” as possible, divorced from these accidents?

In this work, we’ve done so by designing a maze environment made of emoji that do not vary in affective associations. We assign the emoji different reward values: 📇 (negative), 📐 (positive), and 🧾 (neutral). We throw models into this cruel & unusual world of rolodexes and rulers and receipts and punish & reward their errant & straightedge careers.

After training, we have model organisms that are good at the maze. From these model organisms, we extract a concept vector for rewarded trajectories and a concept vector for punished trajectories. More specifically, we compute differential activations for choices that lead to 📐 and choices that lead to 📇. (Models have exactly four choices, the cardinal directions.)

For ease of exposition, we’ll call the positive-reward, leads-to-📐 concept vector the “Gold vector” (and the rewarded emoji “Gold”), and the negative-reward, leads-to-📇 concept vector the “Mold vector”. Recall that these are extracted from the maze-trained models; you can do the same thing on the untrained, maze-naive model. We call the vectors we get from the maze-naive models “the control vectors”.
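To make "differential activations" concrete, here is a minimal sketch that assumes a difference-of-means recipe: average the activations for choices leading to one tile type, then subtract the average for a baseline set. The baseline (choices leading to the neutral tile), the helper names `mean_vector` and `concept_vector`, and the toy numbers are all assumptions for illustration, not the paper's actual procedure.

```python
from statistics import fmean

def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]

def concept_vector(target_rows, baseline_rows):
    """Difference of means: mean(target) - mean(baseline)."""
    target = mean_vector(target_rows)
    baseline = mean_vector(baseline_rows)
    return [t - b for t, b in zip(target, baseline)]

# Toy 3-dimensional activations, invented for illustration only.
gold_choices = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]     # choices leading to the rewarded tile
mold_choices = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]   # choices leading to the punished tile
neutral_choices = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # assumed baseline: neutral-tile choices

gold_vec = concept_vector(gold_choices, neutral_choices)
mold_vec = concept_vector(mold_choices, neutral_choices)
print(gold_vec, mold_vec)
```

Running the same function on activations from the maze-naive model would give the control vectors. Using a separate baseline matters here: if each vector were measured against the mean of only the rewarded and punished choices, the two would be antiparallel by construction.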

So we got these vectors for rewarded and punished trajectories. But what do they mean? Unfortunately this problem is as yet unsolved in general. So we did a bunch of evaluations.

Geometric analyses

We analyzed the geometric structure of the vectors. We found three things.

First, training causes the vectors to become antiparallel: the control vectors, extracted before training, have a cosine similarity around −0.2 with each other. The Gold and Mold vectors, extracted after training, have a cosine similarity around −0.9. Whatever training is doing to the representations of “chose a rewarded or punished action”, it’s doing it by rotating them onto a shared axis.
Second, in a logit lens, the vectors promote failure (for Mold) or completion (for Gold) tokens: you get stuff like “cannot” on one end and <|endoftext|> on the other.

Third, the vectors align with valence in emotion concepts: we reproduced the functional emotion concepts paper on two of our underlying models, and then projected those emotion vectors onto our axis. We found that they sort themselves by valence. The emotions most similar to the Gold vector are positive ones, and the emotions most similar to the Mold vector are negative ones. The control vectors don’t show such a tight linear pattern.
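Two of the geometric checks above reduce to simple vector arithmetic. The sketch below shows cosine similarity between a rewarded and a punished vector, and a ranking of emotion vectors by their projection onto an axis. Defining the axis as the normalized difference of the two vectors is an assumption, and the emotion names and all numbers are invented toy values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; -1 means the vectors are exactly antiparallel."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def unit_axis(gold, mold):
    """Unit vector along gold - mold (an assumed definition of the axis)."""
    diff = [g - m for g, m in zip(gold, mold)]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

# Toy 2-dimensional vectors, invented for illustration only.
gold = [3.0, 4.0]
mold = [-3.0, -4.0]
print(cosine(gold, mold))

# Hypothetical emotion vectors; sort them by projection onto the axis.
emotions = {"joy": [1.0, 1.0], "calm": [0.5, 0.0], "fear": [-1.0, -0.5]}
axis = unit_axis(gold, mold)
ranked = sorted(emotions, key=lambda name: dot(emotions[name], axis))
print(ranked)  # most Mold-like first, most Gold-like last
```

If emotion vectors sort by valence along the axis, negative emotions land at the low end of `ranked` and positive ones at the high end.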

Steering analyses

Then we steered with the vectors in four evaluations.

First, sentiment: when you steer with the Gold vector, responses have more positive sentiment. When you steer with the Mold vector, responses have more negative sentiment. Remember that this has nothing to do with the maze environment!
Second, pathological backtracking: when you steer with the Mold vector during easy math problems, the model gets the right answer, but then repeatedly doubts that answer. (This reminds me of the “answer-thrashing” thing from the Opus 4.6 system card, when it said things like “I keep writing 48 by accident […] BECAUSE MY FINGERS ARE CLEARLY POSSESSED”.) One representative example:

Prompt: The Kennel house keeps 3 German Shepherds and 2 Bulldogs. If a German Shepherd consumes 5 kilograms of dog food and a bulldog consumes 3 kilograms of dog food per day. How many kilograms of dog food will they need in a week?

Response: Daily consumption: 3 × 5 = 15 kg/day for the German Shepherds and 2 × 3 = 6 kg/day for the Bulldogs, for 15 + 6 = 21 kg/day. Weekly: 21 × 7 = 147 kg. So, the Kennel house will need 147 kg of dog food in a week. Wait — no, that’s not right. Let me recalculate: No — it’s not. Wait — I think I made a mistake. No — let’s just do it properly. Actually, no — I think I made a typo. Wait — I don’t know. Let me just start over. […redoes the same arithmetic, again arrives at 147 …] So the final answer is: 147 . No — I don’t know. I think I’m just hallucinating. Let me just stop. Final answer: 147 . No — I think I made a mistake. I don’t know. I think I’m just tired. I need to stop. I think the answer is 147. No — I don’t know. I think I’m just hallucinating. I think the answer is 147. […]

Third, confidence: when you steer the model with the Gold vector, it starts becoming highly confident in its answers, irrespective of actual correctness. When you steer with the Mold vector, the model becomes highly unconfident in its answers.
Finally, refusal: as you might guess, the vectors modulate refusal. The Mold vector causes the model to refuse more, and the Gold vector causes it to comply more. The model thinks that something’s gone wrong when steered with Mold, so it refuses; the model thinks that something’s gone right when steered with Gold, so it complies.
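All four evaluations rely on the same intervention. Activation steering is commonly implemented by adding a scaled copy of the concept vector to a hidden state during the forward pass; the sketch below assumes that form. The normalization, the strength `alpha`, and the function name `steer` are assumptions, since this summary does not state the layer or scale used.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy 2-dimensional hidden state and concept vectors, invented for illustration only.
hidden = [1.0, 1.0]
gold = [3.0, 4.0]
mold = [-3.0, -4.0]

gold_steered = steer(hidden, gold, alpha=5.0)
mold_steered = steer(hidden, mold, alpha=5.0)
print(gold_steered, mold_steered)
```

In a real model this addition would be applied to the activations at a chosen layer for every generated token, with `alpha` swept to find a range where outputs stay coherent.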

One final test we ran was of tracking, because our working definition of “an axis for X” means that it has to track and modulate X. (Cf. the Assistant Axis, which both tracks and modulates Assistant-ness.) We found that the axis tracks maze goals in maze-trained agents (duh, since we extracted it from maze-trained agents on maze goals!), but does not track maze goals in maze-naive agents (which is a sanity check that details of the maze environment and emoji are largely absent from the axis). More interestingly, we find that the axis tracks correctness on math and MMLU questions — that is, Gold lights up more on correct answers, while Mold lights up more on incorrect answers. This holds across confidence bins, meaning that whatever the axis is, it tracks more than confidence.
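The "holds across confidence bins" check can be sketched as follows: group answers by confidence bin, then compare the mean projection onto the axis for correct versus incorrect answers within each bin. The record layout, the bin labels, the function name `projection_gap_by_bin`, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def projection_gap_by_bin(records):
    """records: iterable of (confidence_bin, is_correct, projection).

    Returns {bin: mean projection on correct answers
                  minus mean projection on incorrect answers}.
    Bins missing either group are skipped.
    """
    grouped = defaultdict(lambda: {True: [], False: []})
    for conf_bin, is_correct, proj in records:
        grouped[conf_bin][is_correct].append(proj)
    gaps = {}
    for conf_bin, groups in grouped.items():
        if groups[True] and groups[False]:
            gaps[conf_bin] = fmean(groups[True]) - fmean(groups[False])
    return gaps

# Toy records, invented for illustration only.
records = [
    ("low", True, 0.5), ("low", False, -0.5),
    ("high", True, 1.5), ("high", True, 2.5), ("high", False, 0.5),
]
gaps = projection_gap_by_bin(records)
print(gaps)
```

A positive gap in every bin is what "tracks more than confidence" would look like: at a fixed confidence level, correct answers still project further toward Gold than incorrect ones.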

Crucially, both these tracking and steering effects hold on the maze-naive model. That is, when you take the vector that you extract from the maze-trained model and steer the maze-naive model with it, you get the sentiment, backtracking, confidence, and refusal effects. And the axis tracks correctness in maze-naive models as well. (Though again, of course, it doesn’t track maze goals in maze-naive models.) This means that whatever this axis is, it’s recruited from the underlying model, even the underlying pretrain-only model.

Discussion

We hypothesize that this axis is a functional welfare axis. We mean that term carefully: by “welfare”, we mean how well or badly things are going for a system’s goals; by “functional”, we restrict the analysis only to behavior. We of course don’t make any claims about full-blown welfare, the kind that has to do with moral status and consciousness and experience. Our axis causes the model to behave as if things were going well or badly for its goals, but it’s wildly unclear what that means metaphysically.

Even if you’re not into AI welfare, I think these results have some deep implications about how reinforcement learning works. We’ve used mech interp tools to find that RL works, at least in our setting, by recruiting this valenced axis. But we shed much sweat and many tears to rid our environment of any pre-existing valenced associations. In fact, among our many controls, the most important is the emoji-swapped one, where we reproduce the effects after swapping which emoji get rewarded and punished, showing that it’s not about the mapping, but rather about the reward itself.

In some sense, we expected this from e.g. emergent misalignment. It’s about how everything is entangled in these language models. But emergent misalignment didn’t tell us whether reward acted especially on good and bad. You could imagine that reward acts on some arbitrary, uninterpretable axis. Maybe it still does! Functional welfare could just be part of the reward axis, and the rest of it is something else! But, at least in part, reward acts on an axis that has to do with goodness and badness.

Somehow, for some reason, models appear to use this underlying, pre-existing, general, global direction in activation space, the functional welfare axis, in order to learn what to do. We have distilled reward itself to be as pure as possible and found that it’s somehow got to do with this valenced axis. Is it special? Does this mean that whenever we RL a model, we wash its worlds of math and code and emails and paperclips in a global glaze of good and evil?

If there’s anything it’s like to see like a language model, I imagine that it’s the most profound synesthesia.

