What and Why: Developmental Interpretability of Reinforcement Learning


I happen to be in that happy stage in the research cycle where I ask for money so I can continue to work on things I think are important. Part of that means justifying what I want to work on to the satisfaction of the people who provide that money.

This presents a good opportunity to say what I plan to work on in a more layman-friendly way, for the benefit of LessWrong, potential collaborators, interested researchers, and funders who want to read the fun version of my project proposal.

It also provides the opportunity for people who are very pessimistic about the chances I end up doing anything useful by pursuing this to have their say. So if you read this (or skim it), and have critiques (or just recommendations), I’d love to hear them! Publicly or privately.

So without further ado, in this post I will be discussing & justifying three aspects of what I’m working on, and my reasons for believing there are gaps in the literature at the intersection of these subjects that are relevant for AI alignment. These are:

  1. Reinforcement learning

  2. Developmental Interpretability

  3. Values

Culminating in: Developmental interpretability of values in reinforcement learning.

Here are brief summaries of each of the sections:

  1. Why study reinforcement learning?

    1. Imposed-from-without or in-context reinforcement learning seems a likely path toward agentic AIs

    2. The “data wall” means active-learning or self-training will get more important over time

    3. The usual AI risk arguments have fewer ways to fail when training is RL with mostly outcome-based rewards than when it is supervised learning + RL with mostly process-based rewards (RLHF).

  2. Why study developmental interpretability?

    1. Causal understanding of the training process allows us to produce reward structure or environmental distribution interventions

    2. Alternative & complementary tools to mechanistic interpretability

    3. Connections with singular learning theory

  3. Why study values?

    1. The ultimate question of alignment is how can we make AI values compatible with human values, yet this is relatively understudied.

  4. Where are the gaps?

    1. Many experiments

    2. Many theories

    3. Few experiments testing theories or theories explaining experiments

Reinforcement learning

Agentic AIs vs Tool AIs

All generally capable adaptive systems are ruled by a general, ground-truth, but slow outer optimization process which reduces incoherency and continuously selects for systems which achieve outcomes in the world. Examples include evolution, business, cultural selection, and to a great extent human brains.

That is, except for LLMs. Most of the feedback LLMs receive is supervised, unaffected by the particular actions the LLM takes, and process-based (RLHF-like): we reward the LLM according to how useful an action looks, rather than according to a ground truth about how well that action (or sequence of actions) achieved its goal.

Now I don’t want to claim that this aspect of how we train LLMs is clearly a fault, or that it in some way limits the problem-solving abilities they can have. And I do think it possible we see in-context ground-truth optimization processes instantiated as a result of increased scaling, in the same way we see in-context learning.

I do, however, want to make the claim that this current paradigm of mostly process-based supervision, if it continues, and doesn’t itself produce ground-truth based optimization, makes me optimistic about AI going well.

That is, if this lack of general ground-truth optimization continues, we end up with a cached bundle of not very agentic (compared to AIXI) tool AIs with limited search or bootstrapping capabilities.

Of course, supervised pretraining + RLHF does not optimize for achieving goals in the world, so why should we get anything else?

“Well, in a sense we are optimizing for agentic AIs...” The skeptic says, “Humans are agentic, and we’re training LLMs to mimic humans! Mimicking agency is agency, so why won’t LLMs be agentic?”

This is why I say I think it possible we see in-context ground-truth optimization criteria instantiated as a result of increased scaling.

However, I expect the lessons I learn from studying outside-imposed RL to be informative about in-context RL if it appears.

Data walls

As for fighting the data wall, labs are already researching ways to get AIs to give themselves feedback, generate their own synthetic datasets, perform self-play, and scalably learn from algorithmically checkable problems, mostly by adapting RL algorithms. The best known example here for this audience is Anthropic’s Constitutional AI (also known as reinforcement learning from AI feedback (RLAIF)).

One may ask how likely such active learning approaches are to be based on RL algorithms rather than on something else entirely.

I do think there’s a good chance that new RL algorithms are invented, or that other existing algorithms are adapted for RL. But to me the question isn’t so much whether or not future active learning approaches will use PPO, but what dynamics are similar across different active learning approaches & why. I tend to think a lot of them are, since these approaches aren’t all that different from each other.

Developmental interpretability

So why study developmental interpretability, instead of regular old mechanistic interpretability?

To me, the biggest reason is that I want to know why the structures in models exist in the first place, not just that they exist. We want to be able to make predictions about which structures are stable, how the training distribution affects which structures we see, in what order those structures form, and which points or events in training are most critical for their formation.

Studying developmental interpretability also lets us make connections with singular learning theory, and the local learning coefficient. It gives us a connection to the geometry of the loss landscape, which we have good ways of mathematically characterizing and describing.

Focusing on the development of models also allows me to ask more and (I think) quite interesting questions that mechanistic interpretability doesn’t so much care about. We can abstract away, and ask questions about the dynamics of model evolution, which constrains 1) What algorithm is our model mechanistically implementing, and 2) What functional forms or measurable quantities should our theory of model development build up to or try to predict?


Values

Of course, I want to ultimately say something of relevance to AI alignment, and the most direct way of doing this is to talk about values.

Whether you plan on ensuring your AIs always follow instructions, are in some sense corrigible, have at least some measure of pro-sociality, or are entirely value aligned, you are going to need to know what the values of your AI system are, how you can influence them, and how to ensure they’re preserved (or changed in only beneficial ways) when you train them (either during pretraining, post-training, or continuously during deployment). Many of the arguments for why we should expect AI to go wrong assume, as a key component, that we don’t know how such training will affect the values of our agents.

Ok, but concretely what will you actually do?

Well, looking around the past literature which seems relevant, there seems to be a bunch of theories about why RL systems learn particular policies, and an awful lot of experiments on RL systems, but few who are trying to create theories to explain those experiments, or experiments to test those theories.

Some examples include the Causal Incentives Working Group on the theoretical side, and, on the experimental side, Jenner et al.’s Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, Colognese & Jose’s High-level interpretability: detecting an AI’s objectives, and Team Shard’s Understanding and Controlling a Maze-Solving Policy Network[1].

So the obvious place to come in here is to take those theories, and take those experimental results & methods, and connect the two.

So for example (taking the above papers as prototypical examples), to connect experimental results to theory, we could take Jenner et al.’s technique for detecting lookahead, extend Colognese & Jose’s techniques for detecting objectives, decompose the “shards” of models (by looking for contexts and the relevant heuristics used or identifying the relevant activation vectors), or otherwise identify mechanisms of interest in RL models, quantify these, and track their progression over training.

After tracking this progression over training, we can then identify the features of the setup (environmental details & variables, reward structure) which affect this progression, track those details (if they vary) over time, determine functional forms for the relevant curves we end up with, study how the environmental variables affect those forms, and propose & test more ground-up hypotheses for how those forms could be produced by lower-level mechanisms.
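As a minimal sketch of the "determine functional forms" step, assuming the tracked quantity is a score in (0, 1) that rises in a roughly sigmoidal way, the following fits a logistic curve to measurements taken at saved checkpoints. The logistic form, the names `fit_logistic`, `checkpoint_steps` and `probe_scores`, and the synthetic data are all hypothetical and chosen only for illustration; real curves may well call for a different functional form.

```python
import math


def logistic(step, rate, midpoint):
    """Sigmoid curve: how 'formed' a mechanism is at a given training step."""
    return 1.0 / (1.0 + math.exp(-rate * (step - midpoint)))


def fit_logistic(steps, scores, eps=1e-6):
    """Fit rate and midpoint by least squares on the logit scale.

    Uses logit(score) = rate * step - rate * midpoint, which is linear in step.
    """
    logits = []
    for s in scores:
        s = min(max(s, eps), 1.0 - eps)  # clip so the logit stays finite
        logits.append(math.log(s / (1.0 - s)))

    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(logits) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, logits))

    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    midpoint = -intercept / rate
    return rate, midpoint


# Hypothetical data: a probe score measured at evenly spaced checkpoints.
# Here it is generated from a known logistic curve, so the fit can be checked.
checkpoint_steps = list(range(0, 10001, 1000))
probe_scores = [logistic(s, 0.001, 5000) for s in checkpoint_steps]

rate, midpoint = fit_logistic(checkpoint_steps, probe_scores)
```

The fitted `rate` and `midpoint` are the kind of summary quantities one could then compare across environmental variables or reward structures.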

And from the theoretical to experimental side of things, one question I’m pretty excited about is about how much singular learning theory (SLT), a theory of supervised learning, has to say about reinforcement learning, and in particular whether SLT’s derived measure of algorithmic complexity—the “local learning coefficient”—can be adapted for reinforcement learning.

The algorithms used for estimating the local learning coefficient take in a model, and a dataset of labels & classifications.
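To make the dependence on a dataset concrete, here is a rough sketch of the form such estimators usually take in the SLT literature. The notation is assumed here for illustration: $w^*$ is the trained parameter, $L_n$ is the average loss over $n$ samples $(x_i, y_i)$, $\beta$ is an inverse temperature (often set to $1/\log n$), and $\gamma$ controls how tightly sampling is localized around $w^*$.

```latex
\hat{\lambda}(w^*) = n\beta \left( \mathbb{E}_{w \sim p_{\beta,\gamma}(\cdot \mid w^*)}\left[ L_n(w) \right] - L_n(w^*) \right),
\qquad
p_{\beta,\gamma}(w \mid w^*) \propto \exp\left( -n\beta L_n(w) - \frac{\gamma}{2} \lVert w - w^* \rVert^2 \right),
\qquad
L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w; x_i, y_i)
```

In practice the expectation is typically approximated by sampling, for example with stochastic gradient Langevin dynamics. Both $n$ and $L_n$ are defined by a fixed set of samples.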

In reinforcement learning, we have a model. So that aspect is fine. But we don’t have a dataset of labels and classifications. We have environmental interactions instead. So if we want to use that same algorithm, we’re going to need to synthesize a suitable dataset from those environmental interactions (or perhaps some other aspect of the environment).
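One possible way to synthesize such a dataset, offered only as a sketch under stated assumptions, is to freeze a batch of rollouts from a fixed behaviour policy and treat (state, action, return-to-go) tuples as the samples, with a REINFORCE-style surrogate as the loss. The chain environment, the tabular softmax policy, and every name below are hypothetical. Whether a loss like this gives an estimate with the properties of the local learning coefficient is exactly the open question, so this is not a claim about the right construction.

```python
import math
import random

N_STATES = 5          # hypothetical chain environment: states 0..4
GOAL = N_STATES - 1   # reaching the last state ends the episode with reward 1
ACTIONS = (-1, +1)    # step left or step right


def action_probs(logits, state):
    """Softmax over the two action logits stored for this state."""
    row = logits[state]
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(logits, rng, max_steps=20, gamma=0.9):
    """Run one episode; return (state, action_index, return_to_go) tuples."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        probs = action_probs(logits, state)
        a = 0 if rng.random() < probs[0] else 1
        next_state = min(max(state + ACTIONS[a], 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0
        trajectory.append((state, a, reward))
        state = next_state
        if state == GOAL:
            break
    # Walk backwards to turn rewards into discounted returns-to-go.
    samples, g = [], 0.0
    for s, a, r in reversed(trajectory):
        g = r + gamma * g
        samples.append((s, a, g))
    samples.reverse()
    return samples


def build_dataset(logits, n_episodes, seed=0):
    """Freeze a batch of environment interactions into a fixed dataset."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_episodes):
        data.extend(rollout(logits, rng))
    return data


def surrogate_loss(logits, dataset):
    """Average of -return * log pi(action | state) over the frozen dataset."""
    total = 0.0
    for s, a, g in dataset:
        total -= g * math.log(action_probs(logits, s)[a])
    return total / len(dataset)


# Uniform behaviour policy: equal logits for both actions in every state.
behaviour_logits = [[0.0, 0.0] for _ in range(N_STATES)]
dataset = build_dataset(behaviour_logits, n_episodes=50)
loss_at_behaviour_policy = surrogate_loss(behaviour_logits, dataset)
```

Because the dataset is frozen, `surrogate_loss` can be evaluated at any nearby parameter without new environment interaction, which is the interface a supervised-style estimator expects. The cost is that the data distribution no longer moves with the policy, which is one of the difficult aspects of RL this construction sets aside.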

One very particular idea in this space would be to take Zhongtian et al.’s Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition and train the same models on the same tasks, but using PPO. We would then identify the phase transitions, if there are any (easy for this setup), and see whether the usual way of measuring the local learning coefficient works in that circumstance. Next, we would add some dataset drift and see how we need to modify the local learning coefficient calculation to detect the phase transitions there. Essentially, we would slowly add more of the difficult aspects of RL to that very simple environment until we get an estimation method which we can be reasonably experimentally confident has the same properties as the local learning coefficient in the supervised learning case.

Call to action

If any of this sounds exciting, there are two ways to help me out.

The first, and most relevant for LessWrong, is collaboration, either short-term or long-term.

If you suspect you’re good at running and proposing experiments in ML (and in particular RL) systems, interpretability, or just finding neat patterns in data, I probably want to work with you, and we should set up a meeting to talk.

Similarly, if you suspect you’d be good at the theoretical end of what I describe (mathematical modeling, or inferring generating mechanisms from higher-level descriptive models), then I also probably want to work with you, and we should likewise set up a meeting to talk.

If you do want to talk, use this page to set up a meeting with me.

The second, and less relevant, way you can help is via funding. Anyone can donate to this project via the corresponding Manifund project page, which closes on July 30th. That project page also gives a more detailed & concrete description of the project. Every bit counts, and if I don’t reach my minimum funding amount (about $20k), no funds will be deducted from your Manifund account, so you can repurpose that funding to other causes.

  1. ^

    Though the shard theory project is closer to a theory-experiment loop than the others here. They don’t yet have math to go along with the intuitions they’re presenting though.