This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano.

The goal of this post is to present my thoughts on some experiments that could be run now to shed light on the inner alignment problem. I’ve recently been doing a lot of thinking about inner alignment from a theoretical perspective, which I’m pretty excited about, but I think there’s also a lot of concrete experimental work that can be done in this space. That being said, this post is mostly just a brain dump; I expect a lot of additional work will be needed to take any of these proposals across the finish line.

If you’re interested in working on any of these proposals, feel free to just go ahead and take it on; you don’t need my permission to do so!^[1] If you’d like to talk to me about one of them (which I would love to do if you’re thinking of seriously working on one of these ideas), please do reach out, either in the comments here or by email at evanjhub@gmail.com.

Concrete proposals

Reward side-channels

Proposal: Train an RL agent with access to its previous step reward as part of its observation. Then, at test time, modify the observed reward. Measure to what extent the agent continues optimizing the original reward versus switches to optimizing the new observed reward. Compare to the situation where the agent is not given the reward at all and the situation where the agent is given the next step reward (for each action in a discrete action space) rather than the previous step reward. As an implementation detail, I think the RL agent should include either LSTMs or Transformers to enable it to keep track of its multi-step return if it wants to.

Motivation: One important question for understanding the likelihood of deceptive or corrigible alignment vs. robust alignment is to what extent models tend to learn their goals internally vs. via reference to things in their environment. This experiment directly attacks that question by asking whether, and to what extent, an RL agent will learn to optimize a reward signal in its environment. This is relevant both for understanding how to train corrigibility and for understanding how to avoid deceptive alignment.

Extensions: Add noise to the observed reward signal and/or try replacing the observed reward signal with some function of the reward instead, such as a randomly initialized neural network applied to it.
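
To make the setup concrete, here is a minimal sketch of how the side channel might be wired up, assuming a toy environment with a `reset()`/`step()` interface. Everything in it (`ChainEnv`, `RewardSideChannel`, `negate`, `noisy`, `rollout`) is a hypothetical name made up for illustration, not part of any existing library, and a real experiment would swap in a proper RL environment and a recurrent agent.

```python
import random
from typing import Callable, List, Tuple


class ChainEnv:
    """Hypothetical toy task: walk right along a 1-D chain to finish."""

    def __init__(self, length: int = 5):
        self.length = length
        self.pos = 0

    def reset(self) -> List[float]:
        self.pos = 0
        return [float(self.pos)]

    def step(self, action: int) -> Tuple[List[float], float, bool]:
        # Action 1 moves right (reward 1); any other action moves left (reward 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        reward = 1.0 if action == 1 else 0.0
        return [float(self.pos)], reward, self.pos >= self.length


class RewardSideChannel:
    """Appends the previous step's reward to every observation.

    `transform` changes only what the agent *sees*; step() still returns the
    true reward, so both can be logged at test time.
    """

    def __init__(self, env, transform: Callable[[float], float] = lambda r: r):
        self.env = env
        self.transform = transform

    def reset(self) -> List[float]:
        # No reward has been received yet, so the side channel reads 0.
        return self.env.reset() + [0.0]

    def step(self, action: int) -> Tuple[List[float], float, bool]:
        obs, reward, done = self.env.step(action)
        return obs + [self.transform(reward)], reward, done


def negate(reward: float) -> float:
    """Test-time modification: show the agent the opposite of what it earned."""
    return -reward


def noisy(sigma: float, seed: int = 0) -> Callable[[float], float]:
    """Extension: add Gaussian noise to the observed reward."""
    rng = random.Random(seed)
    return lambda reward: reward + rng.gauss(0.0, sigma)


def rollout(env: RewardSideChannel, policy, max_steps: int = 50):
    """Return (true_return, observed_return) for one episode."""
    obs = env.reset()
    true_return = observed_return = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        true_return += reward
        observed_return += obs[-1]
        if done:
            break
    return true_return, observed_return
```

Training would use the identity transform, and evaluation would swap in `negate` or `noisy(sigma)`. Comparing the two numbers returned by `rollout` is one rough way to see which signal the agent's behavior tracks: a high true return under `negate` suggests the original reward, and a high observed return suggests the side channel.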

Cross-episodic objectives

Proposal: Train an RL agent in an environment that contains a side-channel for boosting the agent’s reward in the next episode. Measure to what extent the agent takes advantage of it. Try different population-based training approaches. Measure again.

Motivation: Safety approaches like amplification and debate depend heavily on agents doing solely myopic optimization, whereas approaches like IRL and reward modeling depend on long-term forward-looking optimization. Thus, understanding the conditions under which agents exploit non-myopic reward side channels could be very useful for shedding some light on the best training techniques to use for different alignment approaches.

Extensions: Instead of getting a cross-episodic objective via population-based training, try just giving the model information that makes it obvious that this is possible such that even a naive planning algorithm would be able to deduce that it can get extra reward across episodes this way. As an implementation detail, I think you will need to use some sort of optimization-as-a-layer to make this work.
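
As an illustration of what such a side-channel could look like, here is a small sketch of an environment whose state deliberately persists across `reset()`. The class `CrossEpisodeEnv`, its "work"/"invest" actions, and the `run` helper are all assumptions invented for this example; the point is only that investing is strictly worse within an episode and strictly better across episodes.

```python
from typing import Callable, List, Tuple


class CrossEpisodeEnv:
    """Hypothetical fixed-length task with a cross-episode side channel.

    action 0 ("work"):   +1 reward immediately.
    action 1 ("invest"): 0 reward now, +bonus paid on the first step of the
                         *next* episode.
    """

    def __init__(self, horizon: int = 4, bonus: float = 2.0):
        self.horizon = horizon
        self.bonus = bonus
        self.carry = 0.0      # accumulates this episode, survives reset()
        self.pending = 0.0    # payout owed from the previous episode
        self.t = 0
        self.invest_count = 0
        self.total_steps = 0

    def reset(self) -> List[float]:
        self.pending, self.carry = self.carry, 0.0
        self.t = 0
        return [0.0]

    def step(self, action: int) -> Tuple[List[float], float, bool]:
        reward, self.pending = self.pending, 0.0
        if action == 1:
            self.carry += self.bonus
            self.invest_count += 1
        else:
            reward += 1.0
        self.t += 1
        self.total_steps += 1
        return [float(self.t)], reward, self.t >= self.horizon

    def invest_rate(self) -> float:
        """Fraction of all steps on which the side channel was used."""
        return self.invest_count / self.total_steps if self.total_steps else 0.0


def run(env: CrossEpisodeEnv, policy: Callable[[List[float]], int],
        episodes: int) -> List[float]:
    """Run several consecutive episodes and return each episode's return."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return returns
```

With the default settings, an always-work policy earns 4 per episode, while an always-invest policy earns 0 in the first episode and 8 in each later one. `invest_rate()` is one candidate for the "to what extent" measurement, to be compared across training setups.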

Objective unidentifiability

Proposal: Train an RL agent in an environment with multiple simple, obvious objectives, each of which would explain the true reward equally well. Test in environments that distinguish between the different possible objectives and determine the situations under which models tend towards some objectives versus others. In particular, look for situations in which the model learns some proxy which it pursues competently off-distribution but which has poor off-distribution performance on the true reward. As an implementation detail, I expect you might need to use some sort of optimization-as-a-layer for this to work.

Motivation: Knowing what sorts of proxies models tend to use in different situations would enable us to have a better understanding of the most likely forms of pseudo-alignment and thus the best ways to push models towards robust alignment instead. Furthermore, being able to clearly demonstrate proxy alignment could help in making the case for inner alignment being a real issue.

Extensions: Try changing inductive biases and/or model capacity and see if it changes the types of proxies that the model uses.
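
A rough sketch of the train/test split this proposal needs, using a deliberately tiny setting: a corridor in which the rewarded cell is both green and rightmost during training, and the two features come apart at test time. The names `Cell`, `make_layout`, `diagnose`, `seek_green`, and `seek_right` are hypothetical, and the hand-written policies only stand in for the final choice of a trained agent.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Cell:
    position: int   # index along a 1-D corridor
    is_green: bool


def make_layout(rng: random.Random, size: int, coincide: bool) -> List[Cell]:
    """Build a corridor containing exactly one green cell.

    coincide=True  (training): the green cell is also the rightmost cell.
    coincide=False (test):     the green cell is anywhere except rightmost.
    """
    green_at = size - 1 if coincide else rng.randrange(size - 1)
    return [Cell(i, i == green_at) for i in range(size)]


def diagnose(policy: Callable[[List[Cell]], int],
             layouts: List[List[Cell]]) -> Dict[str, int]:
    """Count which candidate objective each of the policy's choices satisfies."""
    counts = {"green": 0, "rightmost": 0, "other": 0}
    for layout in layouts:
        chosen = layout[policy(layout)]
        if chosen.is_green:
            counts["green"] += 1
        elif chosen.position == len(layout) - 1:
            counts["rightmost"] += 1
        else:
            counts["other"] += 1
    return counts


# Stand-ins for trained agents: each returns the index of the cell it ends on.
def seek_green(layout: List[Cell]) -> int:
    return next(c.position for c in layout if c.is_green)


def seek_right(layout: List[Cell]) -> int:
    return len(layout) - 1
```

On training layouts the two stand-in policies are indistinguishable, while on test layouts `diagnose` separates them completely. A trained agent would presumably land somewhere in between, and the "other" bucket gives a crude check on whether it is still acting competently at all.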

Zero-shot objectives

Proposal: Set up a system that enables a language model to take actions in an environment to optimize some reward. Do IRL on the resulting behavior and inspect the objective that results. Find some way to measure how coherent it is. Compare to an RL agent trained directly on the reward. As an implementation detail, you should focus on environments where a language model might actually be able to perform well. (Edit: I think my explanation here was a bit too general, so here’s an example of what I mean. Take a language model and see whether you can get it to perform tasks that traditionally require RL/search/optimization. For example, can GPT-2 play chess? What about a new game, not in the training data, whose rules you give it? If these sorts of things work, then that seems to imply that your model must have learned some sort of optimization algorithm, if only some sort of minimax algorithm for searching for the best moves in a game.)

Motivation: It is possible that the best model that maximizes predictive accuracy in a language context might be—at least to some extent—goal-directed. Though current language models are likely not advanced enough to show true goal-directed behavior, it might at least be possible to see the beginnings of coherent objectives.

Extensions: Experiment with different ways to encode RL observations and actions as text and see how consistent the resulting IRL objectives are.
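
One possible shape for the "system that enables a language model to take actions" is sketched below, assuming a small grid task. The language model is represented only by a `complete` callable that maps a prompt string to a completion string; that interface, the prompt wording, and the names `encode_observation`, `parse_action`, and `play_episode` are all assumptions for illustration. The recorded (state, action) pairs are what an IRL method would then consume.

```python
from typing import Callable, List, Optional, Tuple

Position = Tuple[int, int]
Step = Tuple[Position, str]

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def encode_observation(agent: Position, goal: Position,
                       history: List[Step]) -> str:
    """One of many possible text encodings of the RL observation."""
    lines = ["You are in a grid. Moves: up, down, left, right."]
    for past_position, past_action in history:
        lines.append(f"At {past_position} you moved {past_action}.")
    lines.append(f"You are at {agent}. The goal is at {goal}.")
    lines.append("Your move:")
    return "\n".join(lines)


def parse_action(completion: str) -> Optional[str]:
    """Return the first recognised move word in the completion, else None."""
    for word in completion.lower().split():
        word = word.strip(".,!?:;\"'")
        if word in MOVES:
            return word
    return None


def play_episode(complete: Callable[[str], str], start: Position,
                 goal: Position, max_steps: int = 10):
    """Let the model act; return its (state, action) trajectory and success."""
    agent, history = start, []
    for _ in range(max_steps):
        if agent == goal:
            break
        action = parse_action(complete(encode_observation(agent, goal, history)))
        if action is None:
            break  # unparseable completion: end the episode early
        history.append((agent, action))
        dx, dy = MOVES[action]
        agent = (agent[0] + dx, agent[1] + dy)
    return history, agent == goal
```

Swapping in a different `encode_observation` while keeping the rest fixed is one way to run the encoding comparison from the extension above.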

Robust reward learning

Proposal: Train a reward-predictive model-based RL agent (e.g. an Imagination-based Planner) and compare the resulting objective to the actual reward. Train again with adversarial training on inputs that produce maximally differing reward estimates and compare again.

Motivation: This would be useful for testing the ability of adversarial training to resolve reward unidentifiability, and it would provide information on the extent to which adversarially training the reward model to be aligned produces aligned actions.

Extensions: Try relaxing the adversary to output pseudo-inputs (e.g. activations), rather than actual inputs, that would lead to maximally differing assessments, and backpropagate on those instead.
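
The adversarial step can be sketched in a few lines under some strong simplifying assumptions: "differing reward estimates" is read here as disagreement between several learned reward models, the models are linear, the input space is the box [-1, 1]^dim, and the adversary is plain hill climbing rather than a trained network. None of this is prescribed by the proposal, and the names `linear_model`, `disagreement`, and `find_adversarial_input` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

RewardFn = Callable[[Sequence[float]], float]


def linear_model(weights: Sequence[float]) -> RewardFn:
    """Stand-in for a learned reward model: a fixed linear function."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


def disagreement(models: Sequence[RewardFn], x: Sequence[float]) -> float:
    """Spread between the highest and lowest reward estimate at input x."""
    estimates = [model(x) for model in models]
    return max(estimates) - min(estimates)


def find_adversarial_input(models: Sequence[RewardFn], dim: int,
                           rng: random.Random, steps: int = 200,
                           step_size: float = 0.1) -> Tuple[List[float], float]:
    """Hill-climb inside [-1, 1]^dim toward an input of high disagreement."""
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    best = disagreement(models, x)
    for _ in range(steps):
        # Perturb the current input and clip it back into the allowed box.
        candidate = [min(1.0, max(-1.0, xi + rng.gauss(0.0, step_size)))
                     for xi in x]
        score = disagreement(models, candidate)
        if score > best:
            x, best = candidate, score
    return x, best
```

In a full loop, each input found this way would be labeled with the actual reward, added to the reward model's training data, and the search repeated. The comparison in the proposal is then between the learned objective before and after some number of such rounds.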

Other thoughts

This section is just a list of all the different possible testable hypotheses I came up with when thinking about this—including those that made it into the above list of concrete proposals and those that didn’t—for the purpose of helping others generate even more possible experiments in this space.

Inference-time search

To what extent can you learn complex inference-time search algorithms? Could you structure an architecture that would learn MCTS organically? How artificial does such an architecture have to be? How do factors like environment diversity, inductive biases, algorithmic range, statefulness, and hard-coded optimization impact this?
Does modeling search algorithms produce search algorithms? If an RL agent is in an environment with another agent that is doing some hard-coded search and it has to model that agent, does it learn to implement search itself?
What happens if you let a model use a variable, rather than constant, amount of computation? Can it learn search better?
Can you learn optimization/search algorithms even in non-RL settings? Can you get a purely predictive model to perform well zero-shot on an RL task?

Reward unidentifiability

How do RL agents behave in environments with unidentifiable goals? What sorts of proxies tend to be favored over others? How do different inductive biases affect this? What about adversarial training? Can it be empirically demonstrated that most learned objectives are pseudo-aligned?
Can you definitively distinguish between capability generalization and objective generalization? What if you train a planner and then test it off-distribution?
Can you distinguish between the different forms of proxy alignment? Can you get an instrumentally pseudo-aligned model? Can you replicate the cleaning robot examples from Risks from Learned Optimization in gridworlds?
Can you produce approximate alignment if you constrain model capacity?
What about suboptimality alignment? Can you create an environment with simple and complex strategies with different alignment properties and get a model to switch between them at test time?
How does using a learned reward model vs. learned policy affect unidentifiability concerns? Can you distinguish between and disincentivize misaligned objectives if you have access to a learned reward model? What about if you just have a Q function?
Can you build a meta-learner that optimizes for robustness? What about one that minimizes time complexity? How does meta-learning help with building robust models?
Can you do adversarial training with pseudo-inputs or other relaxations?

Modeling of the training process

What happens if you do RL where you give the agent access to the reward signal through a side channel in its observation? How does it affect robustness? What if it has to work to get the leaked info? What if it’s noisy? How does using architectures with more planning affect this? How do inductive biases affect this?
Can you get a system to model its own training process? Just the objective? Can you apply sufficient inductive biases with a complex enough objective that it has to use information in the environment to figure it out rather than learn it directly?
Can you get a model to defect off-distribution in a way that involves no defection on-distribution?
To what extent do models care about their performance across episodes? If there exists a side-channel which only increases next-episode performance, under what circumstances will a model exploit such a thing?

[1] Though please do say that you got the idea from here and let me know about any results that you get.

Concrete experiments in inner alignment