Post-mortem’ing my earliest ML research paper, 7 years later
Written quickly for the Inkhaven Residency.
One of the things I like most about LessWrong yearly reviews is that they occur a full year after the fact: for example, the reviews for posts written in 2019 happen at the end of 2020; we just reviewed the posts for 2024. As the LessWrong team writes:
LessWrong has the goal of making intellectual progress on important problems. To make progress, you gotta examine your community’s outputs not only when they’re first published, but also once enough time has passed to see whether they continued to provide value after initial hype fades and flaws have had time to surface.
In contrast, most of the research post-mortems I’ve seen happen right after the paper is completed.[1] This means it’s easy to focus too much on a paper’s immediate reception or on specific project management and execution issues, rather than on the higher-level judgments that went into picking the overall research direction or general approach.
So today, I decided to do a very belated public post-mortem of my earliest published machine learning research paper “The Assistive Multi-Armed Bandit”, which studied a toy version of assistance games/Cooperative Inverse Reinforcement Learning (CIRL), where the human didn’t know what their own reward function was, but had to figure it out by experiencing it.
As this post turned out to be rather long, today I’ll start by providing the context for my paper as well as the timeline of the project, from when I first became involved to when the paper was put on arXiv. I’ve also included some thoughts on where the paper stands today. I’ll summarize and review the contents of the actual paper tomorrow.
Context
Let’s start with a brief overview of the context for the paper.
In the early to mid 2010s, one of the techniques that people expected to use for AI alignment was inverse reinforcement learning (IRL).[2] IRL algorithms took observations of human behavior and inferred a reward function that rationalized[3] the behavior.[4] At the time, the most impressive AI systems were reinforcement learning models such as AlphaGo, which were too dumb to coherently predict human preferences. (Note that GPT-2 was announced in February 2019, a full month after the Assistive Bandits paper was put on arXiv.) So a very naive sketch of an alignment strategy that people thought about was the following:[5] take a large database of human behavior (e.g. the internet), perform IRL on it to recover a reward function representing human values, and then use RL to train an AI that maximizes said reward function.[6] Naturally, people recognized that there were many problems with this purported alignment strategy.
To address some of these problems, in 2016, Dylan Hadfield-Menell et al. published “Cooperative Inverse Reinforcement Learning”, which introduced CIRL/assistance games.[7] In the standard CIRL setup, there’s a human and a robot[8] acting at the same time in the environment, jointly trying to optimize the human’s reward function (which is known only to the human). In contrast to the standard IRL setup, CIRL had some notable differences:
Instead of having a static library of human behavior, the human is actively participating in the process, e.g. by acting informatively rather than optimally.[9]
Similarly, the robot is incentivized to take actions that would allow the human to inform the robot.
The goal wasn’t to learn the correct reward (and then learn a separate policy that optimizes it), but rather a policy that properly assisted the human.
In the summer of 2017, I started a research internship at UC Berkeley’s CHAI, where Dylan was a PhD student.
The specific problem we wanted to address during that internship was the following: in reality, people do not know their reward function, and it is very weird that CIRL assumes that a human knows their reward function and can act optimally. Instead, we wanted to study a case where the robot was interacting with a human who was themselves still learning about their own reward. For example, it can be hard to predict ahead of time if you’d like classically divisive foods such as durian, cilantro, or bell peppers, without trying them.[10] Like with the examples of divisive food, we wanted to create and study a formalism where the human only had direct access to their reward by experiencing it themselves. This was the project that would become the Assistive Bandits paper.
Project Timeline
I first became involved in the project in June of 2017, after my CHAI internship started.
Based on my recollection, Dylan was the one who conceived of the project and suggested it to me. After I spoke with Dylan a few times, I wrote up a draft with a modified CIRL formalism where the human was learning their own reward function, and also fit a few toy examples into this formalism. We called it “cooperative reinforcement learning” (CRL) (since the human and the robot were cooperating to solve an RL problem). This took less than a week.
The next step of the project was to do experiments, to actually get a better feel for how the setup worked. It was pretty easy to implement a few reinforcement learning policies to simulate a human. The immediate hard part I ran into was actually figuring out how to solve the setup to find the optimal robot policy.
To understand why this was hard, first note that from the robot’s perspective, there’s rather a lot of hidden information. While in CIRL there was a single hidden variable – the human’s reward parameters – which was generally chosen to be low dimensional, in CRL the hidden information was “the belief state of the human based on their observation history”, which was comparatively much higher dimensional. At the time, people tried to solve these problems with more traditional methods such as POMDP solvers, which tracked an explicit belief state over the hidden information and scaled very poorly with higher dimensional belief states. For CIRL, the robot’s belief state was simply a probability distribution over the human’s reward parameters, but in CRL, the robot’s belief state was a probability distribution over the human’s probability distribution over the reward parameters. So while Dylan could solve the CIRL toy examples in his paper explicitly, we’d have to take a different approach.
The approach that we actually settled on is pretty standard nowadays but was novel at the time: set up the whole problem in simulation, then throw compute at it by using a deep reinforcement learning algorithm to train a policy represented by a recurrent neural network. That is, instead of manually tracking the robot’s belief state over the human’s belief state over the reward function, we’d end up with a recurrent neural network policy that implicitly tracked this state in order to take optimal actions.
The next month saw me make minimal progress on the project due to a combination of spending time upskilling (for example, I blitzed through the UCB Deep RL course curriculum) and a bunch of other distractions (for example, I went to a CFAR workshop and also binged Hikaru no Go). I started working on the project again in earnest in mid-July.
By the end of my internship in August 2017, I had managed to implement PPO and get it to work on what I considered a fairly simple setup: the environment was a multi-armed bandit problem and the “human” was one of five simple reinforcement learning policies. Using PPO, we could find recurrent neural network policies that worked in this environment.
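For concreteness, here’s a minimal sketch of this kind of setup. The details below – the turn structure, the Gaussian arm rewards, and the epsilon-greedy simulated human – are my illustrative assumptions for this post, not the paper’s actual implementation:

```python
import random

class EpsilonGreedyHuman:
    """Stand-in simulated 'human': an epsilon-greedy bandit learner that only
    learns about its own reward by experiencing it (a hypothetical choice;
    the paper used several different simulated human policies)."""
    def __init__(self, n_arms, epsilon=0.1):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.epsilon = epsilon

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def observe(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the empirical mean reward for this arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def random_robot(history, n_arms):
    """Placeholder robot policy; in the actual project this role was played
    by a recurrent policy trained with PPO on the interaction history."""
    return random.randrange(n_arms)

def run_episode(true_means, horizon=50, seed=0):
    """Alternate human and robot pulls; score both against the true reward."""
    random.seed(seed)
    n_arms = len(true_means)
    human = EpsilonGreedyHuman(n_arms)
    history, total = [], 0.0
    for t in range(horizon):
        human_turn = (t % 2 == 0)
        arm = human.act() if human_turn else random_robot(history, n_arms)
        reward = random.gauss(true_means[arm], 1.0)
        human.observe(arm, reward)  # only the human experiences the reward
        history.append((human_turn, arm))
        total += true_means[arm]
    return total
```

The point of the formalism is visible even in this toy: the robot never sees the reward directly, so a good robot policy must infer it from the human’s (noisy, still-learning) behavior.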
After that, I went back to doing my undergraduate degree and applying to graduate schools, and had far less time to work on the project. That being said, during the semester I managed to implement a standard POMDP solver and ran it for hours to confirm the optimality of the PPO-found policies on some of the environments. I also implemented more environments to study CRL in, though these were fairly arbitrary small gridworlds that were still quite toy.
Near the end of the semester in November, we decided to submit the paper to ICML. I began drafting the paper then (starting from my early formalism draft in June), though the majority of the work happened over winter break after my finals. Notably, many of the experiments for the non-bandit environments were not done until quite close to the deadline in January 2018.
The paper was submitted to ICML, but ended up being very rushed, in part because I left a lot of the writing for the 2-3 days before submission, and in part due to some fun misadventures with US immigration authorities the day before the paper deadline. It didn’t get in.
One of the concerns the reviewers cited was scope: despite the generality of the framework, most of my experiments were on the multi-armed bandit setting. Another was applicability: you’ve created this framework and some toy experiments that you’ve solved with PPO, but what are the actual takeaways? Are there any theorems backing them up? Finally, many of the reviewers were confused and thought that we were using PPO to solve for the human’s policy.
After getting rejected from ICML in April of 2018, I didn’t get around to really working on the paper until I graduated from my undergrad and became a graduate student at UC Berkeley.
I started work on the project in earnest in August of 2018. First, we decided to cut back the scope and reframe the project: instead of “cooperative reinforcement learning”, we’d focus on the bandit environments and retitle it the “assistive multi-armed bandit”. Second, I decided to write up some of the theory behind the formalism, which was a lot easier due to the scope reduction.[11]
Unfortunately, the work was not done by the ICLR 2019 deadline, and so we submitted the paper to HRI 2019, a smaller conference that the two professors on the paper were fans of.
The paper got decent (but not stellar) reviews, and was accepted for publication there. After it was accepted, I put the paper on arXiv in January of 2019.
For what it’s worth, I think the 19-month timeline here is pretty typical for undergraduate-first-authored summer-internship papers that don’t get written up during or immediately after the internship. Worth a separate post someday on whether that pattern is a bug or a feature.
What’s happened since?
To a large extent, I think no one in AI safety has really built upon the work. I’ll say more about why I think this happened tomorrow, but the main reason is that assistance-game-style formalisms never really saw much traction. Indeed, by 2022, “train a reward model on human preferences and RL a neural network against it” had become the default in AI, but in a much cruder form than CIRL or the assistive bandits paper proposed: there was no jointly acting human alongside the AI, and no explicit consideration of the human’s uncertainty about their preferences. I think in the end, assistance-game-style formalisms were just too hard to directly apply to large models and complicated settings in practice. And from a theoretical or conceptual perspective, I think the formalisms were just not very fruitful; people have largely abandoned the study of reward learning formalisms for a reason.
Insofar as this paper had an impact, it was mainly on me. There’s the mundane sense in that it was probably quite helpful for getting into graduate school. There’s also the generic sense in that I learned a lot from this experience, both on the object level (e.g. how to implement and debug PPO, multi-armed bandit theory) and also in terms of how to do and communicate research (e.g. the figure styling of my publications at METR has clear visual similarities to the styling used in the assistive bandits paper).
Specifically, every post-mortem that I can think of has occurred after the camera-ready version of the paper was submitted for conference publication (for more academic papers), or a few days after the research results were released publicly (for my work at METR).
Also, this is not to say that people aren’t still studying inverse reinforcement learning-like techniques from an alignment angle. For example, Joar Skalse has recently completed a PhD in this field, and has even written a sequence outlining an agenda for it.
In general, it’s not possible to infer reward from behavior without additional assumptions. The fundamental issue is that there are too many degrees of freedom: for any behavior people exhibit, you can explain it as a rationally pursued preference (the behavior is optimal for the reward), as purely a bias (people would always exhibit this behavior regardless of what they wanted), or some mixture of the two (for example, the behavior is good for the reward, but not optimal, because the people involved are predictably irrational).
The original IRL papers assumed that observed behavior was perfectly rational: at each state, the person picks the action that leads to the highest expected discounted sum of rewards (that is, has the highest state-action value). Unfortunately, this was untenable because 1) people are not, in general, fully rational and 2) the reward space is generally misspecified and almost certainly does not contain the true preferences of the human.
The standard assumption for IRL is thus that people’s behavior is “Boltzmann-rational” or “soft-optimal”: the probability that they pick an action is proportional to the exponential of that action’s state-action value. Hence, IRL rationalizes behavior by finding a reward function that would cause a near-optimal agent to emulate that behavior.
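Concretely, the Boltzmann-rational model is a softmax over state-action values. A minimal sketch, where the Q-values are simply a supplied list and the rationality coefficient beta is a modeling choice:

```python
import math

def boltzmann_policy(q_values, beta=1.0):
    """P(a) proportional to exp(beta * Q(s, a)). As beta grows, this
    approaches the perfectly rational model; at beta = 0 it is uniform."""
    # Subtract the max Q-value for numerical stability before exponentiating.
    m = max(q_values)
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

IRL under this model then amounts to searching for a reward function whose induced Q-values make the observed actions likely.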
Also, to answer the question my past self had about this back in 2017: the way that cognitive science/experimental psychology get around this underspecification problem is by finding or creating situations where people’s biases or their preferences are assumed to be known. E.g. they may assume that the undergrad participants in their study are trying to maximize the expected number of dollars received, or that when given enough time to reflect and think, their behavior reflects their “true” preference.
As an amusing aside, inverse reinforcement learning is not the task of inverting reinforcement learning! In an RL problem, you’re given the reward function R (or access to the reward r) and the environment M (or access to transitions in the environment t) and need to find pi*, the optimal policy. In IRL, you’re given pi* and M and need to find R.
If anything, the problem studied in the Assistive Bandits paper is closer to inverting reinforcement learning, since it involves (implicitly) inferring reward by looking at the behavior of a reinforcement learning algorithm.
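The contrast in the aside above can be made concrete on a trivial one-state bandit (an illustrative toy of my own, not any standard library’s API):

```python
def rl(arm_rewards):
    """Forward RL on a one-state bandit: given the reward function
    (one value per arm), return the optimal policy, i.e. the best arm."""
    return max(range(len(arm_rewards)), key=lambda a: arm_rewards[a])

def irl(n_arms, optimal_arm):
    """Inverse RL on the same bandit: given the optimal policy's choice,
    return *a* reward function consistent with it. Note the answer is not
    unique: any reward ranking optimal_arm first rationalizes the behavior."""
    return [1.0 if a == optimal_arm else 0.0 for a in range(n_arms)]
```

The non-uniqueness in irl is the same underdetermination discussed above, just in miniature: the direction of inference flips, and with it the problem goes from having one answer to having many.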
This was known to be naive at the time, and people were pointing out the problems with this formulation. But I think it was not an unreasonable guess back then, and some of the problems surfaced by examining this naive strategy did turn out to be real issues when it came to aligning LLMs.
And to be fair, this was not that far off from the closely related naive alignment strategy of “just ask the models to predict (trained on the internet) what humans want, and then finetune a base model to maximize the predicted reward” that people had back in 2022, when LLMs first started to be able to competently mimic human text.
Some people (notably older academics) assumed that we would just feed this learned reward into a (perhaps good old-fashioned) AI that could just maximize an arbitrary reward function.
That being said, as there wasn’t much understanding (in academia, but also in general) of how RL algorithms may lead AI systems to optimize a different reward than the one they were trained on, most of the differences between the problems surfaced in this model and the problems surfaced with the RL-trained AI model were considered capability limitations and were thought to be of relatively little importance in the limit of AI capabilities. (The problem of inner (mis-)alignment was still referred to as “optimization daemons” as late as 2018, and Evan Hubinger’s landmark writeup on it didn’t get published until 2019.)
For a full discussion of the differences between assistance games and inverse reinforcement learning, and what problems CIRL was trying to solve, see Shah et al. 2020’s “Benefits of Assistance over Reward Learning”.
“Robot” was used instead of “AI” for two reasons: 1) some of the people who worked on CIRL were roboticists (e.g. all three of my coauthors on the assistive bandits paper) and 2) some academics were still pretty allergic to the term “AI” back in the mid 2010s.
In this case, “informative behavior” is behavior that makes it easier to rationalize the correct reward, which is in general not the optimal policy. See Dragan et al. 2013’s “Legibility and Predictability of Robot Motion” for examples and further discussion.
This example was chosen to be informative rather than optimal (i.e. most relevant for AI alignment). One of the more alignment-relevant motivating examples I had at the time is a person not knowing whether or not a course of action was good until they deliberated about it more and became less confused, e.g. in classic philosophical questions such as the repugnant conclusion.
I’m personally quite ambivalent about this reframing nowadays. It got the paper published, and “Assistive Multi-Armed Bandit” is in many ways a snappier title than “cooperative reinforcement learning.” But the bandit setting also cut away most of what made the original formalism interesting, and left out a lot of the insights we had about the general problem of reward learning when looking at behavior of agents that had only experiential access to their own reward functions. I’ll talk more about this in part 2.