[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions

Rohin Shah, 8 Jul 2021 17:20 UTC

LW: 21 AF: 14

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

BASALT: A Benchmark for Learning from Human Feedback (Rohin Shah et al) (summarized by Rohin): A typical argument for AI risk, given in Human Compatible (AN #69), is that current AI systems treat their specifications as definite and certain, even though they are typically misspecified. This state of affairs can lead to the agent pursuing instrumental subgoals (AN #107). To solve this, we might instead build AI systems that continually learn the objective from human feedback. This post and paper (on which I am an author) present the MineRL BASALT competition, which aims to promote research on algorithms that learn from human feedback.

BASALT aims to provide a benchmark with tasks that are realistic in the sense that (a) it is challenging to write a reward function for them and (b) there are many other potential goals that the AI system “could have” pursued in the environment. Criterion (a) implies that we can’t have automated evaluation of agents (otherwise that could be turned into a reward function) and so suggests that we use human evaluation of agents as our ground truth. Criterion (b) suggests choosing a very “open world” environment; the authors chose Minecraft for this purpose. They provide task descriptions such as “create a waterfall and take a scenic picture of it”; it is then up to researchers to create agents that solve this task using any method they want. Human evaluators then compare two agents against each other and determine which is better. Agents are then given a score using the TrueSkill system.
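
To make the scoring step concrete, here is a minimal sketch of turning pairwise human judgments into per-agent scores. It uses plain Elo-style updates as a simpler stand-in for TrueSkill (it is not TrueSkill, and not the competition's evaluation code); the agent names, the judgments, and the constants k and base are all hypothetical.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Turn pairwise judgments into scores with Elo-style updates.

    comparisons: iterable of (winner, loser) agent names (hypothetical data).
    This is a simplified stand-in for TrueSkill, not TrueSkill itself.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Probability that the winner was expected to win under current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # Shift both ratings by how surprising the human judgment was.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Each tuple is one human evaluator's verdict: (better agent, worse agent).
judgments = [("agent_a", "agent_b"), ("agent_a", "agent_c"), ("agent_b", "agent_c")]
scores = elo_ratings(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Unlike this sketch, TrueSkill also tracks an uncertainty for each rating, which matters when each agent is only compared a limited number of times.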

The authors provide a number of reasons to prefer the BASALT benchmark over more traditional benchmarks like Atari or MuJoCo:

1. In Atari or MuJoCo, there are often only a few reasonable goals: for example, in Pong, you either hit the ball back, or you die. If you’re testing algorithms that are meant to learn what the goal is, you want an environment where there could be many possible goals, as is the case in Minecraft.

2. There are lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.

3. The “true reward function” in Atari or MuJoCo is often not a great evaluation: for example, a Hopper policy trained to stand still using a constant reward gets 1000 reward! Human evaluations should not be subject to the same problem.

4. Since the tasks were chosen to be inherently fuzzy and challenging to formalize, researchers are allowed to take whatever approach they want to solving the task, including “try to write down a reward function”. In contrast, for something like Atari or MuJoCo, you need to ban such strategies. The only restriction is that researchers cannot extract additional state information from the Minecraft simulator.

5. Just as we’ve overestimated few-shot learning capabilities (AN #152) by tuning prompts on large datasets of examples, we might also be overestimating the performance of algorithms that learn from human feedback because we tune hyperparameters on the true reward function. Since BASALT doesn’t have a true reward function, this is much harder to do.

6. Since Minecraft is so popular, it is easy to hire Minecraft experts, allowing us to design algorithms that rely on expert time instead of just end user time.

7. Unlike Atari or MuJoCo, BASALT has a clear path to scaling up: the tasks can be made more and more challenging. In the long run, we could aim to deploy agents on public multiplayer Minecraft servers that follow instructions or assist with whatever large-scale project players are working on, all while adhering to the norms and customs of that server.
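
The Hopper example in point 3 comes down to simple arithmetic. The sketch below assumes a toy reward of a per-step alive bonus plus forward progress, and a 1000-step episode cap; these constants and the two trajectories are illustrative assumptions, not the actual MuJoCo reward function (which has additional terms).

```python
HORIZON = 1000      # assumed maximum episode length
ALIVE_BONUS = 1.0   # assumed reward per step for not falling over


def episode_return(step_rewards):
    """Undiscounted return: the sum of per-step rewards."""
    return sum(step_rewards)


# Policy 1: never moves, never falls, so it collects the bonus every step.
standing_still = [ALIVE_BONUS + 0.0 for _ in range(HORIZON)]

# Policy 2 (hypothetical): hops forward a little each step, then falls
# after 50 steps, which ends the episode early.
hops_then_falls = [ALIVE_BONUS + 0.5 for _ in range(50)]

print(episode_return(standing_still))   # 1000.0
print(episode_return(hops_then_falls))  # 75.0
```

Under these assumptions the policy that does nothing outscores one that actually tries to hop, which is why the environment's reward is a questionable ground truth.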

Rohin’s opinion: You won’t be surprised to hear that I’m excited about this benchmark, given that I worked on it. While we listed a bunch of concrete advantages in the post above, I think many (though not all) of the advantages come from the fact that we are trying to mimic the situation we face in the real world as closely as possible, so there’s less opportunity for Goodhart’s Law to strike. For example, later in this newsletter we’ll see that synthetically generated demos are not a good proxy for human demos. Even though this is the norm for existing benchmarks, and we didn’t intentionally try to avoid this problem, BASALT (mostly) avoids it. With BASALT you would have to go pretty far out of your way to get synthetically generated demos, because by design the tasks are hard to complete synthetically, and so you have to use human demos.

I’d encourage readers to participate in the competition, because I think it’s especially good as a way to get started with ML research. It’s a new benchmark, so there’s a lot of low-hanging fruit in applying existing ideas to the benchmark, and in identifying new problems not present in previous benchmarks and designing solutions to them. It’s also pretty easy to get started: the BC baseline is fairly straightforward and takes a couple of hours to be trained on a single GPU. (That’s partly because BC doesn’t require environment samples; something like GAIL (AN #17) would probably take a day or two to train instead.)
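
For readers new to behavioral cloning (BC), here is a minimal tabular sketch of the idea. It is not the competition baseline (which trains a neural network on pixels); the observation and action names are hypothetical. It shows why BC needs no environment samples: the policy is fit purely from recorded (observation, action) pairs.

```python
from collections import Counter, defaultdict


def behavioral_cloning(demos):
    """Fit a tabular policy by supervised learning on demonstrations.

    For each observation, pick the action the demonstrator took most often.
    Only the demo data is used; the environment is never queried.
    """
    counts = defaultdict(Counter)
    for obs, action in demos:
        counts[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


# Hypothetical human demonstration data: (observation, action) pairs.
demos = [
    ("near_water", "place_bucket"),
    ("near_water", "place_bucket"),
    ("near_water", "noop"),
    ("on_cliff", "look_down"),
]
policy = behavioral_cloning(demos)
print(policy)
```

A method like GAIL, by contrast, has to roll out the current policy in the environment during training, which is one reason it would be slower.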

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

What Matters for Adversarial Imitation Learning? (Manu Orsini, Anton Raichuk, Léonard Hussenot et al) (summarized by Rohin): This paper takes adversarial imitation learning algorithms (think GAIL (AN #17) and AIRL (AN #17)) and tests the effect of various hyperparameters, including the loss function, the discriminator regularization scheme, the discriminator learning rate, etc. They first run a large, shallow hyperparameter sweep to identify reasonable ranges of values for the various hyperparameters, and then run a larger hyperparameter sweep within these ranges to get a lot of data that they can then analyze. All the experiments are done on two continuous control benchmarks: the MuJoCo environments in OpenAI Gym and manipulation environments from Adroit.

Obviously they have a lot of findings, and if you spend time working with adversarial imitation learning algorithms, I’d recommend reading through the full paper, but the ones they highlight are:

1. Even though some papers have proposed regularization techniques that are specific to imitation learning, standard supervised learning techniques like dropout work just as well.

2. There are significant differences in the results when using synthetic demonstrations vs. human demonstrations. (A synthetic demonstration is one provided by an RL agent trained on the true reward.) For example, the optimal choice of loss function is different for synthetic demos vs. human demos. Qualitatively, human demonstrations are not Markovian and are often multimodal (especially when the human waits and thinks for some time: in this case one mode is “noop” and the other mode is the desired action).
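
One way to see why multimodality can matter (this is an illustration, not a result from the paper): if an imitator fits a single average action to data with a “noop” mode and a “desired action” mode, it can land on an action the human never took. The action encoding below is a hypothetical one-dimensional example.

```python
from collections import Counter
from statistics import mean

# Hypothetical 1-D actions at similar states: 0.0 = noop (human is thinking),
# 1.0 = the desired action.
human_actions = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

# A unimodal fit (e.g. minimizing squared error) predicts the mean action.
mean_action = mean(human_actions)

# The data itself only ever contains the two modes.
modes = Counter(human_actions)

print(mean_action)   # 0.5
print(sorted(modes)) # [0.0, 1.0]
```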

Rohin’s opinion: I really like this sort of empirical analysis: it seems incredibly useful for understanding what does and doesn’t work.

Note that I haven’t looked deeply into their results and analysis, and am instead reporting what they said on faith. (With most papers I at least look through the experiments to see if the graphs tell a different story or if there were some unusual choices not mentioned in the introduction, but that was a bit of a daunting task for this paper, given how many experiments and graphs it had.)

Prompting: Better Ways of Using Language Models for NLP Tasks (Tianyu Gao) (summarized by Rohin): Since the publication of GPT-3 (AN #102), many papers have been written about how to select the best prompt for large language models to have them solve particular tasks of interest. This post gives an overview of this literature. The papers can be roughly divided into two approaches: first, we have discrete prompts, where you search for a sequence of words that forms an effective prompt; these are “discrete” since words are discrete. Second, we have soft prompts, where you search within the space of embeddings of words for an embedding that forms an effective prompt; since embeddings are vectors of real numbers they are continuous (or “soft”) and can be optimized through gradient descent (unlike discrete prompts).
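
The distinction can be sketched with a toy example. Everything here is a made-up stand-in: the frozen “model” is a single dot product, the vocabulary has three words with hypothetical embeddings, and the loss is squared error against an arbitrary target. The discrete prompt is found by search over words; the soft prompt is found by gradient descent over the embedding itself.

```python
def score(weights, prompt):
    """Frozen toy 'model': a dot product between weights and the prompt embedding."""
    return sum(w * p for w, p in zip(weights, prompt))


def loss(weights, prompt, target):
    return (score(weights, prompt) - target) ** 2


WEIGHTS = [0.5, -1.0, 2.0]  # hypothetical frozen model parameters
TARGET = 3.0                # hypothetical desired model output
VOCAB = {                   # hypothetical word embeddings
    "alpha": [1.0, 0.0, 0.0],
    "beta": [0.0, 1.0, 0.0],
    "gamma": [0.0, 0.0, 1.0],
}

# Discrete prompt: search over actual words, since words cannot be differentiated.
best_word = min(VOCAB, key=lambda w: loss(WEIGHTS, VOCAB[w], TARGET))
discrete_loss = loss(WEIGHTS, VOCAB[best_word], TARGET)

# Soft prompt: start from the best word's embedding and follow the gradient.
soft_prompt = list(VOCAB[best_word])
learning_rate = 0.05
for _ in range(200):
    error = score(WEIGHTS, soft_prompt) - TARGET
    # d(loss)/d(prompt_i) = 2 * error * weight_i; the model weights stay frozen.
    soft_prompt = [p - learning_rate * 2 * error * w
                   for p, w in zip(soft_prompt, WEIGHTS)]
soft_loss = loss(WEIGHTS, soft_prompt, TARGET)

print(best_word, discrete_loss, soft_loss)
```

The soft prompt reaches a lower loss here because it is not restricted to embeddings of real words, at the cost of no longer being readable as text.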

Interactive Explanations: Diagnosis and Repair of Reinforcement Learning Based Agent Behaviors (Christian Arzate Cruz et al) (summarized by Rohin): Many papers propose new algorithms that can better leverage human feedback to learn a good policy. This paper instead demonstrates an improved user interface so that the human provides better feedback, resulting in a better policy, on the game Super Mario Bros. Specifically:

1. The user can see the behavior of the agent and rewind / pause to find a place where the agent took a poor action.

2. The system generates an explanation in terms of the underlying state variables that explains why the agent chose the action it chose, relative to the second best action. It can also explain why it didn’t take a particular action.

3. The user can tell the agent that it should have taken some other action, and the agent will be trained on that instruction.

The authors conduct a user study and demonstrate that users find it intuitive to correct “bugs” in a policy using this interface.

Rohin’s opinion: This seems like a great line of research to me. While step 2 isn’t really scalable, since it requires access to the underlying simulator state, steps 1 and 3 seem doable even at scale (e.g. I can imagine how they would be done in Minecraft from pixels), and it seems like this should significantly improve the learned policies.

AI GOVERNANCE

Debunking the AI Arms Race Theory (Paul Scharre) (summarized by Sudhanshu): This article, published recently in the Texas National Security Review, argues that various national trends of military spending on AI do not meet the traditional definition of an ‘arms race’. However, the current situation can be termed a security dilemma, a “more generalized competitive dynamic between states.” The article identifies two ways in which race-style dynamics in AI competition towards the aims of national security might create new risks: (i) a need for increasingly rapid decision-making might leave humans with diminished control or ‘out of the loop’; and (ii) the pressure to quickly improve military AI capabilities could result in sacrificing supplementary goals like robustness and reliability, leading to unsafe systems being deployed.

The article offers the following strategies as remedies for such dynamics. Competing nations should institute strong internal processes to ensure their systems are robust and secure, and that human control can be maintained. Further, nations should encourage other countries to take similar steps to mitigate these risks within their own militaries. Finally, nations should cooperate in regulating the conduct of war to avoid mutual harm. It concludes after citing several sources that advocate for the US to adopt these strategies.

Sudhanshu’s opinion: I think the headline was chosen by the editor and not the author: the AI arms race ‘debunking’ is less than a fourth of the whole article, and it’s not even an important beat of the piece; instead, the article is about how use of technology/AI/deep learning for military applications in multipolar geopolitics can actually result in arms-race-style dynamics and tangible risks.

Even so, I’m not convinced that the traditional definition of ‘arms race’ isn’t met. The author invokes percentage growth in military spending of more than 10% over the previous year as a qualifying criterion for an arms race, but then compares this with the actual spending of 0.7% of the US military budget on AI in 2020 to make their case that there is no arms race. These two are not comparable; at the very least, we would need to know the actual spending on AI by the military across two years to see at what rate this spending changed, and whether or not it then qualifies to be an arms race.
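
The mismatch between the two quantities can be shown with a small calculation. The dollar figures below are hypothetical placeholders chosen only so that the share comes out at 0.7%; they are not actual spending data.

```python
def budget_share(ai_spend, total_budget):
    """Fraction of one year's total budget that goes to AI."""
    return ai_spend / total_budget


def year_over_year_growth(previous, current):
    """Relative change in spending between two consecutive years."""
    return (current - previous) / previous


# Hypothetical figures, in arbitrary units.
total_budget_2020 = 700.0
ai_spend_2019 = 4.0
ai_spend_2020 = 4.9

share = budget_share(ai_spend_2020, total_budget_2020)
growth = year_over_year_growth(ai_spend_2019, ai_spend_2020)

print(f"share of budget: {share:.1%}")    # 0.7%
print(f"year-over-year growth: {growth:.1%}")  # 22.5%
```

With these made-up numbers, a 0.7% budget share coexists with growth well above a 10% threshold, so the share alone cannot settle whether the growth criterion is met.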

NEWS

Hypermind forecasting contest on AI (summarized by Rohin): Hypermind is running a forecasting contest on the evolution of artificial intelligence with a $30,000 prize over four years. The questions ask both about the growth of compute and about performance on specific benchmarks such as the MATH suite (AN #144).

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

What links here?

BASALT: A Benchmark for Learning from Human Feedback by Rohin Shah (8 Jul 2021 17:40 UTC; 56 points)


lberglund, 23 Jun 2022 23:49 UTC (3 points)
2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.
OpenAI just did this exact thing.
- Rohin Shah, 24 Jun 2022 10:18 UTC (3 points)
  There’s some chance that this was because I talked about it with OpenAI / MineRL people, but overall I think it’s much more likely to be “different people independently came up with good projects”.
  Looking at the results I’m even more bullish than I already was about Minecraft as a good “alignment testbed”.
Pattern, 9 Jul 2021 18:24 UTC (2 points)
Criterion (a) implies that we can’t have automated evaluation of agents (otherwise that could be turned into a reward function) and so suggests that we use human evaluation of agents as our ground truth.
1. On the basis that it’s in Minecraft, if things were broken down into sub-tasks, some automated evaluation could be reasonable. For the final evaluation of the beauty of the waterfall, maybe not so much. It seems easy to ask questions that wouldn’t show up in an evaluation like ‘Is a waterfall more beautiful if there’s a cave behind it?’
2. In a similar fashion to 1, maybe automated evaluation could be used for a while (especially on tasks evaluation seems like it would work on, like ‘find water’), then changed.
3. There’s another idea in here where agents are trained to perform subtasks, and instead of a waterfall agent there’s a cliff finding agent, a water finding agent (stitched together by an always/regularly running road making agent that makes trails). An actual shared working memory would be more useful, but I’m not clear on how that would be trained. (Setting up incentives in the environment to try to get communication sounds tricky.) But if they can see what the other agents do, even if they’re not doing anything, then maybe the waterfall builder could remember where the water and the cliff are? This sounds complicated enough that it’s probably not worth the effort, though.

Human evaluators then compare two agents against each other and determine which is better. Agents are then given a score using the TrueSkill system.
1. I was surprised to see this because, as the Wikipedia link says:
TrueSkill is patented,[3] and the name is trademarked,[4] so it is limited to Microsoft projects and commercial projects that obtain a license to use the algorithm.
2. The conjunction of these sounds hilarious: if it’s not the same evaluators, then is it an objective measurement/score?
Though the whole point of this might be to take such things less seriously—once you’ve got agents that are building waterfalls and you’re debating which ones should be considered the best at making beautiful waterfalls, you might be all set. Getting AI to the point where it is judged via subjective, arbitrary standards is maybe more life like. (The history of the value of art over time, and how much it drifts relative to ‘when the artist is alive’ comes to mind.)

1. In Atari or MuJoCo, there are often only a few reasonable goals: for example, in Pong, you either hit the ball back, or you die. If you’re testing algorithms that are meant to learn what the goal is, you want an environment where there could be many possible goals, as is the case in Minecraft.
This might be changeable if you’re willing to take Pong and change it. Maybe remove the death mechanic, and come up with a goal. Like ‘lose at pong, when both players are trying to lose at pong’.
Pong itself is fairly simple, and not super supportive. This might be difficult enough to require in-game feedback, but consider Brick Breaker, where the goal is to break certain bricks, but not others. Maybe you don’t die if the ball falls below the screen, but you do die if you break any of the bricks which are a certain color. Or maybe a symbol moves around (inside and outside of bricks, at predefined spots regardless of whether there’s a brick there or not). If you hit it with the ball you lose (the ultimate anti-power up).
The above is clearly a bad idea if you want learning through other means because death communicates the goal. Remove that aspect, and it stops being so. Of course, taking Brick Breaker and making the goal for the player (ignoring the score) something like ‘remove every other column’ or ‘leave rows of bricks with ANY amount of space in between’ isn’t super complicated, and it isn’t a super open world.
Mostly, goals probably can’t be super complicated relative to an end product—absent a wide variety of levels/random initial generation. Rules about intermediate states/time, could be more elaborate, like ‘make a song, using the sounds of breaking bricks’, though that’s going to take some work to set up an interpretation method, and isn’t a ‘natural goal’.
Especially if you interpret ‘a red brick is this note, time in between brick being broken is ignored, a ‘glass brick’ being broken is a rest note/the sound of chimes’. (Rest notes for ‘the ball passed through a space where a brick used to be/would be if it hadn’t been broken’ are way too elaborate, and would need a lot of level design and redesign to make it workable.)

2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.
It’s not immediately clear what is meant by this statement.

3. The “true reward function” in Atari or MuJoCo is often not a great evaluation: for example, a Hopper policy trained to stand still using a constant reward gets 1000 reward!
There’s also the ‘glitch to get huge reward’ issue. Or is that an issue?
Human evaluations should not be subject to the same problem.
Because they’ll change over time? People will say ‘that’s better than not dying, it’s doing fine, it’s just getting started’. And then later ‘this is boring. Move. Do something!’

4. Since the tasks were chosen to be inherently fuzzy and challenging to formalize, researchers are allowed to take whatever approach they want to solving the task, including “try to write down a reward function”. In contrast, for something like Atari or MuJoCo, you need to ban such strategies. The only restriction is that researchers cannot extract additional state information from the Minecraft simulator.
I look forward to this AI research milestone: researchers game the system. (Seriously though, training an agent/NN/whatever to predict state based on observations, then like, get rid of the state reveal....)

6. Since Minecraft is so popular, it is easy to hire Minecraft experts, allowing us to design algorithms that rely on expert time instead of just end user time.
You could also try amateur time.

7. Unlike Atari or MuJoCo, BASALT has a clear path to scaling up: the tasks can be made more and more challenging. In the long run, we could aim to deploy agents on public multiplayer Minecraft servers that follow instructions or assist with whatever large-scale project players are working on, all while adhering to the norms and customs of that server.
Elaborating on ‘amateur time’ more—you could set them loose on a server, untrained. Users might enjoy, who knows, just randomly killing them (or ones with terrible/ugly waterfalls).

The ‘read more’ at the end of the summary for that paper was really great design. (As compared to the fact that frequently submitting an in-progress comment requires hitting save, then scrolling up, then clicking edit, then scrolling down...every. single. time.)

For example, later in this newsletter we’ll see that synthetically generated demos are not a good proxy for human demos
I’m not sure what demos means in that sentence, but if it means benchmarks...
Create a beautiful picture (with additional constraints) just seems like it can be generated.
House. Cavern. Ravine. Giant blocky Minecraft art of [generated part of prompt]. A good Minecraft cart system. A circuit diagram looking down. A circuit diagram as seen on the side of a wall.
A redstone circuit for [X], along with pictures showing someone how to build it that works. (This one seems way too ambitious—today!)

In theory, building a waterfall might be just...building trenches that extend it. (Figuring out how to build buckets to move water around seems like a bigger challenge, though one there are a lot of “expert” demonstrations of in videos, say on YouTube, or wherever.)

I’d encourage readers to participate in the competition, because I think it’s especially good as a way to get started with ML research. It’s a new benchmark, so there’s a lot of low-hanging fruit in applying existing ideas to the benchmark, and in identifying new problems not present in previous benchmarks and designing solutions to them. It’s also pretty easy to get started: the BC baseline is fairly straightforward and takes a couple of hours to be trained on a single GPU. (That’s partly because BC doesn’t require environment samples; something like GAIL (AN #17) would probably take a day or two to train instead.)
Good to see that tip there. (Partially for future reasons—I haven’t seen a lot of stuff on how AI research has changed over time, and it seems obviously easier to figure out for the current time.*)
*Obviously someone doing that intentionally would be able to do more, like save that information together with copies of ‘standard’ versions for a given point in time, along with time estimates before and after, and other information. (Like cost of electricity. Longer term: changes in hardware and effects.)

Hypermind forecasting contest on AI (summarized by Rohin): Hypermind is running a forecasting contest on the evolution of artificial intelligence with a $30,000 prize over four years. The questions ask both about the growth of compute and about performance on specific benchmarks such as the MATH suite (AN #144).
If projects like that leave/make good records, then that might be less of an issue quantitatively. What qualitative factors are important in AI or to users/trainers/etc., and how those might be recorded, is less clear (especially to ‘an outsider’).
- Rohin Shah, 9 Jul 2021 21:25 UTC (2 points)
  On the basis that it’s in Minecraft, if things were broken down into sub-tasks, some automated evaluation could be reasonable.
  I agree this is possible in Minecraft. The point I was trying to make is that we were trying to make environments where you can’t have automated evaluation, because as soon as you have automated evaluation, you can make an automated reward function—and the whole point was to create tasks where you can’t make an automated reward function.
  (Technically, since we ban participants from using internal state, we could have tried to create tasks with automated evaluation based on internal state. But when we thought through this we didn’t think of tasks we liked as much.)
  TrueSkill is patented
  Huh, I hadn’t noticed that, will have to look into it (though we are sponsored by Microsoft).
  The conjunction of these sounds hilarious: if it’s not the same evaluators, then is it an objective measurement/score?
  It is not replicable, i.e. you cannot run the same evaluation process and get the same number out. It should be reproducible, in that if you rerun the evaluation process you should get similar results. (The paper has some discussion of this.)
  Getting AI to the point where it is judged via subjective, arbitrary standards is maybe more life like.
  Yeah, I agree with this.
  This might be changeable if you’re willing to take Pong and change it.
  I agree things like this could be done, but usually I’m not very excited about the result, because it will still feel unrealistic to me on some axis that I think matters. (Though I expect you could easily improve over Atari / MuJoCo, which really are not good benchmarks for LfHF approaches.)
  It’s not immediately clear what is meant by this statement.
  From the blog post that’s being summarized:
  A sketch of such an approach would be:
  1. Create a dataset of YouTube videos paired with their automatically generated captions, and train a model that predicts the next video frame from previous video frames and captions.
  2. Train a policy that takes actions which lead to observations predicted by the generative model (effectively learning to imitate human behavior, conditioned on previous video frames and the caption).
  3. Design a “caption prompt” for each BASALT task that induces the policy to solve that task.
  There’s also the ‘glitch to get huge reward’ issue. Or is that an issue?
  That would be an issue if LfHF approaches tended to discover the glitch. However, LfHF approaches usually don’t do this, since humans don’t give feedback that pushes the agent towards the glitch.
  Because they’ll change over time? People will say ‘that’s better than not dying, it’s doing fine, it’s just getting started’. And then later ‘this is boring. Move. Do something!’
  We’re just talking about evaluation here (i.e. human judgments of the final trained agent’s performance). If you ask a human to judge whether the robot is moving quickly to the right, and then they see the “standing still” policy, they are going to assign that policy a very low score.
  You could also try amateur time.
  Yeah, that would also be interesting. We didn’t mention it because that’s more commonly done in the existing literature. (Assuming that by “amateur time” you mean “feedback from humans who are amateurs at the game”.)
  As compared to the fact that frequently submitting an in-progress comment requires hitting save, then scrolling up, then clicking edit, then scrolling down...every. single. time.
  Protip: when you find yourself doing this, consider opening a duplicate tab—one for reading and one for writing the comment.
  I’m not sure what demos means in that sentence, but if it means benchmarks...
  It meant demonstrations of the task, sorry (I probably should have used the full word “demonstrations”). So, for example, this could be a video of a human creating a waterfall (along with a record of what keys they pressed to do it).
  - Pattern, 9 Jul 2021 22:11 UTC (2 points)
    and the whole point was to create tasks where you can’t make an automated reward function.
    I was gesturing towards partially automatable—the whole can’t. (Specifically find water—unless that can be crafted now, or mods are in play. (Find ice might also work though.)) Handcrafted: move, don’t hold still.
    Protip: when you find yourself doing this, consider opening a duplicate tab—one for reading and one for writing the comment.
    I was referring to the ‘editing a comment’ process. Though thanks for the tip, I do use that a lot.
    Pong
    Yeah. I mostly found it interesting to think about because it seems like a simpler environment (and might be easier to train), but the results would be a lot less interesting. And at the limit of modification, Edited Pong becomes like Minecraft 2.
    Arguably Minecraft is the sort of game that embedding a mini game (say via a primitive redstone controller going to some sort of arcade machine mockup) could kind of work within. (Agents might be uninterested in such a machine if it doesn’t affect its environment at all, and simple reward distribution setups could just be found by disassembling it.)