Human-AI Collaboration

We’ve just released our paper on human-AI collaboration. The paper makes a point that seems straightforward to me: self-play training is not going to work as well with humans in collaborative settings as it does in competitive settings. Basically, humans cause a distributional shift for the self-play agent. In the competitive case this is fine, because the self-play agent should move towards the minimax policy, which has the nice property of guaranteeing a certain level of reward regardless of the opponent. The collaborative case has no such guarantee, and the distributional shift can tank team performance. We demonstrated this empirically on a simplified version of the couch co-op game Overcooked (which is amazing; I’ve played through both Overcooked games with friends).
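
To spell out the asymmetry (this is my gloss, not notation from the paper): in a two-player zero-sum game, a minimax policy $\pi_1^*$ satisfies

$$\min_{\pi_2} U_1(\pi_1^*, \pi_2) \;=\; \max_{\pi_1} \min_{\pi_2} U_1(\pi_1, \pi_2) \;=\; V,$$

so it is guaranteed at least the game value $V$ no matter which partner it is paired with, human or otherwise. In a common-payoff game, self-play instead finds a pair $(\pi_1^*, \pi_2^*)$ that (approximately) maximizes the shared reward $U(\pi_1, \pi_2)$; that only tells you about performance when paired with $\pi_2^*$, and the team reward $U(\pi_1^*, \pi_H)$ for a human policy $\pi_H$ can be far below it.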

As with a previous post, the rest of this post assumes that you’ve already read the blog post accompanying the paper. I’ll speculate about how the general area of human-AI collaboration is relevant for AI alignment. Think of these as after-the-fact rationalizations of the research.

It’s necessary for assistance games

Assistance games (formerly called CIRL games) involve a human and an agent working together to optimize a shared objective that only the human knows. I think the general framework makes a lot of sense. Unfortunately, assistance games are extremely intractable to solve. If you try to scale up the full assistance game with deep RL, you’re limited to environments that aren’t very strategically complex, because it’s hard to do preference learning and coordination simultaneously. This suggests trying to make progress on subproblems within assistance games.
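
For concreteness, here is a rough version of the formal setup (paraphrasing the original CIRL paper from memory, so treat the details as approximate): an assistance game is a two-player game

$$\langle S, \{A^H, A^R\}, T(s' \mid s, a^H, a^R), \Theta, R(s, a^H, a^R; \theta), P_0(s_0, \theta), \gamma \rangle$$

where the reward parameter $\theta$ is sampled at the start of the game and observed only by the human, and both the human and the agent act to maximize the same expected discounted reward $\mathbb{E}\left[\sum_t \gamma^t R(s_t, a^H_t, a^R_t; \theta)\right]$.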

Usually, when people talk about making progress on “the CIRL agenda”, they are talking about the preference learning aspect of an assistance game. We typically simplify to a single-agent setting and do preference learning, as in learning from comparisons or demonstrations. However, a useful agent will also need to properly coordinate with the human in order to be efficient. This suggests work on human-AI collaboration. We can work on this problem independently of preference learning simply by assuming that the agent knows the true reward function. This is exactly the setting that we study.
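
Here’s a minimal toy sketch of that setting (mine, not the paper’s code; all names and numbers are made up for illustration): a common-payoff “pick a side” game where the reward function is known to everyone and the only problem is coordination. An agent that best-responds to its self-play partner does great in self-play and poorly with a human who happens to use a different convention, while an agent that best-responds to even a crude model of the human does much better.

import numpy as np

rng = np.random.default_rng(0)

def team_reward(a1, a2):
    """Shared, known reward: the team succeeds only if both players pick the same side."""
    return 1.0 if a1 == a2 else 0.0

# Self-play converged to *some* convention; suppose it settled on "left".
def self_play_agent():
    return "left"

def self_play_partner():
    return "left"

# A hypothetical human who mostly prefers the other convention.
def human():
    return "right" if rng.random() < 0.8 else "left"

# An agent that best-responds to a crude model of the human:
# here, just the human's empirical action frequencies.
human_model = {"left": 0.2, "right": 0.8}

def human_aware_agent():
    return max(human_model, key=human_model.get)

def average_reward(agent, partner, episodes=10_000):
    return np.mean([team_reward(agent(), partner()) for _ in range(episodes)])

print("self-play agent + self-play partner:", average_reward(self_play_agent, self_play_partner))  # ~1.0
print("self-play agent + human:            ", average_reward(self_play_agent, human))              # ~0.2
print("human-aware agent + human:          ", average_reward(human_aware_agent, human))            # ~0.8

The real environments (and the real distributional shift) are much richer than this, but the shape of the failure is the same: the self-play pair’s performance tells you little about performance with the human.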

In general, I expect that if one hopes to take an assistance-game-like approach to AI alignment, work on human-AI collaboration will be necessary. The main uncertainty is whether assistance games are the right approach. Under a learning-based model of AI development, I think it is reasonably likely that the assistance game paradigm will be useful, without solving all problems (in particular, it may not solve inner alignment).

It seems important to figure out coordination

Regardless of whether we use assistance games, it’s probably worthwhile to figure out how an AI system should coordinate with another agent that is not like itself. I don’t have a concrete story here; it’s just a broad intuition.

It leads to more human-AI research

On my model, the best reason for optimism is that researchers will try to build useful AI systems, they’ll run into problems, and then they’ll fix those problems. Under this model, a useful intervention is to discover the problems sooner. This isn’t completely clear: maybe if you discover the problems sooner, the root causes aren’t as obvious, and you are less likely to fix the entire problem. But I think the main effect is in fact an increase in safety.

This would be my guess for how this research will most impact AI safety. We (by which I mean mostly Micah and, to a lesser extent, me) spent a bunch of time cleaning up the code, making it easy for others to work with, creating nice figures, writing up a good blog post, and so on, in an effort to get other ML researchers to actually make progress on these issues. (Though I wouldn’t be too surprised if other researchers end up using the environment for a different purpose.)