Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
Humans seem to solve the motivation subproblem, whereas humans don’t seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can’t really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
It is easier to apply to systems without much capability, though as the post notes, the system probably still needs some minimum level of capability. While a digit recognition system is useful, it doesn’t seem meaningful to talk about whether it is “trying” to help us.
Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart’s Law suggests that you can get very bad outcomes. With motivation-competence, I’ve already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what “slightly worse” means for motivation, but I’m optimistic.
We’ve been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we’ve found a lot of negative results and not very many positive ones.
The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).
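The Goodhart’s-Law point from the list above can be made concrete with a toy sketch. This is my own illustrative example with made-up utility functions, not anything from the post: a “slightly wrong” proxy objective agrees with the true objective in the ordinary regime, but under strong optimization the proxy’s argmax lands somewhere terrible by the true objective.

```python
# Toy illustration of Goodhart's Law: hard optimization of a slightly-wrong
# proxy objective selects an outcome that is very bad under the true
# objective. All functions and numbers here are invented for illustration.

def true_utility(x):
    # What we actually want: more x is good up to a point, then harmful.
    return x - 0.1 * x ** 2

def proxy_utility(x):
    # A "slightly wrong" specification: captures "more x is good" but
    # misses the downside at extreme values.
    return x

candidates = range(101)
best_by_proxy = max(candidates, key=proxy_utility)  # picks the extreme: 100
best_by_truth = max(candidates, key=true_utility)   # picks the moderate: 5

print(best_by_proxy, true_utility(best_by_proxy))   # 100 -900.0
print(best_by_truth, true_utility(best_by_truth))   # 5 2.5
```

In the ordinary regime (small x) the proxy and the true utility agree closely; it is only the optimizer pushing to the extreme that turns a small specification error into a catastrophic outcome, which is the sense in which safety guarantees degrade sharply rather than smoothly.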
The cons are:
It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
There hasn’t been much work done in this paradigm, so it is not obvious that there is progress to make.
I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future. ETA: Written up in the chapter on Goals vs Utility Functions in the Value Learning sequence, particularly in Coherence arguments do not imply goal-directed behavior.
they expected it not to transfer well to domains involving long term planning and hidden information.
AGZ was doing some long term planning (over the timescale of a Go game), and had no hidden information. It certainly was not clear whether similar techniques would work when trajectories were tens of thousands of steps long instead of hundreds. Similarly, it wasn’t clear how to make things work with hidden information—you could try the same thing, but it was plausible it wouldn’t work.
I didn’t see why it wouldn’t be able to handle hidden information or long term planning.
Yeah, I agree with this. This sounds mostly like a claim that it is more computationally expensive to deal with hidden information and long term planning.
Is this significantly different from AGZ?
Yes. There’s no Monte Carlo Tree Search, because they want the agent to learn from experience (so the agent is not allowed to simulate how the game would go, because it doesn’t know the rules). The reward function is shaped, so that the agent actually gets feedback on how it is doing, whereas AGZ only had access to the binary win/loss signal. But I think it’s fair to say that it’s the same genre of approach.
If they instead allowed themselves to say “the agent knows the rules of the game”, as in Go, such that it can simulate various branches of the game, they probably could take an AGZ approach with a shaped reward function, and my guess is it would just work, probably faster than their current approach.
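The sparse-vs-shaped distinction above can be sketched in a few lines. This is a toy grid example of my own, not OpenAI Five’s actual reward function (which shapes on many game-specific signals like gold, kills, and objectives):

```python
# Toy contrast between a sparse, AGZ-style win/loss reward and a shaped
# reward that gives intermediate per-step feedback. The environment (walk
# from position 0 to a goal) and the shaping term are invented for
# illustration only.

GOAL = 10

def sparse_reward(position, done):
    # Binary signal delivered only at the end of the episode:
    # win (+1) if the goal was reached, loss (-1) otherwise.
    if not done:
        return 0.0
    return 1.0 if position >= GOAL else -1.0

def shaped_reward(position, previous_position, done):
    # Same terminal signal, plus dense feedback on every step:
    # small positive reward for progress toward the goal.
    progress = 0.1 * (position - previous_position)
    return progress + sparse_reward(position, done)

# Mid-episode, the sparse reward is silent while the shaped reward
# already tells the agent it is moving in the right direction.
print(sparse_reward(3, done=False))      # 0.0
print(shaped_reward(3, 2, done=False))   # 0.1
```

The point of shaping is exactly the one in the comment above: with only the sparse signal, the agent gets no gradient of feedback until the episode ends, which is what makes very long trajectories so hard to learn from.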
is there hidden information here? (I would have assumed so, except the “they all observed the full Dota 2 state” might imply otherwise – does that mean things that are supposed to be visible to a player, or the entire map?)
Yes, there is. By “full Dota 2 state” I meant everything visible to all five players, not the entire map. This is more than humans have access to, but certainly not everything. And humans can get access to all of the information by communicating with other players on the team.
Were their gears wrong? Were they right, just… not sufficient? (i.e. does the fact that this was easier than people expected imply anything interesting?)
At this point I’m speculating really hard, but I think it was just that our gears were wrong somewhere. It’s possible that difficult tasks like Dota actually only have a small number of realistic strategies and they can each be learned individually. (Small here is relative to what their massive amount of compute can do—it’s plausible that it learned thousands of realistic strategies, by brute force and memorization. You could say the same about AGZ.) In this scenario, the gear that was wrong was the one that predicted “the space of strategies in complex tasks is so large that it can’t be memorized from experience”. (Whereas with Go, it seems more plausible that humans also rely on these sorts of intuitive mechanisms rather than strategic reasoning, so it’s not as surprising that a computer can match that.) It could also be that RL is actually capable of learning algorithms that reason symbolically/logically, though I wouldn’t bet on it. It could be that actually we’re still quite far from good Dota bots and OpenAI Five is beating humans on micromanagement of actions, and has learned sufficient strategy to not be outclassed but is still decidedly subhuman at long term planning, and researchers’ gears actually were right. I don’t know.
Obligatory: Coherence arguments do not imply goal-directed behavior
Also Coherent behaviour in the real world is an incoherent concept
Strongly agree. Another benefit is that it exposes you to a broader swath of the world, which makes your models of the world better / more generalizable. I often feel like the rationalist community has “beliefs about people” that I think only apply to a small subset of people, e.g.
People need to find meaning in their jobs to be happy
Everyone thinks that the thing that they are doing is “good for the world” or “morally right” (as opposed to thinking that the thing they are doing is justifiable / reasonable to do)
This post argues that AI researchers and AI organizations have an incentive to predict that AGI will come soon, since that leads to more funding, and so we should expect timeline estimates to be systematically too short. Besides the conceptual argument, we can also see this in the field’s response to critics: both historically and now, criticism is often met with counterarguments based on “style” rather than engaging with the technical meat of the criticism.
I agree with the conceptual argument, and I think it does hold in practice, quite strongly. I don’t really agree that the field’s response to critics implies that they are biased towards short timelines—see these comments. Nonetheless, I’m going to do exactly what this post critiques, and say that I put significant probability on short timelines, but not explain my reasons (because they’re complicated and I don’t think I can convey them, and certainly can’t convey them in a small number of words).
Disincentives for me personally:
The LW/AF audience by and large operates under a set of assumptions about AI safety that I don’t really share. I can’t easily describe this set, but one bad way to describe it would be “the MIRI viewpoint” on AI safety. This particular disincentive is probably significantly stronger for other “ML-focused AI safety researchers”.
More effort needed to write comments than to talk to people IRL
By a lot. As a more extreme example, on the recent pessimism for impact measures post, TurnTrout and I switched to private online messaging at one point, and I’d estimate it was ~5x faster to get to the level of shared understanding we reached than if we had continued with typical big comment responses on AF/LW.