Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
Humans seem to solve the motivation subproblem, whereas humans don’t seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can’t really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
It is easier to apply to systems without much capability, though as the post notes, the system probably still needs some minimum level of capability. While a digit recognition system is useful, it doesn’t seem meaningful to talk about whether it is “trying” to help us.
Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart’s Law suggests that you can get very bad outcomes. With motivation-competence, I’ve already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what “slightly worse” means for motivation, but I’m optimistic.
We’ve been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we’ve found a lot of negative results and not very many positive ones.
The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).
The cons are:
It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
There hasn’t been much work done in this paradigm, so it is not obvious that there is progress to make.
I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future.
they expected it not to transfer well to domains involving long term planning and hidden information.
AGZ was doing some long term planning (over the timescale of a Go game), and had no hidden information. It certainly was not clear whether similar techniques would work when trajectories were tens of thousands of steps long instead of hundreds. Similarly, it wasn’t clear how to make things work with hidden information: you could try the same thing, but it was plausible that it wouldn’t work.
I didn’t see why it wouldn’t be able to handle hidden information or long term planning.
Yeah, I agree with this. This sounds mostly like a claim that it is more computationally expensive to deal with hidden information and long term planning.
Is this significantly different from AGZ?
Yes. There’s no Monte Carlo Tree Search, because they want the agent to learn from experience (so the agent is not allowed to simulate how the game would go, because it doesn’t know the rules). The reward function is shaped, so that the agent actually gets feedback on how it is doing throughout the game, whereas AGZ only had access to the binary win/loss signal. But I think it’s fair to say that it’s the same genre of approach.
If they had instead allowed themselves to say “the agent knows the rules of the game”, as in Go, so that it could simulate various branches of the game, they probably could have taken an AGZ-style approach with a shaped reward function, and my guess is that it would just work, probably faster than their current approach.
is there hidden information here? (I would have assumed so, except the “they all observed the full Dota 2 state” might imply otherwise – does that mean things that are supposed to be visible to a player, or the entire map?)
Yes, there is. By “full Dota 2 state” I meant everything visible to all five players, not the entire map. This is more than humans have access to, but certainly not everything. And humans can get access to all of the information by communicating with other players on the team.
Were their gears wrong? Were they right, just… not sufficient? (i.e. does the fact that this was easier than people expected imply anything interesting?)
At this point I’m speculating really hard, but I think it was just that our gears were wrong somewhere. It’s possible that difficult tasks like Dota actually only have a small number of realistic strategies and they can each be learned individually. (Small here is relative to what their massive amount of compute can do: it’s plausible that it learned thousands of realistic strategies, by brute force and memorization. You could say the same about AGZ.) In this scenario, the gear that was wrong was the one that predicted “the space of strategies in complex tasks is so large that it can’t be memorized from experience”. (Whereas with Go, it seems more plausible that humans also rely on these sorts of intuitive mechanisms rather than strategic reasoning, so it’s not as surprising that a computer can match that.) It could also be that RL is actually capable of learning algorithms that reason symbolically/logically, though I wouldn’t bet on it. It could be that we’re actually still quite far from good Dota bots, and that OpenAI Five is beating humans on micromanagement of actions, and has learned sufficient strategy to not be outclassed but is still decidedly subhuman at long term planning, in which case researchers’ gears actually were right. I don’t know.
To the point of peer review, many AI safety researchers already get peer review by circulating their drafts among other researchers.
It seems to me that this is only a good use of your time if the journal became respectable. (Otherwise you barely increase visibility of the field, no one will care about publishing in the journal, and it doesn’t help academics’ careers much.) There can even be a negative effect where AI safety is perceived as “that fringe field that publishes in <journal>”, which makes AI researchers more reluctant to work on safety.
I don’t know how a journal becomes respectable, but I would expect that it’s hard and takes a lot of work (and probably luck), and I would want to see a good plan for how the journal will become respectable before I’d be excited to see this happen. I would guess that this wouldn’t be doable without the effort of a senior AI/ML researcher.
I’m confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money… shouldn’t it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn’t give up the money.
(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that’s the argument, then why can’t the AI go through the same reasoning?)
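To make the argument in the parenthetical concrete, here is a minimal expected-value sketch of the counterfactual mugging. The payoff numbers ($100 cost, $10,000 reward) and the 50/50 prior are illustrative assumptions, not from the post; the point is only that the ranking of policies flips once the digit is known.

```python
# Toy expected-value calculation for counterfactual mugging.
# Omega (a perfect predictor) looks at, say, the millionth digit of pi.
# If the digit is odd, Omega asks the agent for $100.
# If the digit is even, Omega pays $10,000 -- but only if it predicts
# the agent would have paid in the odd case.
# (Payoffs and probabilities here are illustrative assumptions.)

PAY_COST = 100
REWARD = 10_000

def expected_value(pays_when_asked: bool, p_digit_odd: float) -> float:
    """EV for an agent assigning probability p_digit_odd to 'digit is odd'."""
    ev_if_odd = -PAY_COST if pays_when_asked else 0
    ev_if_even = REWARD if pays_when_asked else 0
    return p_digit_odd * ev_if_odd + (1 - p_digit_odd) * ev_if_even

# Before learning the digit (uniform uncertainty), committing to pay wins:
assert expected_value(True, 0.5) > expected_value(False, 0.5)

# Once the agent knows the digit is odd (p = 1), paying just loses $100:
assert expected_value(True, 1.0) < expected_value(False, 1.0)
```

The same computation an uncertain human would run to justify paying is available to an AI that knows the digit, which is why it seems like the AI can go through the same reasoning.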
Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.
If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don’t know that the AI knows the winning numbers) I would disprefer that action. Even better would be if it explained to me what it was doing, after which I would prefer the action, but let’s say that wasn’t possible for some reason (maybe it performed a very complex simulation of the world to figure out the winning number).
It seems like if the AI knows my utility function and is optimizing it, that does perform well. Now for practical reasons, we probably want to instead build an AI that does what we prefer it to do, but this seems to be because it would be hard to learn the right utility function, and errors along the way could lead to catastrophe, not because it would be bad for the AI to optimize the right utility function.
ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.
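The divergence in the strawman version can be shown with a toy sketch. The lottery setup, ticket names, and policies below are all invented for illustration; the only claim is that the two approaches prescribe different actions exactly when the AI knows something the human doesn’t.

```python
# Toy illustration: imitation learning vs. IRL-style value learning
# when the AI has information the human lacks.
# Hypothetical setup: two lottery tickets, "A" and "B". The human
# doesn't know which wins; the AI does. All names are illustrative.

WINNING_TICKET = "B"  # known to the AI, not to the human

def human_policy() -> str:
    # With no information, the human just habitually picks "A".
    return "A"

def imitation_agent() -> str:
    # Imitation learning: reproduce the human's observed action,
    # inheriting the human's ignorance.
    return human_policy()

def value_learning_agent() -> str:
    # IRL-style: infer that the human wants to win, then act on the
    # AI's own (better) information.
    return WINNING_TICKET

assert imitation_agent() == "A"       # copies the human's mistake
assert value_learning_agent() == "B"  # exploits the AI's extra knowledge
```

In this toy case the value-learning agent buys the winning ticket (the action the human would endorse on reflection), while the imitation agent does what the human would actually do.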
If you complete the track in its entirety, you should be ready to understand most of the work in AI safety.
Is this specific to MIRI or would it also include the work done by OpenAI, DeepMind, FHI, and CHAI? I don’t see any resources on machine learning currently but perhaps you intend to add those later.