I agree that the fact that humans are quite good at inferring preferences should give us optimism about value learning. In the framework of rationality with a mistake model, I interpret this post as trying to infer the mistake model from the way that humans infer preferences about other humans. I’m not sure whether this sidesteps the impossibility result, but it seems plausible that it does.
What would be the source of data for learning a mistake model? It seems like we have to make some assumption about how the data source leads to a mistake model, since the data source is probably going to be a subset of the full human policy, and the impossibility result holds even with access to the full human policy.
I think this argument applies well to the case where we are trying to communicate, but not so much to the case where I individually am thinking about a problem. Communication is hard, and if the speaker uses some intuitive assumption, chances are the listener will interpret it differently from what the speaker intended, so being very precise seems quite helpful. However, when I’m thinking through a problem myself and I make an assumption, I usually have a fairly detailed intuitive model of what I mean, such that if you asked me whether my assumption presumes that problem X is solved, I could answer, even though I don’t have a precise formulation of the assumption. Making the assumption more precise would be quite a lot of work and probably would not improve my thinking on the topic much, so I tend not to do it until I think there’s some insight and want to make the argument more rigorous. It seems to me that this is how most research makes progress: individual researchers have intuitions that they then make rigorous and precise.
Sorry, I wasn’t clear enough. What is the process which both:
1. Sends the signal to the motor controls to twitch, and
2. Infers that it could break or be interfered with, and sends signals to the motor controls that cause it to be in a universe-state where it is less likely to break or be interfered with?
I claim that for any such reasonable process, if there is a notion of a “goal” in this process, I can create a goal that rationalizes the “always-twitch” policy. If I plug the goal I construct into the program you suggest, the resulting policy always twitches, even if it infers that it could break or be interfered with.
The “reasonable” constraint is to avoid processes like “Maximize expected utility, except in the case where you would always twitch, in that case do something else”.
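To make this concrete, here is a minimal sketch of the kind of construction I have in mind (the notation is mine; outcomes are taken to be full universe-histories):

```latex
U(h) =
  \begin{cases}
    1 & \text{if the agent twitches at every timestep of history } h, \\
    0 & \text{otherwise.}
  \end{cases}
```

Any policy that ever fails to twitch reaches a history with utility 0, so the always-twitch policy maximizes expected utility under U, no matter what the process infers about breaking or interference.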
How do you imagine the real agent working? Can you describe the process by which it chooses actions?
This seems right (though I have some apprehension around talking about “parts” of an AI). From the perspective of proving a theorem, it seems like you need some sort of assumption on what the rest of the AI looks like, so that you can say something like “the goal-directed part will outcompete the other parts”. Though perhaps you could try defining goal-directed behavior as the sort of behavior that tends to grow and outcompete things—this could be a useful definition? I’m not sure.
I don’t suppose you could clarify exactly how this twitching agent is defined? In particular, how does its utility accumulate over time? Do you get 1 utility for each point in time at which you twitch, with your total utility being the undiscounted sum of these utilities?
I am not defining this agent using a utility function. It turns out that because of coherence arguments and the particular construction I gave, I can view the agent as maximizing some expected utility.
I like Gurkenglas’s suggestion of a random number generator hooked up to motor controls, let’s go with that.
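To be concrete, here’s a minimal sketch of the kind of agent I mean (the `MotorControls` interface is hypothetical; only the overall shape matters):

```python
import random

class MotorControls:
    """Hypothetical motor interface; the details don't matter."""
    def send(self, signal: float) -> None:
        pass  # would move an actuator by `signal`

def twitch_forever(motors: MotorControls) -> None:
    # Random numbers hooked up to motor controls: there is no world
    # model, no planning, and no representation of "breaking down"
    # anywhere in this loop. It cannot notice interference with
    # itself, let alone act to prevent it.
    while True:
        motors.send(random.uniform(-1.0, 1.0))
```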
An agent that constantly twitches could still be a threat if it were trying to maximize the probability that it would actually twitch in the future. For example, if it were to break down, it wouldn’t be able to twitch, so it might want to gain control of resources.
Yeah, but it’s not trying to maximize that probability. I agree that a superintelligent agent that is trying to maximize the amount of twitching it does would be a threat, possibly by acquiring resources. But motor controls hooked up to random numbers certainly won’t do that.
If your robot powered by random numbers breaks down, it indeed will not twitch in the future. That’s fine; clearly it must have been maximizing a utility function that assigned utility 1 to it breaking at that exact moment in time. Jessica’s construction below would also work, but it’s specific to the case where you take the same action across all histories.
In the setting where your outcomes are universe-histories, coherence is vacuous, so nobody should care about that kind of coherence.
In the setting where you have some sort of contradictory preferences, because your preferences are over higher-level concepts than particular universe-histories, you probably do care about coherence theorems. It seems possible that this is the same as the class of goal-directed behaviors, but even if so, I’m not sure what implications that has. E.g., I don’t think it changes anything about the arguments I’m making in this post.
I could imagine this happening with standard deep RL over a long enough time horizon with enough compute. Again though, I want to defer to the upcoming sequence on the topic, which should have a good in-depth explanation.
Do you have a citation for this?
I briefly looked for and did not find a good citation for this.
Who are you arguing against, or whose argument are you trying to clarify?
I’m not sure. However, I have a lot of conversations where it seems to me that the other person believes the Misspecified Goal argument. Currently, if I were to meet a MIRI employee I hadn’t met before, I would be unsure whether the Misspecified Goal argument is their primary reason for worrying about AI risk. If I meet a rationalist who takes the MIRI perspective on AI risk but isn’t at MIRI themselves, by default I assume that their primary reason for caring about AI risk is the Misspecified Goal argument.
I do want to note that I am primarily trying to clarify here, I didn’t write this as an argument against the Misspecified Goal argument. In fact, conditional on the AI having goals, I do agree with the Misspecified Goal argument.
I tend to have a different version of the Misspecified Goal argument in mind, which I think doesn’t have this problem.
Yeah, I think this is a good argument, and I want to defer to my future post on the topic, which should come out on Wednesday. The TL;DR is that I agree with the argument but it implies a broader space of potential solutions than “figure out how to align a goal-directed AI”.
(Sorry that I didn’t adequately point to different arguments and what I think about them—I didn’t do this because it would make for a very long post, and it’s instead being split into several posts, and this particular argument happens to be in the post on Wednesday.)
Yeah, that’s right. There’s some sort of goal/specification/desire that picks out the particular behavior we want out of the large space of possible behaviors. However, that goal/specification/desire need not be internal to the AI system, and it need not be long-term.
(Side note: it’s not that giant)
An attempt to distill the key idea behind approval-directed agents:
Goal-directed behavior requires an extremely intelligent overseer in order to ensure that the agent is pointed at the correct goal (as opposed to one the overseer thinks is correct but is actually slightly wrong). I think of approval-directed agents as providing the intuition that we may only require an overseer that is slightly smarter than the agent in order for the agent to be aligned. This is because the overseer can simply “tell” the agent what actions to take, and if the agent makes a mistake, or tries to optimize a heuristic too hard, the overseer can notice and correct it interactively. (This is assuming that we solve informed oversight, so that the agent doesn’t have information that is hidden from the overseer, and “intelligence” is the main thing that matters.) Only needing a slightly smarter overseer opens up a new space of solutions where we start with a human overseer and a subhuman AI system, and scale both the overseer and the AI at the same time while preserving alignment at each step.
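A minimal sketch of the action-selection loop this suggests (the `propose` and `approval` callables are hypothetical stand-ins for the agent’s candidate generation and the overseer’s judgment):

```python
from typing import Callable, List, TypeVar

Action = TypeVar("Action")

def approval_directed_step(
    propose: Callable[[], List[Action]],   # agent generates candidate actions
    approval: Callable[[Action], float],   # overseer rates each candidate
) -> Action:
    """Take the candidate action the overseer most approves of.

    The agent optimizes the overseer's judgment of each individual
    action rather than an internal long-term objective, which is why
    an overseer only slightly smarter than the agent can notice and
    correct mistakes interactively (assuming informed oversight).
    """
    candidates = propose()
    return max(candidates, key=approval)
```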
It seems to me that a technology that could make a human smarter is much more likely (compared to AI) to accelerate all forms of intellectual progress (e.g., technological progress and philosophical/moral progress) about equally, and therefore would have a less significant effect on the kinds of problems that I’m talking about (which are largely caused by technological progress outpacing philosophical/moral progress).
Yes, I agree with this. The reason I mentioned that was to make the point that the problems are a function of progress in general and aren’t specific to AI—they are just exacerbated by AI. I think this is a weak reason to expect that solutions are likely to come from outside of AI.
Assuming the above, and assuming that one has moral uncertainty that gives some weight to the concept of moral responsibility, it seems to me that an additional argument for AI researchers to work on these problems is that it’s a moral responsibility of AI researchers/companies to try to solve problems that they create, for example via technological solutions, or by coordinating amongst themselves, or by convincing policymakers to coordinate, or by funding others to work on these problems, etc., and they are currently neglecting to do this.
This seems true. Just to make sure I’m not misunderstanding, this was meant to be an observation, and not meant to argue that I personally should prioritize this, right?
Why isn’t that also an argument against the urgency of solving AI motivation? I.e., we don’t need to urgently solve AI motivation because humans will be able to coordinate to stop or delay AI development long enough to solve AI motivation at leisure?
Two reasons come to mind:
1. Stopping or delaying AI development feels more like trying to interfere with an already-running process, whereas there are no existing norms on what we use AI for that we would have to fight against, and debates on those norms are already beginning. For new things, I expect the public to be particularly risk-averse.
2. Relatedly, it is a lot easier to make norms/laws/regulations now that bind our future selves. On an individual level, it seems easier to delay your chance of going to Mars if you know you’re going to get a hovercar soon. On a societal scale, it seems easier to delay space colonization if we’re going to have lives of leisure due to automation, or to delay full automation if we’re soon going to get 4-hour workdays. Looking at the things governments and corporations say, it seems like they would be likely to do things like this. I think it makes a lot of sense to try to direct these efforts at the right target.
I want to emphasize though that my method here was having an intuition and querying for reasons behind the intuition. I would be a little surprised if someone could convince me my intuition is wrong in ~half an hour of conversation. I would not be surprised if someone could convince me that my reasons are wrong in ~half an hour of conversation.
It seems to me that coordination is really hard. Yes we have to push on that, but we also have to push on potential technical solutions because most likely coordination will fail, and there is enough uncertainty about the difficulty of technical solutions that I think we urgently need more people to investigate the problems to see how hard they really are.
I think it would help me if you suggested some ways that technical solutions could help with these problems. For example, with coordinating to prevent/delay corrupting technologies, the fundamental problem to me seems to be that with any technical solution, the thing that the AI does will be against the operator’s wishes-upon-reflection. (If your technical solution is in line with the operator’s wishes-upon-reflection, then I think you could also solve the problem by solving motivation.) This seems both hard to design (where does the AI get the information about what to do, if not from the operator’s wishes-upon-reflection?) as well as hard to implement (why would the operator use a system that’s going to do something they don’t want?).
You might argue that there are things that the operator would want if they could get them (e.g. global coordination), but they can’t achieve them now, and so we need a technical solution for that. However, it seems like we are in the same position as a well-motivated AI w.r.t. that operator. For example, if we try to cede control to FairBots that rationally cooperate with each other, a well-motivated AI could also do that.
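For context, FairBot is from the “Robust Cooperation in the Prisoner’s Dilemma” line of work: it cooperates with an opponent exactly when it can prove that the opponent cooperates with it. A heavily simplified sketch, with the proof search stubbed out:

```python
COOPERATE, DEFECT = "C", "D"

def provable(statement: str) -> bool:
    """Stub: search for a proof of `statement` in some formal system.
    The real FairBot uses provability logic (via Löb's theorem) to make
    the mutual recursion between two FairBots well-founded; this stub
    just marks where that reasoning would happen."""
    raise NotImplementedError  # placeholder, not a real proof search

def fairbot(opponent_source: str) -> str:
    # Cooperate exactly when we can prove the opponent cooperates with us.
    if provable(f"{opponent_source}(fairbot) == {COOPERATE!r}"):
        return COOPERATE
    return DEFECT
```

Two such bots playing each other provably cooperate, while a FairBot facing an unconditional defector defects, which is the sense in which ceding control to them could enforce cooperation.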
Aside from that, I think it’s also really important to better predict/understand just how difficult those problems are to solve (both socially and technically), because that understanding is highly relevant to strategic decisions we have to make today. For example, if those problems are very difficult to solve, so that in expectation we end up losing most of the potential value of the universe even if we solve AI motivation, then that greatly reduces the value of working on motivation relative to something like producing evidence of the difficulty of those problems in order to convince policymakers to try to coordinate on stopping/delaying AI progress, or trying to create a singleton AI. That’s why I was asking you for details of what you think the social solutions would look like.
Agreed. I view a lot of strategy research (e.g. from FHI and OpenAI) as figuring this out from the social side, and some of my optimism is based on conversations with those researchers. On the technical side, I feel quite stuck (for the reasons above), though I haven’t tried hard enough to say that it’s too difficult to do.
I see, in that case I would appreciate disclaimers or clearer ways of stating that, so that people who might want to work on these problems are not discouraged from doing so more strongly than you intend.
I’ll keep that in mind. When I wrote the original comment, I wasn’t even thinking about problems like the ones you mention, because I categorize them as “strategy” by default, and I was trying to talk about the technical problem.
I’m not sure I understand your proposal here. What are they agreeing to exactly? Stopping technological development at a certain level until metaphilosophy is solved?
I don’t know, I want to outsource that decision to humans + AI at the time where it is relevant. Perhaps it involves stopping technological development. Perhaps it means continuing technological development, but not doing any space colonization. My point is simply that if humans agree that metaphilosophy needs to be solved, and the AI is trying to help humans, then metaphilosophy will probably be solved, even if I don’t know how exactly it will happen.
Is it clearer now why this doesn’t seem like an easy problem to me?
Yes. It seems to me like you’re considering the case where a human has to be able to give the correct answer to any question of the form “is this action a good thing to do?” I’m claiming that we could instead grow the set of things the AI does gradually, to give time for humans to figure out what it is they want. So I was imagining that humans would answer the AI’s questions in a frame where they have a lot of risk aversion, so anything that seemed particularly impactful would require a lot of deliberation before being approved.
I’m not sure what you think the AGI would figure out, and what it would do in response to that. Are you suggesting something like, based on historical data, it would learn a classifier to predict what kind of new technologies or choices would change human values in a way that we would not like, and restrict those technologies/choices from us?
I was thinking more of the case where a single human amassed a lot of power. Humans don’t seem to have solved the problem of predicting how new technologies/choices would change human values, so that seems like quite a hard problem to solve (but perhaps AI could do it). I meant more that, conditional on the AI knowing how some new technology or choice would affect us, it seems not too hard to figure out whether we would view it as a good thing.
Do you expect that different AIs would do different things in this regard depending on how cautious their operators are?
Or is it a lower priority problem because you think most AI operators will be really cautious, so the percentage of value lost in the universe isn’t very high?
Kind of? I’d amend that slightly to say that to the extent that I think it is a problem (I’m not sure), I want to solve it in some way that is not technical research. (Possibilities: convince everyone to be cautious, obtain a decisive strategic advantage and enforce that everyone is cautious.)
What about the problem that the corrupted humans/AIs could produce a lot of negative utility even if they are small in numbers?
Same as above.
Manipulation doesn’t have to come just from unaligned AIs, it could also come from AIs that are aligned to other people. For example, if an AI is aligned to Alice, and Alice sees something to be gained by manipulating Bob, the AI being aligned won’t stop Alice from using it to manipulate Bob.
Same as above. All of these problems that you’re talking about would also apply to a technology that could make a human smarter. It seems like it would be easiest to address them on that level, rather than trying to build an AI system that can deal with these problems even though the operator would not want it to correct for them.
What about the problem of the cautious AIs being at a competitive disadvantage against other AIs who are less cautious about what they are willing to do?
This seems like an empirical fact that makes the problems listed above harder to solve.
I forgot to mention that I don’t understand this part; can you please explain more:
One reason it might not be urgent is because we need to aim for competitiveness anyway—our AI systems need to be competitive so that economic incentives don’t cause us to use unaligned variants.
So I broadly agree with Paul’s reasons for aiming for competitiveness. Given competitiveness, you might hope that we would automatically get defense against value manipulation by other AIs, since our aligned AI will defend us from value manipulation by similarly-capable unaligned AIs (or aligned AIs that other people have). Of course, defense might be a lot harder than offense, and you probably do think that, in which case this doesn’t really help us. (As I said, I haven’t really thought about this before.)
Overall view: I don’t think that the problems you’ve mentioned are obviously going to be solved as a part of AI alignment. I think that solving them will require mostly interventions on humans, not on the development of AI. I am weakly optimistic that humans will actually be able to coordinate and solve these problems as a result. If I were substantially more pessimistic, I would put more effort into strategy and governance issues. (Not sure I would change what I’m doing given my comparative advantage at technical research, but it would at least change what I advise other people do.)
Meta-view on our disagreement: I suspect that you have been talking about the problem of “making the future go well” while I’ve been talking about the problem of “getting AIs to do what we want” (which do seem like different problems to me). Most of the problems you’ve been talking about don’t even make it into the bucket of “getting AIs to do what we want” the way I think about it, so some of the claims (like “the urgent part is in the motivation subproblem”) are not meant to quantify over the problems you’re identifying. I think we do disagree on how important the problems you identify are, but not as much as you would think, since I’m quite uncertain about this area of problem-space.