A year later, I continue to agree with this post; I still think its primary argument is sound and important. I’m somewhat sad that I still think it is important; I thought this was an obvious-once-pointed-out point, but I do not think the community actually believes it yet.
I particularly agree with this sentence of Daniel’s review:
“I think the post is important, because it constrains the types of valid arguments that can be given for ‘freaking out about goal-directedness’, for lack of a better term.”
“Constraining the types of valid arguments” is exactly the right way to describe the post. Many responses to the post have been of the form “this is missing the point of EU maximization arguments”, and yes, the post is deliberately missing that point. The post is not saying that arguments for AI risk are wrong, just that they are based on intuitions and not math. While I do think that we are likely to build goal-directed agents, I do not think the VNM theorem and similar arguments support that claim: they simply describe how a goal-directed agent should think.
However, talks like AI Alignment: Why It’s Hard, and Where to Start and posts like Coherent decisions imply consistent utilities seem to claim that “VNM and similar theorems” imply “goal-directed agents”. While there has been some disagreement over whether this claim is actually present, it doesn’t really matter—readers come away with that impression. I see this post as correcting that claim; it would have been extremely useful for me to read this post a little over two years ago, and anecdotally I have heard that others have found it useful as well.
I am somewhat worried that readers who read this post in isolation will get the wrong impression, since it really was meant as part of the sequence. For example, I think Brangus’ comment post is proposing an interpretation of “goal-directedness” that I proposed and argued against in the previous post (see also my response, which mostly quotes the previous post). Similarly, I sometimes hear the counterargument that there will be economic pressures towards goal-directed AI, even though this position is compatible with the post and addressed in the next post. I’m not sure how to solve this, short of appending both the previous and next posts to this post. (Part of the problem is that different people have different responses to the post, so it’s hard to address all of them without adding a ton of words.)
If we don’t have such a clear distinction, then there’s not much that we can do, except ban AI, or ML entirely (or maybe ban AI above a certain compute threshold, or optimization threshold), which seems like a non-starter.
Idk, if humanity as a whole could have a justified 90% confidence that AI above a certain compute threshold would kill us all, I think we could ban it entirely. Like, why on earth not? It’s in everybody’s interest to do so. (Note that this is not the case with climate change, where it is in everyone’s interest for them to keep emitting while others stop emitting.)
This seems probably true even if we only had 90% confidence that there is some threshold over which AI would kill us all, without yet knowing what that threshold is. In this case I imagine something more like a direct ban on most people doing it, and some research that very carefully explores what the threshold is.
This is only any use at all if governments can easily identify tractable research programs that actually contribute to AI safety, instead of having “AI safety” as a cool tagline. I guess you imagine that that will be the case in the future? Or maybe you think that it doesn’t matter if they fund a bunch of terrible, pointless research, as long as some “real” research also gets funded?
A common way in which this is done is to get experts to help allocate funding, which seems like a reasonable way to do this, and probably better than the current mechanisms excepting Open Phil (current mechanism = how well you can convince random donors to give you money).
What? It seems like this is only possible if the technical problem is solved and known to be solved. At that point, the problem is solved.
In the world where the aligned version is not competitive, a government can unilaterally pay the price of not being competitive because it has many more resources.
Also there are other problems you might care about, like how the AI system might be used. You may not be too happy if anyone can “buy” a superintelligent AI from the company that built it: this makes arbitrary humans generally more able to impact the world, and if you have a group of not-very-aligned agents making big changes to the world and possibly fighting with each other, things will plausibly go badly at some point.
Again, if there are existing, legible standards of what’s safe and what isn’t this seems good. But without such standards I don’t know how this helps?
Telling what is / isn’t safe seems decidedly easier than making an arbitrary agent safe; it feels like we will be able to be conservative about this. But this is mostly an intuition.
I think a general response to your intuition is that I don’t see technical solutions as the only options; there are other ways we could be safe (1, 2).
We’re going to have clear, legible things that ensure safety (which might be “never build systems of this type”).
Governments are much more competent than you currently believe (I don’t know what you believe, but I probably think they are more competent than you do).
We have so little evidence / argument so far that the model uncertainty alone means we can’t conclude “it is unimportant to think about how we could use the resources of the most powerful actors in the world”.
Btw, as a meta-point, my understanding of your key claim is:
Sometimes, getting more of one necessarily means getting less of the other. Hence, the “paradox.”
My impression after reading your comment is that you’re actually saying that if you optimize for one, the other one might go down, which I certainly agree with, but for a much simpler reason: in general, if you make a change that isn’t targeted at a variable Y, then that change is roughly equally likely to increase or decrease Y.
When we describe the behavior of a system, we typically operate at varying levels of abstraction. I’m not making an argument about the fragility of the substrate that the system is on, but rather the fragility of the parts that we typically use to describe the system at an appropriate level of abstraction.
When we describe the functionality of an artificial neural network, we tend to speak about model weights and computational graphs, which do tolerate slight modifications. On the other hand, when we describe the functionality of A* search, we tend to speak about single lines of code that do stuff, which generally don’t tolerate slight modifications.
This seems to be a fact about which programming language you choose, as opposed to what algorithm you’re using. I could in theory implement A* in neural net weights by hand, and then it would be robust. Similarly, I could write out a learned neural net in Python (one line of Python for every flop in the model), and it would no longer be robust.
(I think more broadly the robustness you’re identifying involves making “small” changes where “small” is defined in terms of a “distance” defined by the programming language; I want a “distance” defined on algorithms, because that seems more relevant to talking about properties of AI. That distance should not depend on what programming language you use to implement the algorithm.)
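To make this concrete, here is a minimal sketch (the tiny network and its numbers are invented for illustration): the same “learned” function written once over weight arrays and once unrolled into one Python line per multiply-add. Slightly perturbing a weight perturbs the output slightly in either form; it is only line-level mutation of the unrolled form that is catastrophic, which is why the relevant “distance” seems like a property of the representation rather than of the algorithm.

```python
import numpy as np

# A "learned" 2-layer net, represented as weights (made-up numbers).
W1 = np.array([[0.5, -1.2], [0.3, 0.8]])
W2 = np.array([0.7, -0.4])

def net_as_weights(x):
    # Small perturbations to W1 / W2 change the output slightly: "robust".
    h = np.maximum(W1 @ x, 0.0)
    return W2 @ h

def net_unrolled(x):
    # The same computation, one line per multiply-add, as in "one line of
    # Python for every flop". Mutating or deleting any single line now
    # changes the function drastically: "brittle" under line-level edits.
    h0 = 0.5 * x[0] + -1.2 * x[1]
    h1 = 0.3 * x[0] + 0.8 * x[1]
    h0 = max(h0, 0.0)
    h1 = max(h1, 0.0)
    return 0.7 * h0 + -0.4 * h1

x = np.array([1.0, 2.0])
assert abs(net_as_weights(x) - net_unrolled(x)) < 1e-9
```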
Consider the scalar definition of robustness: how well do you perform off some training distribution? In this case, many humans are brittle, since they are not doing well according to inclusive fitness. Even within their own lives, humans don’t pursue the goals they set for themselves 10 years ago. There are a lot of ways in which humans are brittle in this sense.
I claim the neural net of your example wouldn’t be brittle in this way, since you postulated that it was trained on the actual distribution of environments it would encounter.
I’m not sure I understand the difference between a logical agent encountering a sudden, unpredictable change to its environment and a logical agent entering a regime where its operating assumptions turned out to be false.
A logical agent for solving mazes could be an agent that follows a wall. If you deploy such an agent in a maze with a loop, then it can circle forever. It seems like a type error to call this a sudden, unpredictable change—I wouldn’t really ascribe beliefs to this agent at all.
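Here is a minimal sketch of that agent (the maze, the right-hand-rule policy, and all names are invented for illustration): a deterministic wall follower dropped into a corridor that loops around an island wall revisits a (position, heading) state and circles forever, and at no point is there anything that looks like a belief being violated.

```python
# Hypothetical wall-following maze agent; the corridor loops around an
# inner "island" wall, so the agent orbits it indefinitely.
MAZE = [
    "#########",
    "#.......#",
    "#.#####.#",
    "#.#...#.#",
    "#.#####.#",
    "#.......#",
    "#########",
]
# Headings in clockwise order: up, right, down, left.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def is_open(r, c):
    return MAZE[r][c] == "."

def wall_follower(start, heading, max_steps=100):
    """Right-hand rule: prefer a right turn, then straight, then left, then reverse."""
    r, c = start
    seen = set()
    for step in range(max_steps):
        state = (r, c, heading)
        if state in seen:
            return f"cycle detected after {step} steps at {state}"
        seen.add(state)
        for turn in (1, 0, 3, 2):  # right, straight, left, reverse
            h = (heading + turn) % 4
            dr, dc = DIRS[h]
            if is_open(r + dr, c + dc):
                r, c, heading = r + dr, c + dc, h
                break
    return "gave up"

# Start in the loop corridor: the agent revisits a state and cycles forever.
print(wall_follower(start=(1, 1), heading=1))
```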
I don’t understand this post. Some confusions:
Given a hard-coded agent that explicitly computed the consequences of its actions, and then took the action which maximized expected value according to its utility function, we would observe precisely the opposite behavior. Mutate a single line of code and the functionality of this agent would almost certainly be broken. However, the agent is still robust in the alignment sense, as its values will never drift.
Isn’t this true of every computer program? This sounds like an argument that AI can never be robust in functionality, which seems to prove too much. (If you actually mean this, I think your use of the word “robust” has diverged from the property I care about.)
Logical systems can be robust because their behavior is very predictable off-distribution, but they are not always robust in the sense of being able to adapt to sudden, unpredictable changes.
I would like to see an example of being unable to adapt to sudden, unpredictable changes; that doesn’t match my explanation of why logical systems fail: I would say that they make assumptions that turn out to be false, with a particularly common case being the failure of the designer to consider some particular edge case, and there are enough edge cases that these failures become too common.
they must look something like a mesa optimizer. Thus, if the current learning paradigm is to create general intelligence, we will necessarily encounter problems endemic to the “old” type of AI.
In the previous paragraph, I thought you argued that logical / “old” AI has robust specification but not robust functionality. But the worry with mesa optimizers is the exact opposite; that they will have robust functionality but not robust specification.
We might expect it to perform extremely well and be quite robust to changes in its environment, but be brittle in the sense of having no explicit principles which underlie its pursuit of goals.
What does “brittle” mean here? You can say the same of humans; are humans “brittle”?
What? I feel like I must be misunderstanding, because it seems like there are broad categories of things that governments can do that are helpful, even if you’re only worried about the risk of an AI optimizing against you. I guess I’ll just list some, and you can tell me why none of these work:
Funding safety research
Building aligned AIs themselves
Creating laws that prevent races to the bottom between companies (e.g. “no AI with >X compute may be deployed without first conducting a comprehensive review of the chance of the AI adversarially optimizing against humanity”)
Monitoring AI systems (e.g. “we will create a board of AI investigators; everyone making powerful AI systems must be evaluated once a year”)
I don’t think there’s a concrete plan that I would want a government to start on today, but I’d be surprised if there weren’t such plans in the future when we know more (both from more research, and from the AI risk problem becoming clearer).
You can also look at the papers under the category “AI strategy and policy” in the Alignment Newsletter database.
I think this formulation of goal-directedness is pretty similar to one I suggested in the post before the coherence arguments post (Intuitions about goal-directed behavior, section “Our understanding of the behavior”). I do think this is an important concept to explain our conception of goal-directedness, but I don’t think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops). Should they worry that their laptops are going to take over the world?
For a deeper response, I’d recommend Intuitions about goal-directed behavior. I’ll quote some of the relevant parts here:
There is a general pattern in which as soon as we understand something, it becomes something lesser. As soon as we understand rainbows, they are relegated to the “dull catalogue of common things”. This suggests a somewhat cynical explanation of our concept of “intelligence”: an agent is considered intelligent if we do not know how to achieve the outcomes it does using the resources that it has (in which case our best model for that agent may be that it is pursuing some goal, reflecting our tendency to anthropomorphize). That is, our evaluation about intelligence is a statement about our epistemic state.
[… four examples …]
To the extent that the Misspecified Goal argument relies on this intuition, the argument feels a lot weaker to me. If the Misspecified Goal argument rested entirely upon this intuition, then it would be asserting that because we are ignorant about what an intelligent agent would do, we should assume that it is optimizing a goal, which means that it is going to accumulate power and resources and lead to catastrophe. In other words, it is arguing that assuming that an agent is intelligent definitionally means that it will accumulate power and resources. This seems clearly wrong; it is possible in principle to have an intelligent agent that nonetheless does not accumulate power and resources.
Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go off of than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.
See also the summary of that post:
“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications for the validity of the Misspecified Goal argument for AI risk.
I pretty strongly agree with this review (and jtbc it was written without any input from me, even though Daniel and I are both at CHAI).
I think of ‘coherence arguments’ as including things like ‘it’s not possible for you to agree to give me a limitless number of dollars in return for nothing’, which does imply some degree of ‘goal-direction’.
Yeah, maybe I should say “coherence theorems” to be clearer about this? (Like, it isn’t a theorem that I shouldn’t give you a limitless number of dollars in return for nothing; maybe I think that you are more capable than me and fully aligned with me, and so you’d do a better job with my money. Or maybe I value your happiness, and the best way to purchase it is to give you money no strings attached.)
Responses from outside this camp
Fwiw, I do in fact worry about goal-directedness, but (I think) I know what you mean. (For others, I think Daniel is referring to something like “the MIRI camp”, though that is also not an accurate pointer, and it is true that I am outside that camp.)
My responses to the questions:
The ones in Will humans build goal-directed agents?, but if you want arguments that aren’t about humans, then I don’t know.
Depends on the distribution over utility functions, the action space, etc., but e.g. if it uniformly selects a numeric reward value for each possible trajectory (state-action sequence) where the actions are low-level (e.g. human muscle control), astronomically low. (See the back-of-the-envelope illustration after these answers.)
That will probably be a good model for some (many?) powerful AI systems that humans build.
I don’t know. (I think it depends quite strongly on the way in which we train powerful AI systems.)
Not likely at low levels of intelligence, plausible at higher levels of intelligence, but really the question is not specified enough.
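As a back-of-the-envelope illustration of the “astronomically low” answer above (the formalization is my own, made up for this estimate): with $|A|$ low-level actions available per timestep and horizon $T$, there are $|A|^T$ trajectories. If each trajectory’s reward is drawn i.i.d., the reward-maximizing trajectory is uniformly distributed over all of them, so the chance that it lands in any particular set $S$ of goal-directed-looking trajectories is

$$\Pr[\tau^* \in S] = \frac{|S|}{|A|^T},$$

and even very conservative numbers like $|A| = 10$ and $T = 10^4$ put the denominator at $10^{10^4}$.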
It goes under many names, such as transfer learning, robustness to distributional shift / data shift, and out-of-distribution generalization. Each one has (to me) slightly different connotations, e.g. transfer learning suggests that the researcher has a clear idea of the distinction between the first and second setting (and so you “transfer” from the first to the second), whereas if in RL you change which part of the state space you’re in as you act, I would be more likely to call that distributional shift rather than transfer learning.
Planned newsletter summary:
This post points out that verification and transparency have similar goals. Transparency produces an artefact that allows the user to answer questions about the system under investigation (e.g. “why did the neural net predict that this was a tennis ball?”). Verification, on the other hand, allows the user to pose a question, and then automatically answers that question (e.g. “is there an adversarial example for this image?”).
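A minimal sketch of the contrast for a toy linear classifier (everything here is invented for illustration; real verification tools handle much richer properties): transparency produces an artefact, here the per-feature contributions, that the user inspects to answer “why this prediction?”, while verification takes a formal question, here “is there an adversarial example within epsilon?”, and answers it automatically, which for a linear model reduces to a closed-form check.

```python
import numpy as np

# Toy linear classifier: predict positive iff w . x + b > 0.
w = np.array([2.0, -1.0, 0.5])
b = -0.3
x = np.array([1.0, 0.2, 0.4])

# Transparency: produce an artefact (per-feature contributions) that lets
# the user answer questions like "why was this predicted positive?".
contributions = w * x
for i, c in enumerate(contributions):
    print(f"feature {i}: contributes {c:+.2f} to the logit")

# Verification: pose a question -- "does any x' with ||x' - x||_inf <= eps
# flip the prediction?" -- and answer it automatically. For a linear model
# the worst-case logit change is exactly eps * ||w||_1.
eps = 0.1
logit = w @ x + b
robust = abs(logit) > eps * np.abs(w).sum()
print(f"logit={logit:.2f}; verified robust within eps={eps}: {robust}")
```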
Crystallized my view of what the “core problem” is (as I explained in a comment on this post). I think I had intuitions of this form before, but at the very least this post clarified them.
Updated me quite strongly towards continuous takeoff (from a position of ignorance). (I would also nominate the AI Impacts post, but I don’t think it ever got cross-posted.)
Tbc, I wasn’t modeling Eliezer / Nate / MIRI as saying “there will be powerful adversarial optimization, and so we need security”—it is in fact quite clear that we’re all aiming for “no powerful adversarial optimization in the first place”. I was responding to the arguments in this post.
(and if he suspected this was easy, he wouldn’t want to build in the assumption that it’s easy as a prerequisite for a safety approach working; he’d want a safety approach that’s robust to the scenario where it’s easy to accidentally end up with vastly more quality-adjusted optimization than intended).
I agree that’s desirable all else equal, but such an approach would likely require more time and effort (a lot more time and effort on my model). It’s an empirical question whether we have that time / effort to spare (and also perhaps it’s better to get AGI earlier to e.g. reduce other x-risks, in which case the marginal safety from not relying on the assumption may not be worth it).
(I mentioned coordination on not building AGI above—I think that might be feasible if the “global epistemic state” was that building AGI is likely to kill us all, but seems quite infeasible if our epistemic state is “everything we know suggests this will work, but it could fail if we somehow end up with more optimization”.)
Note that on my model, the kind of paranoia Eliezer is pointing to with “AI safety mindset” or security mindset is something he believes you need in order to prevent adversarialness and the other bad byproducts of “your system devotes large amounts of thought to things and thinks in really weird ways”.
The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through a very large search space, and which can produce weird contexts that force the code down unexpected paths. This is because the weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized. Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.
It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.
The problem comes from “lots of weird, extreme-state-instantiating, loophole-finding optimization”
Thanks, this is helpful for understanding MIRI’s position better. (I probably should have figured it out from Nate and Scott’s posts, but I don’t think I actually did.)
Broadly, my hope is that we actually see non-existentially-catastrophic failures caused by AIs going down “the paths you weren’t visualizing”, and this causes you to start visualizing the path. Obviously all else equal it’s better if we visualize it in the first place.
I think I also have a different picture of what powerful optimization will look like—the paperclip maximizer doesn’t seem like a good model for the sort of thing we’re likely to build. An approach based on some explicitly represented goal is going to be dead in the water well before it becomes even human-level intelligent, because it will ignore “common sense rules” again and again (cf. most specification gaming examples). Instead, our AI systems are going to need to understand common sense rules somehow, and the resulting system is not going to look like it’s ruthlessly pursuing some simple goal.
For example, the resulting system may be more accurately modeled as having uncertainty about the goal (whether explicitly represented or internally learned). Weird + extreme states tend to only be good for a few goals, and so would not be good choices if you’re uncertain about the goal. In addition, if our AI systems are learning from our conventions, then they will likely pick up our risk aversion, which also tends to prevent weird + extreme states.
Finally, it seems like there’s a broad basin of corrigibility, that prevents weird + extreme states that humans would rate as bad. It’s not hard to figure out that humans don’t want to die, so any weird + extreme state you create has to respect that constraint. And there are many other such easy-to-learn constraints.
Recently, I suggested the following broad model: The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won’t work, or will have unintended side effects. Under this model, relative to doing nothing, it is net positive to improve our understanding of AI systems, e.g. via transparency tools, even if it means we build powerful AI systems sooner (which reduces the time we have to solve alignment).
This post presents a counterargument: while understanding helps us make _useful_ systems, it need not help us build _secure_ systems. We need security because that is the only way to get useful systems in the presence of powerful external optimization, and the whole point of AGI is to build systems that are more powerful optimizers than we are. If you take an already-useful AI system, and you “make it more powerful”, this increases the intelligence of both the useful parts and the adversarial parts. At this point, the main failure mode is that the adversarial parts “win”: you now have to be robust against adversaries, which is a security property, not a usefulness property.
Under this model, transparency work need not be helpful: if the transparency tools allow you to detect some kinds of bad cognition but not others, an adversary simply makes sure that all of its adversarial cognition is the kind you can’t detect. Rohin’s note: Or, if you use your transparency tools during training, you are selecting for models whose adversarial cognition is the kind you can’t detect. Then, transparency tools could increase understanding and shorten the time to powerful AI systems, _without_ improving security.
I certainly agree that in the presence of powerful adversarial optimizers, you need security to get your system to do what you want. However, we can just not build powerful adversarial optimizers. My preferred solution is to make sure our AI systems are trying to do what we want, so that they never become adversarial in the first place. But if for some reason we can’t do that, then we could make sure AI systems don’t become too powerful, or not build them at all. It seems very weird to instead say “well, the AI system is going to be adversarial and way more powerful, let’s figure out how to make it secure”—that should be the last approach, if none of the other approaches work out. (More details in this comment.) Note that MIRI doesn’t aim for security because they expect powerful adversarial optimization—they aim for security because _any_ optimization <@leads to extreme outcomes@>(@Optimization Amplifies@). (More details in this comment.)
(If you want to comment on my opinion, please do so as a reply to the other comment I made.)
ETA: Added a sentence about MIRI’s beliefs to the opinion.
they expect the field to engage with this problem and do it well.
It’s a little more disjunctive:
Maybe the problem is too difficult. Then we could coordinate to avoid the problem. This could mean not building powerful AI systems at all, limiting the types of AI systems we build, etc.
Maybe it’s not actually a problem. Humans seem kinda sorta goal-directed, and frequently face existential angst over what the meaning of life is. Maybe powerful AI systems will similarly be very capable but not have explicit goals.
Maybe there isn’t differentially powerful optimization. We build somewhat-smarter-than-human AI systems, and these AI systems enable us to become more capable ourselves (just as the Internet has made us more capable), and our capabilities increase alongside the AI’s capabilities, and there’s never any optimization that’s way more powerful than us.
Maybe there isn’t powerful adversarial optimization. We figure out how to solve the alignment problem for small increases in intelligence; we use this to design the first smarter-than-human systems, they use it to design their successors, etc. to arbitrary levels of capabilities.
Maybe hacks are enough. Every time we notice a problem, we just apply a “band-aid” fix, nothing that seems principled. But this turns out to be enough. (I don’t like this argument, because it’s unclear what a “band-aid” fix is—for some definitions of “band-aid” I’d feel confident that “band-aids” would not be enough. But there’s something along these lines.)
Yann is implicitly taking the stance that there will not be powerful adversarial pressures exploiting such unforeseen differences in the objective function and humanity’s values.
Maybe, I’m not sure. Regardless of the underlying explanation, if everyone sounded like Yann I’d be more worried. (ETA: Well, really I’d spend a bunch of time evaluating the argument more deeply, and form a new opinion, but assuming I did that and found the argument unconvincing, then I’d be more worried.)
As you take the system and make it vastly superintelligent, your primary focus needs to be on security from adversarial forces, rather than primarily on making something that’s useful.
I agree if you assume a discrete action that simply causes the system to become vastly superintelligent. But we can try not to get to powerful adversarial optimization in the first place; if that never happens then you never need the security. (As a recent example, relaxed adversarial training takes advantage of this fact.) In the previous list, bullet points 3 and 4 are explicitly about avoiding powerful adversarial optimization, and bullet points 1 and 2 are about noticing whether or not we have to worry about powerful adversarial optimization and dealing with it if so. (Meta: Can we get good numbered lists? If we have them, how do I make them?)
Given how difficult security is, it seems better to aim for one of those scenarios. In practice, I do think any plan that involves building powerful AI systems will require some amount of security-like thought—for example, if you’re hoping to detect adversarial optimization to stop it from arising, you need a lot of confidence in the detector. But there isn’t literally strong adversarial optimization working against the detector—it’s more that if there’s a “random” failure, that turns into adversarial optimization, and so it becomes hard to correct the failure. So it seems more accurate to say that we need very low rates of failure—but in the absence of adversarial optimization.
(Btw, this entire comment is predicated on continuous takeoff; if I were convinced of discontinuous takeoff I’d expect my beliefs to change radically, and to be much less optimistic.)
Fwiw, I talk about utility uncertainty because that’s what the mechanical change to the code looks like—instead of having a known reward function, you have a distribution over reward functions. It’s certainly true that this only makes a difference as long as there is still information to be gained about the utility function.
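The mechanical change I have in mind looks something like this sketch (a toy made up for this comment, not any particular published algorithm): replace a known reward function with a posterior over candidate reward functions and act on the expectation. Once the posterior has collapsed there is no information left to gain, and the agent behaves exactly like a known-reward agent again.

```python
import numpy as np

ACTIONS = ["make_paperclips", "do_nothing"]

# Known-reward agent: a single reward vector over actions.
known_reward = np.array([1.0, 0.1])

def act_known():
    return ACTIONS[int(np.argmax(known_reward))]

# Uncertain-reward agent: a distribution over candidate reward functions.
# (The candidates and their probabilities are made up for illustration.)
candidate_rewards = np.array([
    [1.0, 0.1],    # hypothesis: humans want paperclips
    [-10.0, 0.1],  # hypothesis: humans really don't
])
posterior = np.array([0.6, 0.4])

def act_uncertain():
    expected = posterior @ candidate_rewards  # expected reward per action
    return ACTIONS[int(np.argmax(expected))]

print(act_known())      # make_paperclips
print(act_uncertain())  # expected rewards [-3.4, 0.1] -> do_nothing

# Once the posterior collapses, there is no information left to gain and
# the uncertain agent behaves exactly like a known-reward agent.
posterior = np.array([1.0, 0.0])
print(act_uncertain())  # make_paperclips
```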
the discussion was whether those agents will be broadly goal-directed at all, a weaker condition than being a utility maximizer
Uh, that chapter was claiming that “being a utility maximizer” is vacuous, and therefore “goal-directed” is a stronger condition than “being a utility maximizer”.
Tbc, in the grandparent I was responding to the specific sentence I quoted, which seems to me to be making a bold claim that I think is false. It’s of course possible that the correct action is still “leave academia”, but for a different reason, like the one you gave.
it should not be ‘joined’ but instead ‘quit, and get to work on an alternative’.
That depends pretty strongly on what the alternative is. Suppose your goal is more investigation of speculative ideas that may or may not pan out, so that humanity figures out true and useful things about the world. It’s not clear to me that you can do significantly better than current academia, even if you assume that everyone would switch from academia to your new institution.
And of course, people in academia are selected for being good at academic jobs, and may not be good at building institutions. Or they may hate all the politicking that would be required for an alternative. Or they might not particularly care about impact, and just want to do research because it’s fun. All of which are reasons you might join academia rather than quit and work on an alternative, and it’s “morally fine”.