Three Approaches to “Friendliness”
I put “Friendliness” in quotes in the title, because I think what we really want, and what MIRI seems to be working towards, is closer to “optimality”: create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use “Friendly AI” to denote such an AI since that’s the established convention.
I’ve often stated my objections to MIRI’s plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But my objection is not, as some critics of MIRI’s FAI work have suggested, that we can’t foresee what problems need to be solved. Rather, it’s that we can largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for “trial and error”, or both.
When people say they don’t know what problems need to be solved, they may be mostly talking about “AI safety” rather than “Friendly AI”. If you think in terms of “AI safety” (i.e., making sure some particular AI doesn’t cause a disaster) then that does look like a problem that depends on what kind of AI people will build. “Friendly AI”, on the other hand, is really a very different problem, where we’re trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I’m not sure. I’m hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what’s causing it.
The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.
For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don’t see any way to be confident of this ahead of time.) I’ll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.
Normative AI—Solve all of the philosophical problems ahead of time, and code the solutions into the AI.
Black-Box Metaphilosophical AI—Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what “doing philosophy” actually is.
White-Box Metaphilosophical AI—Understand the nature of philosophy well enough to specify “doing philosophy” as an algorithm and code it into the AI.
The problem with Normative AI, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?
Black-Box Metaphilosophical AI is also risky, because it’s hard to test/debug something that you don’t understand. Besides that general concern, designs in this category (such as Paul Christiano’s take on indirect normativity) seem to require that the AI achieve superhuman levels of optimizing power before being able to solve its philosophical problems, which seems to mean that a) there’s no way to test them in a safe manner, and b) it’s unclear why such an AI won’t cause disaster in the time period before it achieves philosophical competence.
White-Box Metaphilosophical AI may be the most promising approach. There is no strong empirical evidence that solving metaphilosophy is superhumanly difficult, simply because not many people have attempted to solve it. But I don’t think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much hope for optimism either.
To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can’t be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren’t that hard, relative to his abilities), but I’m not sure why some others say that we don’t yet know what the problems will be.
The difficulty is still largely due to the security problem. Without catastrophic risks (including UFAI and value drift), we could take as much time as necessary and/or go with making people smarter first.
The aspect of FAI that is supposed to solve the security problem is optimization power aimed at correct goals. Optimization power addresses the “external” threats (and ensures progress), and correctness of goals represents “internal” safety. If an AI has sufficient optimization power, the (external) security problem is taken care of, even if the goals are given by a complicated definition that the AI is unable to evaluate at the beginning: it’ll protect the original definition even without knowing what it evaluates to, and aim to evaluate it (for instrumental reasons).
This suggests that a minimal solution is to pack all the remaining difficulties in AI’s goal definition, at which point the only object level problems are to figure out what a sufficiently general notion of “goal” is (decision theory; the aim of this part is to give the goal definition sufficient expressive power, to avoid constraining its decisions while extracting the optimization part), how to build an AI that follows a goal definition and is at least competitive in its optimization power, and how to compose the goal definition. The simplest idea for the goal definition seems to be some kind of WBE-containing program, so learning to engineer stable WBE superorganisms might be relevant for this part (UFAI and value drift will remain a problem, but might be easier to manage in this setting).
(It might also be good to figure out how to pack a reference to the state of the Earth at a recent point in time into the goal definition, so that the AI has an instrumental drive to capture the Earth’s state while it still doesn’t understand its goals and would probably otherwise use the Earth itself for something else; this might then also lift the requirement of having WBE tech in order to construct the goal definition.)
You appear to be operating under the assumption that it’s already too late or otherwise impractical to “go with making people smarter first”, but I don’t see why, compared to “build FAI first”.
Human cloning and embryo selection look like parallelizable problems that would be easily amenable to the approach of “throwing resources at them”. They consist of a bunch of basic science and engineering problems, which humans are generally pretty good at, compared to the kind of philosophical problems that need to be solved to build an FAI. Nor do we have to get all those problems right on the first try or face existential disaster. Nor is intelligence enhancement known to be strictly harder than building UFAI (unlike FAI, which is, since solving FAI requires solving AGI as a subproblem). And there must be many other research directions that could be funded in addition to these two. All it would take is for some government, or maybe even a large corporation or charitable organization, to take the problem of “astronomical waste” seriously (again referring to the more general concept than Bostrom’s, which I wish had its own established name).
If it’s not already too late or impractical to make people smarter first (and nobody has made a case that it is, as far as I know) then FAI work has the counterproductive consequence of making it harder to make people smarter first (by shortening AI timelines). MIRI and other FAI advocates do not seem to have taken this into account adequately.
My point was that when we expand on “black box metaphilosophical AI”, it seems to become much less mysterious than the whole problem, we only need to solve decision theory and powerful optimization and maybe (wait for) WBE. If we can pack a morality/philosophy research team into the goal definition, the solution of the friendliness part can be deferred almost completely to after the current risks are eliminated, at which point the team will have a large amount of time to solve it.
(I agree that building smarter humans is a potentially workable point of intervention. This needs a champion to at least outline the argument, but actually making this happen will be much harder.)
I think I understand the basic motivation for pursuing this approach, but what’s your response to the point I made in the post, that such an AI has to achieve superhuman levels of optimizing power, in order to acquire enough computing power to run the WBE, before it can start producing philosophical solutions, and therefore there’s no way for us to safely test it to make sure that the “black box” would produce sane answers as implemented? It’s hard for me to see how we can get something this complicated right on the first try.
The black box is made of humans and might be tested the usual way when (human-designed) WBE tech is developed. The problem of designing its (long term) social organisation might also be deferred to the box. The point of the box is that it can be made safe from external catastrophic risks, not that it represents any new progress towards FAI.
The AI doesn’t produce philosophical answers, the box does, and the box doesn’t contain novel/dangerous things like AIs. This only requires solving the separate problems of having AI care about evaluating a program, and preparing a program that contains people who would solve the remaining problems (and this part doesn’t involve AI). The AI is something that can potentially be theoretically completely understood and it can be very carefully tested under controlled conditions, to see that it does evaluate simpler black boxes that we also understand. Getting decision theory wrong seems like a more elusive risk.
Ok, I think I misunderstood you earlier, and thought that your idea was similar to Paul Christiano’s, where the FAI would essentially develop the WBE tech instead of us. I had also suggested waiting for WBE tech before building FAI (although due to a somewhat different motivation), and in response someone (maybe Carl Shulman?) argued that brain-inspired AGI or low-fidelity brain emulations would likely be developed before high-fidelity brain emulations, which means the FAI would probably come too late if it waited for WBE. This seems fairly convincing to me.
Waiting for WBE is risky in many ways, but I don’t see a potentially realistic plan that doesn’t go through it, even if we have (somewhat) smarter humans. This path (and many variations, such as a WBE superorg just taking over “manually” and not leaving anyone else with access to the physical world) I can vaguely see working, solving the security/coordination problem, if all goes right; other paths seem much more speculative to me (but many are worth trying, given resources; if somehow possible to do reliably, AI-initiated WBE when there is no human-developed WBE would be safer).
It seems fairly clear to me that we will have the ability to “build smarter humans” within the next few decades, just because there are so many different possible research paths that could get us to that goal, all of which look promising.
There’s starting to be some good research done right now on which genes correlate with intelligence. It looks like a very complicated subject, with thousands of genes contributing; nonetheless, that would be enough to make it possible to do pre-implantation genetic screening to select “smarter” babies with current-day technology, and it doesn’t put us that far from actually genetically engineering fertilized eggs before implantation, or possibly even doing genetic therapies on adults (although, of course, that’s inherently dodgier, and is likely to have a smaller effect).
Other likely paths to IA include:
-We’re making a lot of progress on brain-computer interfaces right now, of all types.
-Brain stimulation also seems to have a lot of potential; it was already shown to improve people’s ability to learn math in school in published research.
-Nootropic drugs may also have some potential, although we aren’t really throwing a lot of research in that direction right now. It is worth mentioning, though, that one possible outcome of that research on genes correlated with intelligence might be to figure out what proteins those genes code for and find drugs that have the same effect.
-Looking at the more cybernetic side, a scientist has recently managed to create an implantable chip that could connect with the brain of a rat and both store memories and give them back to the rat directly, basically an artificial hippocampus. http://www.technologyreview.com/featuredstory/513681/memory-implants/
-The sudden focus on brain research and modeling in the US and the UK is also likely to have significant impacts.
-There are other, more futuristic possible technologies here as well (nanotech, computer exocortex, etc.). Not as likely to happen in the time frame we’re talking about, though.
Anyway, unless GAI comes much sooner than I expect it to, I would expect that some of the things on that list are likely to happen before GAI. Many of them we’re already quite close to, and there are enough different paths to get to enhanced human intelligence that I put a low probability on all of them being dead ends. I think there’s a very good chance that we’ll develop some kind of way to increase human intelligence first, before any kind of true GAI becomes possible, especially if we put more effort into research in that direction.
The real question, I think, is how much of an intelligence boost any of that is going to give us, and whether that will be enough to make the FAI problems easier to solve, and I’m not sure if that’s answerable at this point.
Astronomical waste is a very specific concept arising from a total utilitarian theory of ethics. That this is “what we really want” seems highly unobvious to me; as someone who leans towards negative utilitarianism, I would personally reject it.
Doesn’t negative utilitarianism present us with the analogous challenge of preventing “astronomical suffering”, which requires an FAI to have solutions to the same philosophical problems mentioned in the post? I guess I was using “astronomical waste” as short for “potentially large amounts of negative value compared to what’s optimal” but if it’s too much associated with total utilitarianism then I’m open to suggestions for a more general term.
I’d be happy with an AI that makes people on Earth better off without eating the rest of the universe, and gives us the option to eat the universe later if we want to...
If the AI doesn’t take over the universe first, how will it prevent Malthusian uploads, burning of the cosmic commons, private hell simulations, and such?
Those things you want to prevent are all caused by humans, so the AI on Earth can directly prevent them. The rest of the universe is only relevant if you think that there are other optimizers out there, or if you want to use it, probably because you are a total utilitarian. But even the small chance of another optimizer suggests that anyone would eat the universe.
Cousin_it said “and gives us the option to eat the universe later if we want to...” which I take to mean that the AI would not stop humans from colonizing the universe on their own, which would bring the problems that I mentioned.
On second thought, I agree with Douglas_Knight’s answer. It’s important for the AI to stop people from doing bad things with the universe, but for that the AI just needs to have power over people, not over the whole universe. And since I know about the risks from alien AIs and still don’t want to take over the universe, maybe the CEV of all people won’t want that either. It depends on how many people think population growth is good, and how many people think it’s better to leave most of the universe untouched, and how strongly people believe in these and other related ideas, and which of them will be marked as “wrong” by the AI.
I find your desire to leave the universe “untouched” puzzling. Are you saying that you have a terminal goal to prevent most of the universe from being influenced by human actions, or is it an instrumental value of some sort (for example you want to know what would happen if the universe is allowed to develop naturally)?
Well, it’s not a very strong desire, I suspect that many other people have much stronger “naturalistic” urges than me. But since you ask, I’ll try to introspect anyway:
Curiosity doesn’t seem to be the reason, because I want to leave the universe untouched even after I die. It feels more like altruism. Some time ago Eliezer wrote about the desire not to be optimized too hard by an outside agent. If I can desire that for myself, then I can also desire it for aliens, give them a chance to not be optimized by us… Of course if there are aliens, we might need to defend ourselves. But something in me doesn’t like the idea of taking over the universe in preemptive self-defense. I’d prefer to find some other way to stay safe...
Sorry if this sounds confusing, I’m confused about it too.
That helps me to understand your position, but it seems unlikely that enough people would desire it strongly enough for CEV to conclude we should give up colonizing the universe altogether. Perhaps some sort of compromise would be reached, for example the FAI would colonize the universe but bypass any solar systems that contain or may evolve intelligent life. Would that be sufficient to satisfy (or mostly satisfy) your desire to not optimize aliens?
Would you like it if aliens colonized the whole universe except our system, or would you prefer if they cared about our wishes and didn’t put us in that situation?
Ok, but are we optimising the expected case or the worst case? If the former, then the probability of those things happening with no special steps against them is relevant. To take the easiest example: would postponing the “take over the universe” step for 300 years make a big difference in the expected amount of cosmic commons burned before takeover?
Depends. Would this allow someone else to move outside its defined sphere of influence and build an AI that doesn’t wait?
If the AI isn’t taking over the universe, that might leave the option open that something else will. If it doesn’t control humanity, chances are that will be another human-originated AI. If it does control humanity, why are we waiting?
For that, it’s sufficient to only take over the Earth and keep an eye on its Malthusian uploads and private hell simulations (but restriction to the Earth seems pointless and hard to implement).
Yes, you could probably broaden the concept to cover negative utilitarianism as well, though Bostrom’s original article specifically defined astronomical waste as being
That said, even if you did redefine the concept in the way that you mentioned, the term “astronomical waste” still implies an emphasis on taking over the universe—which is compatible with negative utilitarianism, but not necessarily every ethical theory. I would suspect that most people’s “folk morality” would say something like “it’s important to fix our current problems, but expanding into space is morally relevant only as far as it affects the primary issues” (with different people differing on what counts as a “primary issue”).
I’m not sure whether you intended the emphasis on space expansion to be there, but if it was incidental, maybe you rather meant something like
?
(I hope to also post a more substantial comment soon, but I need to think about your post a bit more first.)
This is getting a bit far from the original topic, but my personal approach to handling moral uncertainty is inspired by Bostrom and Ord, and works by giving each moral faction a share of my resources and letting them trade to make Pareto improvements. So the “unbounded utility” faction in me was responsible for writing the OP (using its share of my time), and the intended audience is the “unbounded utility” factions in others. That’s why it seems to be assuming unbounded utility and has an emphasis on space expansion, even though I’m far from certain that it represents “correct morality” or my “actual values”.
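To make the trading mechanism a bit more concrete, here is a minimal toy sketch (the factions, the linear utilities, and the particular trade are all hypothetical, chosen only to illustrate what a Pareto-improving reallocation of a resource budget looks like):

```python
# Toy sketch of "give each moral faction a share of resources and let them
# trade to make Pareto improvements". Everything here is illustrative, not
# a claim about how the actual factions or their values look.
import numpy as np

# Two factions each start with half of two resources: [writing_time, leisure_time].
alloc = {"unbounded_utility": np.array([0.5, 0.5]),
         "bounded_utility":   np.array([0.5, 0.5])}

# Linear utilities over the resources each faction holds: the unbounded-utility
# faction mostly values writing time, the other faction mostly values leisure.
weights = {"unbounded_utility": np.array([0.9, 0.1]),
           "bounded_utility":   np.array([0.2, 0.8])}

def utility(faction):
    return float(weights[faction] @ alloc[faction])

before = {f: utility(f) for f in alloc}

# Proposed trade: the unbounded faction swaps 0.3 of its leisure share
# for 0.3 of the other faction's writing share.
alloc["unbounded_utility"] += np.array([+0.3, -0.3])
alloc["bounded_utility"]   += np.array([-0.3, +0.3])

after = {f: utility(f) for f in alloc}
assert all(after[f] >= before[f] for f in alloc)  # the trade is a Pareto improvement
print(before, after)  # each faction's utility rises from 0.5 (to 0.74 and 0.68)
```

In practice the “resources” are things like my time and attention and the trades are informal, but the constraint is the same: no faction ends up worse off by its own lights.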
Of course, this is still just a proxy measure… say that we’re “in a simulation”, or that there are already superintelligences in our environment who won’t let us eat the stars, or something like that—we still want to get as good a bargaining position as we possibly can, or to coordinate with the watchers as well as we possibly can, or in a more fundamental sense we want to not waste any of our potential, which I think is the real driving intuition here. (Further clarifying and expanding on that intuition might be very valuable, both for polemical reasons and for organizing some thoughts on AI strategy.) I cynically suspect that the stars aren’t out there for us to eat, but that we can still gain a lot of leverage over the acausal fanfic-writing commun… er, superintelligence-centered economy/ecology, and so, optimizing the hell out of the AGI that might become an important bargaining piece and/or plot point is still the most important thing for humans to do.
The thing I’ve seen that looks closest to white-box metaphilosophical AI in the existing literature is Eliezer’s causal validity semantics, or more precisely the set of intuitions Eliezer was drawing on to come up with the idea of causal validity semantics. I would recommend reading the section Story of a Blob and the sections on causal validity semantics in Creating Friendly AI. Note that philosophical intuitions are a fuzzily bordered subset of justification-bearing (i.e. both moral/values-like and epistemic) causes that are theoretically formally identifiable and are traditionally thought of as having a coherent, lawful structure.
It seems that we have more morally important potential in some possible worlds than others, and although we don’t want our language to commit us to the view that we only have morally important potential in possible worlds where we can prevent astronomical waste, neither do we want to suggest (as I think “not waste any of our potential” does) the view that we have the same morally important potential everywhere and that we should just minimize the expected fraction of our potential that is wasted. A more neutral way of framing things could be “minimize wasted potential, especially if the potential is astronomical”, leaving the strength of the “especially” to be specified by theories of how much one can affect the world from base reality vs simulations and zoos, theories of how to deal with moral uncertainty, and so on.
I completely understand your intuition but don’t entirely agree; this comment might seem like quibbling: Having access to astronomical resources is one way to have a huge good impact, but I’m not sure we know enough about moral philosophy or even about what an acausal economy/ecology might look like to be sure that the difference between a non-astronomical possible world and an astronomical possible world is a huge difference. (For what it’s worth, my primary intuition here is “the multiverse is more good-decision-theory-limited/insight-limited than resource-limited”. I’d like to expand on this in a blog post or something later.) Obviously we should provisionally assume that the difference is huge, but I can see non-fuzzy lines of reasoning that suggest that the difference might not be much.
Because we might be wrong about the relative utility of non-astronomical possible worlds it seems like when describing our fundamental driving motivations we should choose language that is as agnostic as possible, in order to have a strong conceptual foundation that isn’t too contingent on our provisional best guess models. E.g., take the principle of decision theory that says we should focus more on worlds that plausibly seem much larger even if it might be less probable that we’re in those worlds: the underlying, non-conclusion-contingent reasons that drive us to take considerations and perspectives such as that one into account are the things we should be putting effort into explaining to others and making clear to ourselves.
Agreed. I was being lazy and using “astronomical waste” as a pointer to this more general concept, probably because I was primed by people talking about “astronomical waste” a bunch recently.
Also agreed, but I currently don’t have much to add to what’s already been said on this topic.
Ugh, I found CFAI largely impenetrable when I first read it, and have the same reaction reading it now. Can you try translating the section into “modern” LW language?
CFAI is deprecated for a reason, I can’t read it either.
So after giving this issue some thought: I’m not sure to what extent a white-box metaphilosophical AI will actually be possible.
For instance, consider the Repugnant Conclusion. Derek Parfit considered some dilemmas in population ethics, put together possible solutions to them, and then noted that the solutions led to an outcome which again seemed unacceptable—but also unavoidable. Once his results had become known, a number of other thinkers started considering the problem and trying to find a way around those results.
Now, why was the Repugnant Conclusion considered unacceptable? For that matter, why were the dilemmas whose solutions led to the RC considered “dilemmas” in the first place? Not because any of them would have violated any logical rules of inference. Rather, we looked at them and thought “no, my morality says that that is wrong”, and then (engaging in motivated cognition) began looking for a consistent way to avoid having to accept the result. In effect, our minds contained dynamics which rejected the RC as a valid result, but that rejection came from our subconscious values, not from any classical reasoning rule that you could implement in an algorithm. Or you could conceivably implement the rule in the algorithm if you had a thorough understanding of our values, but that’s not of much help if the algorithm is supposed to figure out our values.
You can generalize this problem to all kinds of philosophy. In decision theory, we already have an intuitive value of what “winning” means, and are trying to find a way to formalize it in a way that fits our value. In epistemology, we have some standards about the kind of “truth” that we value, and are trying to come up with a system that obeys those standards. Etc.
The root problem is that classification and inference require values. As Watanabe (1974) writes:
“Progress” in philosophy essentially means “finding out more about the kinds of things that we value, drawing such conclusions that our values say are correct and useful”. I am not sure how one could make an AI make progress in philosophy if we didn’t already have a clear understanding of what our values were, so “white-box metaphilosophy” seems to just reduce back to a combination of “normative AI” and “black-box metaphilosophy”.
Coincidentally, I ended up reading Evolutionary Psychology: Controversies, Questions, Prospects, and Limitations today, and noticed that it makes a number of points that could be interpreted in a similar light: in that humans do not really have a “domain-general rationality”, and that instead we have specialized learning and reasoning mechanisms, each of which are carrying out a specific evolutionary purpose and which are specialized for extracting information that’s valuable in light of the evolutionary pressures that (used to) prevail. In other words, each of them carries out inferences that are designed to further some specific evolutionary value that helped contribute to our inclusive fitness.
The paper doesn’t spell out the obvious implication, since that isn’t its topic, but it seems pretty clear to me: since our various learning and reasoning systems are based on furthering specific values, our philosophy has also been generated as a combination of such various value-laden systems, and we can’t expect an AI reasoner to develop a philosophy that we’d approve of unless its reasoning mechanisms also embody the same values.
That said, it does suggest a possible avenue of attack on the metaphilosophy issue… figure out exactly what various learning mechanisms we have and which evolutionary purposes they had, and then use that data to construct learning mechanisms that carry out similar inferences as humans do.
I always suspected that natural kinds depended on an underdetermined choice of properties, but I had no idea there was or could be a theorem saying so. Thanks for pointing this out.
Does a similar point apply to Solomonoff Induction? How does the minimum length of the program necessary to generate a proposition, vary when we vary the properties our descriptive language uses?
Do you have thoughts on the other approaches described here? It seems to me that black box metaphilosophical AI, in your taxonomy, need not be untestable nor dangerous during a transient period.
If I understand correctly, in order for your designs to work, you must first have a question-answerer or predictor that is much more powerful than a human (i.e., can answer much harder questions than a human can). For example, you are assuming that the AI would be able to build a very accurate model of an arbitrary human overseer from sense data and historical responses and predict their “considered judgements”, which is a superhuman ability. My concern is that when you turn on such an AI in order to test it, it might either do nothing useful (i.e., output very low quality answers that give no insight into how safe it would eventually be) because it’s not powerful enough to model the overseer, or FOOM out of control due to a bug in the design or implementation and the amount of computing power it has. (Also, how are you going to stop others from making use of such powerful answerers/predictors in a less safe, but more straightforward and “efficient” way?)
With a white-box metaphilosophical AI, if such a thing was possible, you could slowly increase its power and hopefully observe a corresponding increase in the quality of its philosophical output, while fixing any bugs that are detected and knowing that the overall computing power it has is not enough for it to vastly outsmart humans and FOOM out of control. It doesn’t seem to require access to superhuman amounts of computing power just to start to test its safety.
I don’t think that the question-answerer or reinforcement learner needs to be superhuman. I describe them as using human-level abilities rather than superhuman abilities, and it seems like they could also work with subhuman abilities. Concretely, if we imagine applying those designs with a human-level intelligence acting in the interests of a superhuman overseer, they seem (to me) to work fine. I would be interested in problems you see with this use case.
Your objection to the question-answering system seemed to be that the AI may not recognize that human utterances are good evidence about what the overseer would ultimately do (even if they were), and that it might not be possible or easy to teach this. If I’m remembering right and this is still the problem you have in mind, I’m happy to disagree about it in more detail. But it seems that this objection couldn’t really apply to the reinforcement learning approach.
It seems like these systems could be within a small factor of optimal efficiency (certainly within a factor of 2, say, but hopefully much closer). I would consider a large efficiency loss to be failure.
The AI needs to predict what the human overseer “wants” from it, i.e., what answers the human would score highly. If I was playing the role of such an AI, I could use the fact that I am myself a human and think similarly to the overseer, and ask myself, “If I was in the overseer’s position, what answers would I judge highly?” In particular, I could use the fact that I likely have philosophical abilities similar to the overseer’s, and could just apply my native abilities to satisfy the overseer. I do not have to first build a detailed model of the overseer from scratch and then run that model to make predictions. It seems to me that the AI in your design would have to build such a model, and doing so seems a superhuman feat. In other words, if I did not already have native philosophical abilities on par with the overseer’s, I couldn’t give answers to any philosophical questions that the overseer would find helpful, unless I had the superhuman ability to create a model of the overseer, including his philosophical abilities, from scratch.
Suppose that you are the AI, and the overseer is a superintelligent alien with very different values and philosophical views. How well do you think that things will end up going for the alien? (Assuming you are actually trying to win at the RL / question-answering game.)
It seems to me like you can pursue the aliens’ values nearly as well as if they were your own. So I’m not sure where we disagree (assuming you don’t find this thought experiment convincing):
Do you think that you couldn’t satisfy the alien’s values?
Do you think that there is a disanalogy between your situation in the hypothetical and the situation of a subhuman AI trying to satisfy our values?
Something else?
I think that while my intelligence is not greater than the alien’s, I would probably do the thing that you suggested, “don’t do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources”, but if the aliens were to start to trust me enough to upgrade my cognitive abilities to be above theirs, I could very well end up causing disaster (from their perspective) either by subtly misunderstanding some fine point of their values/philosophical views (*), or by subverting the system through some design or implementation flaw. The point is that my behavior while my abilities are less than super-alien is not a very good indication of how safe I will eventually be.
(*) To expand on this, suppose that as my cognitive abilities increase, I develop increasingly precise models of the alien, and at some point I decide that I can satisfy the alien’s values better by using resources directly instead of letting the alien retain control (i.e., I could act more efficiently this way and I think that my model is as good as the actual alien), but it turns out that I’m wrong about how good my model is, and end up acting on a subtly-but-disastrously wrong version of the alien’s values / philosophical views.
I discuss the most concerning-to-me instance of this in problem (1) here; it seems like that discussion applies equally well to anything that might work fine at first but then break when you become a sufficiently smart reasoner.
I think the basic question is whether you can identify and exploit such flaws at exactly the same time that you recognize their possibility, or whether you can notice them slightly before. By “before” I mean with a version of you that is less clever, has less time to think, has a weaker channel to influence the world, or is treated with more skepticism and caution.
If any of these versions of you can identify the looming problem in advance, and then explain it to the aliens, then they can correct the problem. I don’t know if I’ve ever encountered a possible flaw that wasn’t noticeable “before” it was exploitable in one of these senses. But I may just be overlooking them, and of course even if we can’t think of any it’s not such great reassurance.
Of course even if you can’t identify such flaws, you can preemptively improve the setup for the aliens, in advance of improving your own cognition. So it seems like we never really care about the case where you are radically smarter than the designer of the system, we care about the case where you are very slightly smarter. (Unless this system-improvement is a significant fraction of the difficulty of actually improving your cognition, which seems far-fetched.)
Other than the issue from the first part of this comment, I don’t really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors. I certainly don’t expect you to predictably make more of them.
(I understand that this is a bit subtle, because as you get smarter the problem also may get harder, since your plans will e.g. be subject to more intense scrutiny and to more clever counterproposals. But that doesn’t seem prone to lead to the kinds of errors you discuss.)
Paraphrasing, I think you’re saying that, if the reinforcement game setup continues to work, you expect to make fewer errors as you get smarter. And the only way getting smarter hurts you is if it breaks the game (by enabling you to fall into traps faster than you can notice and avoid them).
Is that right?
I don’t see how this could predictably happen without something else going wrong first. I agree and acknowledged that the question-learning solution is hard-to-test, so let’s focus on the RL approach. (Though I also don’t expect this to happen for the question-answering solution.) In this comment I’ll focus on the misunderstanding case.
So in the future, you expect to predictably make a decision which the aliens would consider catastrophically bad. It seems to me like:
If the solution would really be considered catastrophically bad, and it is chosen for evaluation, then it will receive a very low payoff—unless the scheme fails in some other way that we have not yet discussed.
So you would only make such mistakes if you thought that you would receive enough expected benefit from more aggressive decisions that it offsets this predictable possibility of a low payoff from catastrophic error.
But if you took more conservative actions, you could justify those actions (when they were evaluated) by explaining the predicted possibility of a catastrophic outcome. Unless something else has gone wrong, the aliens care more about averting this prospect of a bad outcome than saving time by you being more aggressive, so they shouldn’t penalize you for this.
So if you behave aggressively even at the risk of a catastrophic error, it seems like one of the following must have gone wrong:
In fact the aliens wouldn’t be able to detect a catastrophic error during evaluation.
The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
The aliens wouldn’t accept the justification for conservatism, based on a correct argument that its costs are outweighed by the possibility for error.
This argument is wrong, or else it’s right but you wouldn’t recognize this argument or something like it.
Any of these could happen. 1 and 3 seem like they lead to more straightforward problems with the scheme, so would be worthwhile to explore on other grounds. 2 doesn’t seem likely to me, unless we are dealing with a very minor catastrophe. But I am open to arguing about it. The basic question seems to be how tough it is to ask the aliens enough questions to avoid doing anything terrible.
The examples you give in the parallel thread don’t seem like they could present a big problem; you can ask the alien a modest number of questions like “how do you feel about the tradeoff between the world being destroyed and you controlling less of it?” And you can help to the maximum possible extent in answering them. Of course the alien won’t have perfect answers, but their situation seems better than the situation prior to building such an AI, when they were also making such tradeoffs imperfectly (presumably even more imperfectly, unless you are completely unhelpful to the aliens for answering such questions). And there don’t seem to be many plans where the cost of implementing the plan is greater than the cost of consulting the alien about how it feels about possible consequences of that plan.
Of course you can also get this information in other ways (e.g. look at writings and past behavior of the aliens) or ask more open-ended questions like “what are the most likely way things could go wrong, given what I expect to do over the next week,” or pursue compromise solutions that the aliens are unlikely to consider too objectionable.
ETA: actually it’s fine if the catastrophic plan is not evaluated badly; all of the work can be done in the step where the aliens prefer conservative plans to aggressive ones in general, after you explain the possibility of a catastrophic error.
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
With any of the black-box designs I’ve seen, I would be very reluctant to push the button that would potentially give it superhuman capabilities, even if we have theoretical reasons to think that it would be safe, and we’ve fixed all the problems we’ve detected while testing at lower levels of computing power. There are too many things that could go wrong with such theoretical reasoning, and easily many more flaws that won’t become apparent until the system becomes smarter. Basically the only reason to do it would be time pressure, due to the AI race or something else. (With other kinds of FAI designs, i.e., normative and white-box metaphilosophical, it seems that we can eventually be more confident about their safety but they are harder to design and implement in the first place, so we should wait for them if we have the option to.) Do you agree with this?
In some sense I agree. If there were no time pressure, then we would want to proceed in only the very safest way possible, which would not involve AI at all. My best guess would be to do a lot of philosophical and strategic thinking as unmodified and essentially unaided humans, perhaps for a very very long time. After that you might decide on a single, maximally inoffensive computational aid, and then repeat. But this seems like quite an alien scenario!
I am not sold that in milder cases you would be much better off with e.g. a normative AI than black box designs. Why is it less error prone? It seems like normative AI must perform well across a wide range of unanticipated environments, to a much greater extent than with black box designs, and with clearer catastrophic consequences for failure. It seems like you would want to do something that remains under the control of something as close to a human as possible, for as long as possible.
In some sense the black box approach is clearly more dangerous (ignoring time limits), since it doesn’t really get you closer to your goal. We will probably have to solve these other problems eventually. The black box metaphilosophical AI is really more like a form of cognitive enhancement. But it seems like enhancement is basically the right thing to do for now, even if we make the time crunch quite a bit milder.
I agree that you need to rely extensively on empirical evidence to gain confidence in your system (though I think this is probably true for other plausible approaches as well).
I don’t fully understand the argument against using such empirical evidence, or rather I don’t see how to make the argument go through without the prospect of a treacherous turn, which we haven’t addressed at all.
I agree that there are kinds of problems which do not arise when testing with weak systems, and these problems cannot be detected empirically until you are working with stronger systems. The main question is whether you can debug these problems while working with stronger systems.
For concreteness, imagine a bug in your system for rendering answers; a malicious answer can compromise the computer and assign the answerer a high reward. Let’s be pessimistic, and assume that the easiest way to compromise the computer is to create a devourer that will be released into the world, that is as smart as the answerer itself, and whose preferences are maximally destructive.
(This is going to repeat ideas from the great-uncle...)
It seems like this possibility is not catastrophic.
Before:
The attack can probably be noticed by the only slightly weaker agents which helped you design the current system. The only problem is if the attack cannot be noticed by any agent who isn’t clever enough to realize that such an attack is possible.
You can sanitize the agent’s answers using a similarly intelligent agent with more restricted outputs, e.g. one who just says “possible exploit” or “not a possible exploit.”
The execution environment itself can make extensive use of similarly smart agents, and is therefore less likely to be vulnerable to an attack even if inexpertly designed.
After:
After the attack, the devourer is not much smarter than the other AI’s in the world (for whom, by hypothesis, the control system has been working OK so far). So it doesn’t seem like it should do much damage before being contained.
The basic issue is that agent Z is preceded by nearly-as-smart agent Y, and so the prospects for danger are not that great if agent Z (and only agent Z) fails. The argument for catastrophic risks seems to come from the big disparity in power between the AI being controlled and the human overseer.
The concern seems to rest on the prospect of a treacherous turn: if agent Z is clever enough, then agent Y will also defect rather than helping you contain agent Z, even if in earlier tests agent Y seemed to be benign.
But I don’t see why this would happen for the RL system, and it seems like that’s the crux of the argument. I agree that it would happen for the question-answering system (I was the first to admit that the question-answering system was hard to test).
You may have other kinds of difficulties in mind, but all of the ones that I can think of seem to rest on a treacherous turn or something similar. Is there some other reason to expect failure to be catastrophic?
I’m not pointing out any specific reasons, but just expect that in general, failures when dealing with large amounts of computing power can easily be catastrophic. You have theoretical arguments for why they won’t be, given a specific design, but again I am skeptical of such arguments in general.
I agree there is some risk that cannot be removed with either theoretical arguments or empirical evidence. But why is it greater for this kind of AI than any other, and in particular than white-box metaphilosophical or normative AI?
Normative AI seems like by far the worst, since:
it generally demonstrates a treacherous turn if you make an error,
it must work correctly across a range of unanticipated environments
So in that case we have particular concrete reasons to think that empirical testing won’t be adequate, in addition to the general concern that empirical testing and theoretical argument are never sufficient. To me, white-box metaphilosophical AI seems somewhere in between.
(One complaint is that I just haven’t given an especially strong theoretical argument. I agree with that, and I hope that whatever systems people actually use, they are backed by something more convincing. But the current state of the argument seems like it can’t point in any direction other than in favor of black-box designs, since we don’t yet have any arguments at all that any other kind of system could work.)
It seems like the question is: “How much more productive is the aggressive policy?”
It looks to me like the answer is “Maybe it’s 1% cheaper or something, though probably less.” In this case, it doesn’t seem like the AI itself is introducing (much of) a PD situation, and the coordination problem can probably be solved.
I don’t know whether you are disagreeing about the likely cost of the aggressive policy, or the consequences of slight productivity advantages for the aggressive policy. I discuss this issue a bit here, in a post I wrote a few days ago but just got around to posting.
Of course there may be orthogonal reasons that the AI faces PD-like problems, e.g. it is possible to expand in an undesirably destructive way by building an unrelated and dangerous technology. Then either:
The alien user would want to coordinate in the prisoner’s dilemma. In this case, the AI will coordinate as well (unless it makes an error leading to a lower reward).
The alien user doesn’t want to coordinate in the prisoner’s dilemma. But in this case, the problem isn’t with the AI at all. If the users hadn’t built AI they would have faced the same problem.
I don’t know which of these you have in mind. My guess is you are thinking of (2) if anything, but this doesn’t really seem like an issue to do with AI control. Yes, the AI may have a differential effect on e.g. the availability of destructive tech and our ability to coordinate, and yes, we should try to encourage differential progress in AI capabilities just like we want to encourage differential progress in society’s capabilities more broadly. But I don’t see how any solution to the AI control problem is going to address that issue, nor does it seem especially concerning when compared to the AI control problem.
Maybe we have different things in mind for “aggressive policy”. I was thinking of something like “give the AI enough computing power to achieve superhuman intelligence so it can hopefully build a full-fledged FAI for the user” vs. the “conservative policy” of “keep the AI at its current level where it seems safe, and find another way to build an FAI”.
A separate but related issue is that it appears such an AI can either be a relatively safe or unsafe AI, depending on the disposition of the overseer (since an overseer less concerned with safety would be more likely to approve of potentially unsafe modifications to the AI). In a sidenote of the linked article, you wrote about why unsafe but more efficient AI projects won’t overtake the safer AI projects in AI research:
But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
Controlling the distribution of AI technology is one way to make someone’s life harder, but it’s not the only way. If we imagine a productivity gap as small as 1%, it seems like it doesn’t take much to close it.
(Disclaimer: this is unusually wild speculation; nothing I say is likely to be true, but hopefully it gives the flavor.)
If unsafe projects perfectly pretend to be safe projects, then they aren’t being more efficient. So it seems like we can assume that they are observably different from safe projects. (For example, there can’t just be complexity-loving humans who oversee projects exactly as if they had normal values; they need to skimp on oversight in order to actually be more efficient. Or else they need to differ in some other way...) If they are observably different, then possible measures include:
Even very small tax rates coupled with redistribution that is even marginally better-directed at safe projects (e.g. that goes to humans)
Regulatory measures to force everyone to incur the overhead, or most of the overhead, of being safe, e.g. lower bounds on human involvement.
Today many trades involve trust and understanding between the parties (e.g. if I go work for you). Probably some trades will retain this character. Honest people may be less happy to trade with those they expect to be malicious. I doubt this would be a huge factor, but 1% seems tiny.
Even in this scenario it may be easy to make technology which is architecturally harder to use by unsafe projects. E.g., it’s not clear whether the end user is the only overseer, or whether some oversight can be retained by law enforcement or the designers or someone else.
Of course unsafe projects can go to greater lengths in order to avoid these issues, for example by moving to friendlier jurisdictions or operating a black market in unsafe technology. But as these measures become more extreme they become increasingly easy to identify. If unsafe jurisdictions and black markets have only a few percent of the population of the world, then it’s easy to see how they could be less efficient.
(I’d also expect e.g. unsafe jurisdictions to quickly cave under international pressure, if the rents they could extract were a fraction of a percent of total productivity. They could easily be paid off, and if they didn’t want to be paid off, they would not be militarily competitive.)
All of these measures become increasingly implausible at large productivity differentials. And I doubt that any of these particular foreseeable measures will be important. But overall, given that there are economies of scale, I find it very likely that the majority can win. The main question is whether they care enough to.
Normally I am on the other side of a discussion similar to this one, but involving much larger posited productivity gaps and a more confident claim (things are so likely to be OK that it’s not worth worrying about safety). Sorry if you were imagining a very much larger gap, so that this discussion isn’t helpful. And I do agree that there is a real possibility that things won’t be OK, even for small productivity gaps, but I feel like it’s more likely than not to be OK.
Also note that at a 1% gap, we can basically wait it out. If 10% of the world starts out malicious, then by the time the economy has grown 1000x, then 11% of the world is malicious, and it seems implausible that the AI situation won’t change during that time—certainly contemporary thinking about AI will be obsoleted, in an economic period as long as 0-2015AD. (The discussion of social coordination is more important in the case where there are larger efficiency gaps, and hence probably larger differences in how the projects look and what technology they need.)
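Here is the back-of-the-envelope arithmetic behind that figure, as a minimal sketch under the simplifying assumption (not spelled out above, and complicated by the ETA below) that a flat 1% productivity edge simply compounds with overall growth:

```python
# If the safe sector grows by a factor G while the unsafe sector grows 1% faster
# (i.e. by G**1.01), how does a 10% initial share of world output change?
G = 1000.0                      # overall growth factor of the safe sector
edge = 0.01                     # relative productivity advantage of unsafe projects
unsafe0, safe0 = 0.10, 0.90     # initial shares of world output

unsafe = unsafe0 * G ** (1 + edge)
safe = safe0 * G
print(unsafe / (unsafe + safe))  # ~0.106, i.e. roughly 11% after 1000x growth
```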
ETA: Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit; overall this issue really seems too complicated for this kind of vague theoretical speculation to be meaningfully accurate, but I hope I’ve given the basic flavor of my thinking.
And finally, I intended 1% as a relatively conservative estimate. I don’t see any particular reason you need to have so much waste, and I wouldn’t be surprised if it ends up much lower, if future people end up pursuing some strategy along these lines.
1% seems really low to me. Suppose for example that the AI invents a modification to itself, which is meant to improve its performance. A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand, while an incautious overseer might be willing to just approve the modification right away and start using it. It seems to me that the cost of developing an understandable and convincing explanation of the improvement and its safety, and then waiting for the overseer to process that, could easily be greater than 1% (or even 100%) of the cost of inventing the improvement itself.
Also, caution/safety is a matter of degree, and it seems hard to define what “unsafe” means, for the purpose of imposing a penalty on all unsafe projects. (As you said, it would be even safer to not use any machine aid at all until much later. Where/how do you draw the line between “safe enough” and “unsafe”?) It also seems hard to tell which projects are skimping on safety from the outside (you can observe that they are advancing faster than expected given their resource base, but how do you rule out other explanations for that?), while from the inside, the workers “in the know” can probably find all kinds of justifications for doing what they are doing.
Yes, I was about to point this out. In many areas the difference between the best and the second best is already the difference between making a profit and making a loss. This will be even more true for AI, since if AIs can be copied, there won’t be many fields where it makes economic sense to hire a 99% efficient AI instead of a copy of a 100% efficient AI.
Without this issue the “wait it out” solution would seem almost completely satisfactory to me for a gap as small as 1% (or even 5%). Most of my comment was about making unsafe projects less efficient, which (I think) addresses this point.
Also note that a 1% productivity benefit only leads to >> 1% efficiency if you are trading extensively with the rest of the world, so that most of your revenue is from sales (and the rest of the world can tell you are the most efficient, or at least that you are selling a lot of stuff cheaply). It’s much easier to regulate or make life harder for someone who is trading extensively with the rest of the world than someone who is operating in secret, so this seems to strengthen the arguments given in the body of my comment.
The idea is just to make the relative cost of safety as low as possible. In the linked post I gave a quantification of safety. So let’s consider some 99.9% safe project, and the level of oversight it requires.
If this oversight is expensive because it involves using some resource (like involving human overseers, or periodically pausing as you wait on an overseer, or whatever) then it would be sufficient to require each project to use that resource, or to provide that resource for free (using tax dollars) to any project, and so on.
Alternatively, if there is some kind of oversight (e.g. periodic high-intensity audits, or AI oversight of AI projects, or mandated close involvement of human auditors) then the goal would just be to ensure that the price of evading detection eats up the efficiency benefits of unsafety. This looks pretty plausible to me, but you may be more skeptical about the feasibility of oversight.
In this proposal:
A cautious overseer demands such an argument with very small probability. I’ll write soon about just how small I think this probability can safely be; in the original post I think I suggested 1%, but it could be much lower. If the check is only done with probability 1/10000, then it’s fine if the cost is 10,000% of the cost of implementing the project (the arithmetic is sketched just after this list).
The human overseer has very little role in the early stages of the evaluation, and certainly they don’t have to understand the whole proposal. In my proposal as written she has to relay questions between different AIs, but of course you could automate that completely, so that the first stages of evaluation are just done by other AIs (whose evaluations are accountable to other AIs, whose evaluations are accountable to other AIs… whose evaluations are accountable to the overseer). At worst, the overseer’s role is similar to the arbitrator in this scheme, though she has many additional techniques at her disposal.
If the world is moving fast, the main problem is probably the latency of the human evaluation. But there are a bunch of plausible-seeming techniques for getting to low latency. I hope to write about this soon as well.
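To illustrate the first point above, here is a minimal sketch of the expected-overhead arithmetic for randomized checks; the 1/10,000 and 10,000% figures are the illustrative ones from that point, not recommendations.

```python
# Expected overhead of randomized evaluation: an audit that is very expensive
# but very rarely triggered adds only a small cost in expectation.

p_check = 1 / 10_000           # probability that a given action is actually audited
audit_cost_multiplier = 100.0  # audit costs 10,000% of the project's own cost

expected_overhead = p_check * audit_cost_multiplier
print(f"expected overhead: {expected_overhead:.2%} of project cost")  # 1.00%
```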
ETA: What I should have said is that an overseer requests explanations very rarely during normal, intended operation. But this relies on the AI’s ability to find a policy which the overseer will predictably approve of.
Perhaps your point is that a more cautious overseer should request explanations more often, owing to the AI’s limited ability to predict the overseer’s reaction. But if so, we are going around in circles. What is the nature of these improvements, such that whether they are a good idea or not depends in such a detailed way on the values or philosophical views of the overseer? Why can’t one AI build an improved AI which also follows an innocuous policy like “don’t do anything terrible; acquire resources; let the user control those resources”?
Even if the actual biological overseer doesn’t evaluate a proposal, the AI still has to predict what the overseer’s judgement would be, if he or she were to, right? That involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc. I don’t see how that could be only 1% less efficient, compared to an incautious overseer who the AI knows probably won’t demand such arguments or consult lots of sub-AIs and would be willing to try out any improvements that aren’t obviously dangerous.
Separately, the “innocuous” AI you talked about suffers from two efficiency-sapping problems: the AI isn’t sure what the overseer’s values are, and the overseer isn’t sure what his own values are. This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous. For example suppose the overseer isn’t sure whether negative feedback received by reinforcement-based agents has negative moral value, so the AI has to avoid building lots of reinforcement-based subagents. Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time. These efficiency-sapping problems are worse for overseers with more complex ethical views, and more uncertain ethical views.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do. People who are less cautious will again have an efficiency advantage while doing this. E.g., they might be fine with building a standard utility-maximizing AI based on a crude model of their current understanding of ethics. I do not see how mandatory oversights or other social techniques can prevent this outcome, if you’re imagining a world where your AI design is being used widely. Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
Re paragraph 3: it seems like these are mostly considerations that might strengthen your conclusions if we granted that there was a big productivity difference between my design and “a standard utility-maximizing AI based on a crude model of their current understanding of ethics.” But I would already be happy to classify a large productivity loss as a failure, so let’s just concentrate on the claimed productivity loss.
These incentives only operate if there is a big productivity difference.
Beyond that, if the kinds of issues people run into are “the AI faces a lot of everyday ethical questions in the course of acquiring resources,” then it really seems like what you need is a not-catastrophically-wrong model of human morality, which would probably just be built in mundane ways. I don’t see a strong argument that this would require lots of impressive conceptual progress, rather than being similar-in-kind to building a predictive model of anything else. But I suspect this is just a reflection of the disagreement about paragraph 2, which we should flesh out in the sibling.
This doesn’t seem like a very general argument against the possibility of mandatory oversight or technological handicapping, and the measures you describe seem like strawmen. I agree that whatever kind of oversight you employ, it will be possible to subvert it, whatever tax you charge it will be possible to evade it, and so on. But doing so will often come with a cost (as it does today), and it just doesn’t seem that hard to get it up to a 1% loss (say). We could talk more about the particular measures that could be taken for oversight; I’m sure we can both imagine many regulatory and technological approaches that would be more annoying to sidestep than an entry in a configuration file, but I suspect our disagreement comes from us imagining different productivity gaps.
The prospect of someone designing their own AI, which is very architecturally different from the rest of the world, just doesn’t seem especially troubling, unless you imagine that the rest of the world is using a significantly handicapped design. (See the first sentence of this reply.)
Re paragraph 1:
I don’t think this is necessary. Let’s go back to the alien analogy. Let’s assume that you have an action that effectively implements U = “acquire resources; don’t be evil; leave the user in control.” I claimed that pursuing U is not much harder than acquiring resources on its own; for the moment can we grant that, since it seems orthogonal to the issues in this paragraph?
So you have this good action, and you are deciding what to do. I claim that you should just do it, with no extra computation. Then, if it gets evaluated, you will have to think about rhetorical strategies to use when convincing the overseer it’s a good idea. Also, a copy of you will have to think about how to convince the overseer that it’s a bad idea. And the overseer will need to actually think about the issue. And so on.
There are some issues that you’ll need to think about before you commit to the action. For example, if there is a knockdown argument that this action is a bad idea, then you shouldn’t do it. But that thinking seems to be a necessary part of maximizing U, not an extra step (unless the arguing system is broken).
You might be able to do even better by thinking through all of the arguments in advance, and taking an action which is not merely good but which will fare especially well during an evaluation. But if the extra computational costs of doing so are significant, this has to be weighed against the benefits of actually finding a better plan. It seems like this basically comes down to the question “If you take the very most efficient plan, and we perform some elaborate procedure for arguing, can you actually ‘convince’ the overseer that it’s a good plan?”
If the answer is “no” then you are screwed anyway, before we even start thinking about computational overhead—you simply can’t take the best plan, because you wouldn’t be able to receive a good rating if you do. And if the answer is “yes” then the problem seems to go away. The justifiability of good plans does seem to be a key question about the feasibility of this scheme, which I am happy to discuss if it’s the underlying objection you have in mind.
Re paragraph 2:
If the most efficient actions are ethically disastrous, then we have a fundamental problem which seems to me to be unrelated to AI safety, and which no AI control proposal will address. For example, if the most efficient strategy is to build a virus that kills everyone but you, and an AI is considering this strategy but has to reject it because it is unethical, then we are just out of luck. We could call this a problem with “AI,” but it’s really a problem with biotechnology.
If a certain kind of reinforcement learning is especially efficient but morally unacceptable, then that seems to be the same situation. What are we supposed to do, other than either accept the moral cost or adopt a good enough social solution to overcome the efficiency gap? What kind of solution might you hope to find that would make this kind of problem go away?
If the efficient actions merely might be ethically disastrous, then I guess the cost is supposed to be the time required to clarify the overseer’s values. Which brings us to:
The question is just how many distinct questions of this form there are, and how important they are to the AI’s plans. If there were merely a billion such questions it doesn’t seem like a big deal at all (though then a significant occupation of humans would be answering moral questions).
Even that strikes me as completely implausible given our experience so far (combined with my inability to see many future examples). If I were the user, and people were trying to optimize values using the range of policies available today, then it seems like they would have had to ask me no more than a dozen or so questions to get things basically right (i.e. realizing much more than 99% of the potential value from my perspective). So this seems to require moral problems to proliferate at a much faster rate than technological problems.
Do you disagree about the importance of hard ethical questions in the situation today (e.g. I am implicitly overlooking many important issues because I’m not used to dealing with an AI), or do you just expect more proliferation in the future?
Also, the problem of predicting human moral judgments doesn’t seem to be radically harder than the problem of e.g. negotiating with humans. I guess this is just another angle on “how many distinct moral questions do you have to answer?” since the real question is how much you can generalize from each answer. I don’t feel like there are that many hard-to-predict parameters before everything reduces to easy-to-predict consequences.
Your examples of possible mistakes seemed to involve not knowing how the alien would feel about particular tradeoffs. This doesn’t seem related to how much computational power you have, except insofar as having more power might lead you to believe that it is safe to try and figure out what the alien thinks from first principles. But that’s not a necessary consequence of having more computing power, and I gave an argument that more computing power shouldn’t predictably lead to trouble.
Why do you think that more computing power requires a strategy which is “aggressive” in the sense of having a higher probability of catastrophic failure?
You might expect that building “full-fledged FAI” requires knowing a lot about the alien, and you won’t be able to figure all of that out in advance of building it. But again, I don’t understand why you can’t build an AI that implements a conservative strategy, in the sense of being quick to consult the user and unlikely to make a catastrophic error. So it seems like this just begs the question about the relative efficacy of conservative vs. aggressive strategies.
I don’t quite understand the juxtaposition to the white box metaphilosophical algorithm. If we could make a simple algorithm which exhibited weak philosophical ability, can’t the RL learner also use such a simple algorithm to find weak philosophical answers (which will in turn receive a reasonable payoff from us)?
Is the idea that by writing the white box algorithm we are providing key insights about what metaphilosophy is, that an AI can’t extract from a discussion with us or inspection of our philosophical reasoning? At a minimum it seems like we could teach such an AI how to do philosophy, and this would be no harder than writing an algorithm (I grant that it may not be much easier).
It seems to me that we need to understand metaphilosophy well enough to be able to write down a white-box algorithm for it, before we can be reasonably confident that the AI will correctly solve every philosophical problem that it eventually comes across. If we just teach an AI how to do philosophy without an explicit understanding of it in the form of an algorithm, how do we know that the AI has fully learned it (and not some subtly wrong version of doing philosophy)?
Once we are able to write down a white-box algorithm, wouldn’t it be safer to implement, test, and debug the algorithm directly as part of an AI designed from the start to take advantage of the algorithm, rather than indirectly having an AI learn it (and then presumably verifying that its internal representation of the algorithm is correct and there aren’t any potentially bad interactions with the rest of the AI)? And even the latter could reasonably be called white-box as well, since you are actually looking inside the AI and making sure that it has the right stuff inside. I was mainly arguing against a purely black-box approach, where we start to build AIs while having little understanding of metaphilosophy, and therefore can’t look inside the AI to see if it has learned the right thing.
I don’t think this is core to our disagreement, but I don’t understand why philosophical questions are especially relevant here.
For example, it seems like a relatively weak AI can recognize that “don’t do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources” is a praise-winning strategy, and then do it. (Especially in the reinforcement learning setting, where we can just tell it things and it can learn that doing things we tell it is a praise-winning strategy.) This strategy also seems close to maximally efficient—the costs of keeping humans around and retaining the ability to consult them are not very large, and the cost of eliciting the needed information is not very high.
So it seems to me that we should be thinking about the AI’s ability to identify and execute strategies like this (and our ability to test that it is correctly executing such strategies).
I discussed this issue a bit in problems #2 and #3 here. It seems like “answers to philosophical questions” can essentially be lumped under “values,” in that discussion, since the techniques for coping with unknown values also seem to cope with unknown answers to philosophical questions.
ETA: my position looks superficially like a common argument that people give for why smart AI wouldn’t be dangerous. But now the tables are turned—there is a strategy that the AI can follow which will cause it to earn high reward, and I am claiming that a very intelligent AI can find it, for example by understanding the intent of human language and using this as a clue about what humans will and won’t approve of.
Acquiring resources has a lot of ethical implications. If you’re inventing new technologies and selling them, you could be increasing existential risk. If you’re trading with others, you would be enriching one group at the expense of another. If you’re extracting natural resources, there are questions of fairness (how hard should you drive bargains or attempt to burn commons) and time preference (do you want to maximize short term or long term resource extraction). And how much do you care about animal suffering, or the world remaining “natural”? I guess the AI could present a plan that involves asking the overseer to answer these questions, but the overseer probably doesn’t have the answers either (or at least should not be confident of his or her answers).
What we want is to develop an AI that can eventually do philosophy and answer these questions on its own, and correctly. It’s the “doing philosophy correctly on its own” part that I do not see how to test for in a black-box design, without giving the AI so much power that it can escape human control if something goes wrong. The AI’s behavior, while it’s in the not-yet-superintelligent, “ask the overseer about every ethical question” phase, doesn’t seem to tell us much about how good the design and implementation is, metaphilosophically.
Google Maps answers “how do I get from point A to point B” better than a human does. I don’t think it does nothing useful just because it’s not powerful enough to model the overseer.
I think you misunderstood. My comments were meant to apply to Paul Christiano’s specific proposals, not to AIs in general.
Does a sped-up uploaded mind count as a kind of black-box metaphilosophical AI?
On the other hand, to the extent that our uncertainty about whether different BBMAI designs do philosophy correctly is independent, we can build multiple ones and see what outputs they agree on. (Or a design could do this internally, achieving the same effect.)
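A minimal sketch of that redundancy intuition, under the (strong) assumptions that each design errs independently with some probability q and that independent errors rarely land on the same wrong output; the value of q and the design counts are purely hypothetical.

```python
# If each black-box metaphilosophical AI design is wrong independently with
# probability q, the chance that every one of n designs is wrong falls off as
# q**n. Correlated errors would weaken this, which is exactly the caveat about
# how independent the designs really are.

q = 0.2  # hypothetical per-design error probability
for n in (1, 2, 3, 5):
    print(f"n = {n}: P(all designs wrong) = {q ** n:.4f}")
```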
This seems to be an argument for building a hybrid of what you call metaphilosophical and normative AIs, where the normative part “only” needs to be reliable enough to prevent initial disaster, and the metaphilosophical part can take over afterward.
I prefer the more cheerfully phrased “Converts the reachable universe to QALYs” but same essential principle.
Modulo complexity of value, I hope? I don’t think we’re in a position to pinpoint QALYs as The Thing to Tile.
Hence the quality-adjusted part. :)
Presumably, the complexity of value could go into the specifics of the quality-adjustment.
Except that “Life Years” assumes certain linearities that I think should not be assumed.
Perhaps you could make a taxonomy like yours when talking about a formally-defined singleton, which we might expect society to develop eventually. But I haven’t seen strong arguments that we would need to design such a singleton starting from anything like our current state of knowledge. The best argument I know that we might need to solve this problem soon is the possibility of a fast takeoff, which still seems reasonably unlikely (say < 10% probability) but is certainly worth thinking about more carefully in advance.
But even granting a fast takeoff, it seems quite likely that you can build AIs that “work around” this problem in other ways, particularly by remaining controlled by human owners or by quickly bootstrapping to a better prepared society. I don’t generally see why this would be subject to the same extreme difficulty you describe (the main reason for optimism is our current ignorance about what the situation will look like, and the large number of possible schemes).
And finally, even granting that we need to design such a singleton today (because of fast-takeoff and no realistic prospects for remaining in control), I don’t think the taxonomy you offer is exhaustive, and I don’t buy the claims of extraordinary difficulty.
There is a broad class of proposals in which an AI has a model of “what I would want” (either by the route that ordinary AI researchers find plausible, in which AI’s concepts are reasonably aligned with human concepts, or by more elaborate formal machinations as in my indirect normativity proposal or a more sophisticated version thereof). It doesn’t seem to be the case that you can’t test such designs until you are dealing with superhuman AI—normal humans can reason about such concepts, as we do, and as long as you can design any AI which uses such concepts and isn’t deliberately deceptive you can see whether it is doing something sensible. And it doesn’t seem to be the case that your concept has to be so robust that it can tile the whole universe, because what you would want involves opportunities for explicit reflection by humans. The more fundamental issue is just that we haven’t thought about this much, and so without formal justification I am pretty dubious of any claimed taxonomy or fundamental difficulty.
I agree that there is a solid case for making people smarter. I think there are better indirect approaches to making the world better, though (rather than directly launching in on human enhancement). In the rest of the world (and even the rest of the EA community) people are focusing on making the world better in this kind of broad way. And in fairness this work currently occupies the majority of my time. I do think it’s reasonably likely that I should focus directly on AI impacts, and that thinking about AI more clearly is the first step, but this is mostly coming from the neglected possibility of human-level AI relatively soon (e.g. < 40 years).
When you say “fast takeoff” do you mean the speed of the takeoff (how long it takes from start to superintelligence) or the timing of it (how far away it is from now)? Because later on you mention “< 40 years”, which makes me think you mean the latter here as well, and timing would also make more sense in the context of your argument, but then I don’t understand why you would give < 10% probability for takeoff in the next 40 years.
Superintelligent AIs controlled by human owners, even if it’s possible, seem like a terrible idea, because humans aren’t smart or wise enough to handle such power without hurting themselves. I wouldn’t even trust myself to control such an AI, much less a more typical, less reflective human.
Not sure what you mean by this. Can you expand?
Regarding your parenthetical “because of”, I think the “need” to design such a singleton comes from the present opportunity to build such a singleton, which may not last. For example, suppose your scenario of superintelligent AIs controlled by human owners become reality (putting aside my previous objection). At that time we can no longer directly build a singleton, and those AI/human systems may not be able to, or want to, merge into a singleton. They may instead just spread out into the universe in an out of control manner, burning the cosmic commons as they go.
There are all kinds of ways for this to go badly wrong, which have been extensively discussed by Eliezer and others on LW. To summarize, the basic problem is that human concepts are too fuzzy and semantically dependent on how human cognition works. Given complexity and fragility of value and likely alien nature of AI cognition, it’s unlikely that AIs will share our concepts closely enough for it to obtain a sufficiently accurate model of “what I would want” through this method. (ETA: Here is a particularly relevant post by Eliezer.)
My claim about not being able to test was limited to the black-box metaphilosophical AI, so doesn’t apply here, which instead has other problems, mentioned above.
Since you seem to bring up ideas that others have already considered and rejected, I wonder if perhaps you’re underestimating how much we’ve thought about this? (Or were you already aware of their rejection and just wanted to indicate your disagreement?)
This is quite possible. I’m not arguing that directly pushing for human enhancement is the best current intervention, just that it ought to be done at some point, prior to trying to build FAI.
I mean speed. It seems like you are relying on an assumption of a rapid transition from a world like ours to a world dominated by superhuman AI, whereas typically I imagine a transition that lasts at least years (which is still very fast!) during which we can experiment with things, develop new approaches, etc. In this regime many more approaches are on the table.
Even given shaky solutions to the control problem, it’s not obvious that you can’t move quickly to a much better prepared society, via better solutions to the control problem, further AI work, brain emulations, significantly better coordination or human enhancement, etc.
This is an interesting view (in that it isn’t what I expected). I don’t think that the AIs are doing any work in this scenario, i.e., if we just imagined normal humans going on their way without any prospect of building much smarter descendants, you would make similar predictions for similar reasons? If so, this seems unlikely given the great range of possible coordination mechanisms many of which look like they could avert this problem, the robust historical trends in increasing coordination ability and scale of organization, etc. Are there countervailing reasons to think it is likely, or even very plausible? If not, I’m curious about how the presence of AI changes the scenario.
I don’t find these arguments particularly compelling as a case for “there is very likely to be a problem,” though they are more compelling as an indication of “there might be a problem.”
Fragility and complexity of value doesn’t seem very relevant. The argument is never that you can specify value directly. Instead we are saying that you can capture concepts about respecting intentions, offering further opportunities for reflection, etc. (or in the most extreme case, concepts about what we would want upon reflection). These concepts are also fragile, which is why there is something to discuss here.
There are many concepts that seem useful (and perhaps sufficient) which seem to be more robust and not obviously contingent on human cognition, such as deference, minimal influence, intentions, etc. In particular, we might expect that we can formulate concepts in such a way that they are unambiguous in our current environment, and then maintain them. Whether you can get access to those concepts, or use them in a useful enough way, is again not clear.
The arguments given there (and elsewhere) just don’t consider most of the things you would actually do, even the ones we can currently foresee. This is a special case of the next point. For example, if an agent is relatively risk averse, and entertains uncertainty about what is “good,” then it may tend to pick a central example from the concept of good instead of an extreme one (depending on details of the specification, but it is easy to come up with specifications that do this). So saying “you always get extreme examples of a concept when you use it as a value for a goal-seeking agent” is an interesting observation and a cause for concern, but it is so far from a tight argument that I don’t even think of it as trying.
All of the arguments here are extremely vague (on both sides). Again, this is fine if we want to claim “there may be a problem.” Indeed, I would even agree that any particular proposal is very unlikely to work, and any class of proposals is pretty unlikely to work, etc. (I would say the same thing about approaches to AI itself). But it seems like it doesn’t entitle us to claim “there is definitely a problem,” especially to the extent that we are relying on the conjunction of many claims of the form “This won’t look robustly viable once we know more.”
In general, it seems that the burden of proof is on someone who claims “Surely X” in an environment which is radically unlike any environment we have encountered before. I don’t think that any very compelling arguments have been offered here, just vague gesturing. I think it’s possible that we should focus on some of these pessimistic possibilities because we can have a larger impact there. But your (and Eliezer’s) claims go further than this, suggesting that it isn’t worth investing in interventions that would modestly improve our ability to cope with difficulties (respectively clarifying understanding of AI and human empowerment, both of which slightly speed up AI progress), because the probability is so low. I think this is a plausible view, but it doesn’t look like the evidence supports it to me.
I’m certainly aware of the points you’ve raised, and at least a reasonable fraction of the thinking that has been done in this community on these topics. Again, I’m happy with these arguments (and have made many of them myself) as a good indication that the issue is worth taking seriously. But I think you are taking this “rejection” much too seriously in this context. If someone said “maybe X will work” and someone else said “maybe X won’t work,” I won’t then leave X off of (long) lists of reasons why things might work, even if I agreed with them.
This is getting a bit too long for a point-by-point response, so I’ll pick what I think are the most productive points to make. Let me know if there’s anything in particular you’d like a response on.
I try not to assume this, but quite possibly I’m being unconsciously biased in that direction. If you see any place where I seem to be implicitly assuming this, please point it out, but I think my argument applies even if the transition takes years instead of weeks.
Coordination ability may be increasing but is still very low on an absolute scale. (For example we haven’t achieved nuclear disarmament, which seems like a vastly easier coordination problem.) I don’t see it increasing at a fast enough pace to be able to solve the problem in time. I also think there are arguments in economics (asymmetric information, public choice theory, principal-agent problems) that suggest theoretical limits to how effective coordination mechanisms can be.
For each AI approach there is not a large number of classes of “AI control schemes” that are compatible or applicable to it, so I don’t understand your relative optimism if you think any given class of proposals is pretty unlikely to work.
But the bigger problem for me is that even if one of these proposals “works”, I still don’t see how that helps towards the goal of ending up with a superintelligent singleton that shares our values and is capable of solving philosophical problems, which I think is necessary to get the best outcome in the long run. An AI that respects my intentions might be “safe” in the immediate sense, but if everyone else has got one, we now have less time to solve philosophy/metaphilosophy before the window of opportunity for building a singleton closes.
(Quoting from a parallel email discussion which we might as well continue here.) My point is that the development of such an AI leaves people like me in a worse position than before. Yes I would ask for “more robust solutions to the control problem” but unless the solutions are on the path to solving philosophy/metaphilosophy, they are only ameliorating the damage and not contributing to the ultimate goal, and while I do want “opportunities for further reflection”, the AI isn’t going to give me more than what I already had before. In the meantime, other people who are less reflective than me are using their AIs to develop nanotech and more powerful AIs, likely forcing me to do the same (before I’d otherwise prefer) in order to remain competitive.
Just a minor terminology quibble: the “black” in “black-box” does not refer to the color, but to the opacity of the box; i.e., we don’t know what’s inside. “White-box” isn’t an obvious antonym in the sense I think you want.
“Clear-box” would better reflect the distinction that what’s inside isn’t unknown (i.e., it’s visible and understandable). Or perhaps open-box might be even better, since not only we know how it works but also we put it there.
White-box is, nevertheless, the accepted name for the concept he was referring to—probably as an antonym to black-box.
English. What can you do.
Huh. I’ve never encountered it, and I would have bet ten to one that if it existed I’d have seen it. Time to check some of those priors...
Thanks for letting me know.
I actually checked Wikipedia before using the term, since I had the same thought as you, but “white-box testing” seems to be the most popular term (it’s the title of the article and used throughout), in preference to “clear box testing” and a bunch of others that are listed in parentheses under “also known as”.
Right, sorry. I was so sure that I’d have heard the term before if it existed, and that you’d invented it yourself, that it never occurred to me to check. Well, you learn something new every day :)
Just to be clear, you are proposing that mere friendliness is insufficient, and we also want optimality with respect to getting as much of the cosmos as we can? This seems contained in friendliness, but OK. You are not proposing that optimally taking over the universe is sufficient for friendliness, right?
I’ve been thinking a lot about this, and I also think this is most likely to work. On general principle, understanding the problem and then solving it indirectly is more promising than trying to solve the problem directly without knowing what it is. If we take the direct normative or black-box approach without knowing what the problem is, how do we even know that it has been solved?
I would amend, though, that just understanding the nature of metaphilosophy is not going to be enough, and there will have to be a certain level of blackboxiness, in that certain questions (i.e. what is good) are only answerable with respect to human brains. I’m unsure if this falls under what you mean by blackbox approaches, though.
I just want to make clear that white-box is not coming up with some simple theory of how all of everything can be derived from pure reason; it’s more like a relatively simple theory of how the structure of the human brain, human culture, physics, logic, and so on relate to the answers to the philosophical questions.
More generally, there is a spectrum of how meta you go on what the problem is:
Directly going about life is the lowest level
Thinking about what you want and being a bit strategic
Realizing that you want a powerful agent acting in your interest, and building an AI to solve that problem. (Your normative AI)
Explicitly modelling the reasoning you did to come up with parts of your AI, except doing it better with fewer constraints (your black-box metaphilosophy)
Explicitly asking “what general problems are we solving when doing philosophy, and how would we do that in general?”, and building that process (your white-box metaphilosophy)
Something even further?
This spectrum generalizes to other problems, for example I use it a lot in my engineering work. “What problem are we solving?” is an extremely valuable question to answer, IMO.
This relates to the expert-at/expert-on dichotomy and discussing problems before proposing solutions and so on.
Anyways, I think it is a mostly smooth spectrum of increasing understanding of what exactly the problem is, and we want to position ourselves optimally on that, rather than a trichotomy.
For going to the grocery store, going full meta is probably useless, but for FAI, I agree that we should go as far as we can towards a full understanding of “what problem are we solving, and how would we get a solution, in principle?”. And then get the AI to do the actual solving, because the details are likely beyond human ability.
Meh. If we can get a safe AI, we’ve essentially done the whole of the work. Optimality can be tacked on easily at that point, bearing in mind that what may seem optimal to some may be an utter hellish disaster to others (see Repugnant Conclusion), so some sort of balanced view of optimality will be needed.
I’m not seeing this. Suppose we’ve got an Oracle AI that’s been safely boxed, which we can use to help us solve various technical problems. How do we get to optimality from there, before other people take our Oracle AI technology and start doing unsafe things with it? I’ve argued, in this post, that getting to optimality requires solving many hard philosophical problems, and it doesn’t seem like having an AI that’s merely “safe” helps much with that.
Sure, no argument there.
To refine both of our ideas: I was thinking that safety for an autonomous or unleashed AI was practically the same thing as optimality.
But I agree that there may be systems of containment that could make certain AI designs safe, without needing optimality.
How is that defined? I would expect that minimizing astronomical waste would be the same as maximizing the amount used for intrinsically valuable things, which would be the same as maximizing utility.
Human intelligence is already getting substantially enhanced, and more so all the time. No doubt all parties will use the tools available—increasingly including computer-augmented minds as time passes.
So: I’m not clear about where it says that this is their plan.