If the super-powerful SAT solver thing finds the plans but doesn’t execute them, would you still lump it with optimizer_2? (I know it’s just terminology and there’s no right answer, but I’m just curious about what categories you find natural.)
(BTW this is more-or-less a description of my current Grand Vision For AGI Safety, where the “dynamics of the world” are discovered by self-supervised learning, and the search process (and much else) is TBD.)
This is great, thanks again for your time and thoughtful commentary!
RE “I’m not entirely convinced that predictions should be made in a way that’s completely divorced from their effects on the world.”: My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don’t claim that this is definitely the One Right Answer To AGI Safety (see “4. Give up, and just make an agent with value-aligned goals” in the post), but I think it is a plausible (and neglected) candidate answer. See also my post In defense of oracle (tool) AI research for why I think it would solve the AGI safety problem.
If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it's a goal-seeking agent. In this case, we need to solve value alignment—even if the goal is as simple as "answer my question". (We would need to make sure that the goal is what it's supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator's brain counts as "answer my question".) Again, I'm not opposed to building agents after solving value alignment, but we haven't solved value alignment yet, and thus it's worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or, if it does model them incidentally, does not do anything with that information).
Interfacing with a non-agential AGI is generally awkward. You can't directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like "If there were no AGIs in the world, what's the likeliest way that a person would find a cure for Alzheimer's?" This type of question does not require the AGI to think through the consequences of its output, and it also has other nice properties (it should give less weird and alien and human-unfriendly answers than the solutions a direct goal-seeking agent would find).
OK, that’s my grand vision and motivation, and why I’m hoping for “no reasoning about the consequences of one’s output whatsoever”, as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one’s outputs is OK, but I’m nervous.)
Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I’m not sure, and I keep changing my mind. And it may also be different answers depending on the algorithm details.
My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X’s that result in high P(Y).
My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn't happen, as suggested by interstice's comments on this page.) So if X1 leads to one of 500 slightly different Y1's (Y1a, Y1b,...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1's in aggregate are likelier than Y2; so X2 is at an unfair advantage.
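Here's a toy numerical illustration of that unfair advantage (the numbers are made up, and "pick the single most likely context" is a deliberately crude stand-in for whatever heuristic the real system uses):

```python
# Toy illustration of the low-entropy bias described above (made-up numbers).
# Prediction X1 leads to 500 slightly different continuations Y1a, Y1b, ...,
# each with probability 0.0015 (total mass 0.75), while prediction X2 leads to
# a single continuation Y2 with probability 0.25.

outcomes = {f"Y1_{i}": 0.0015 for i in range(500)}  # consequences of X1
outcomes["Y2"] = 0.25                               # consequence of X2

# A system that commits to the single most likely long-term context picks Y2...
print(max(outcomes, key=outcomes.get))  # -> "Y2"

# ...even though the X1-continuations are collectively three times likelier:
print(sum(p for y, p in outcomes.items() if y.startswith("Y1")))  # 0.75 > 0.25

# Randomly drawing a context from the full distribution (interstice's point)
# would land on some Y1 about 75% of the time, removing the bias toward X2.
```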
Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.
OK, hmm, let me try again then. This would be the section of the post entitled “A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker”.
I've been assuming all along that the objective function only rewards the next word. Unfortunately, it seems that the way to achieve this objective in practice is to search for higher-level longer-term contexts that surround the next word, like when we're watching TV and we think, "A commercial break is starting." Knowing that a commercial break is starting is essential for predicting the very next frame on the TV screen, but it is also incidentally an (implicit) prediction about what will appear on the screen for the next few minutes. In other words, you could say that making accurate (possibly implicit) probabilistic predictions about the next many words is instrumentally useful for making accurate probabilistic predictions about the next one word, and is thus rewarded by the objective function. I expect that systems that work well will have to be designed this way (i.e. finding "contexts" that entail implicit predictions about many future words, as a step towards picking the single next word). I think this kind of thing is necessary to implement even very basic things like object permanence.
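To spell out that instrumental-usefulness point with a toy calculation (all numbers and categories made up): the next-frame prediction is effectively a marginalization over hypothesized higher-level contexts, so getting the context distribution right is directly rewarded by the next-frame (or next-word) objective, even though each context is itself an implicit prediction about the next several minutes:

```python
# Toy sketch: next-frame prediction as a marginalization over higher-level contexts.
contexts = {
    "commercial break starting": 0.7,
    "show continuing":           0.3,
}
p_frame_given_context = {
    "commercial break starting": {"ad_logo": 0.9,  "show_scene": 0.1},
    "show continuing":           {"ad_logo": 0.05, "show_scene": 0.95},
}

# P(next frame) = sum over contexts of P(context) * P(next frame | context)
p_frame = {}
for ctx, p_ctx in contexts.items():
    for frame, p in p_frame_given_context[ctx].items():
        p_frame[frame] = p_frame.get(frame, 0.0) + p_ctx * p
print(p_frame)  # {'ad_logo': 0.645, 'show_scene': 0.355}
```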
Then the next step is to suppose that the system (being highly intelligent) comes to believe that the prediction X will cause other aspects of the longer-term context to be Y. (See the "Hypothesis 1" vs "Hypothesis 2" examples in the post.) If the system was previously thinking that P(X) is high and P(Y) is low, then ideally, the realization that X implies Y will cause the system to raise P(Y), while keeping P(X) at its previous value. This is, after all, the logically correct update, based on the direction of causality!
But if the system screws up, and lowers P(X) instead of raising P(Y), then it will make a manipulative prediction—the output is being chosen partially for its downstream interactive effects. (Not all manipulative predictions are dangerous, and there might be limits to how strongly it optimizes its outputs for their downstream effects, but I suspect that this particular case can indeed lead to catastrophic outcomes, just like we generically expect from AIs with real-world human-misaligned goals.)
Why should the system screw up this way? Just because the system’s causal models will sometimes have mistakes, and sometimes have uncertainties or blank spaces (statistical-regularities-of-unknown-cause), and also because humans make this type of mistake all the time (“One man’s modus ponens is another man’s modus tollens”). I suspect it will make the right update more often than chance, I just don’t see how we can guarantee that it will never make the wrong update in the manipulative Y-->X direction.
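Here's a cartoon of the right vs. wrong update, treating "X implies Y" as certain for simplicity (a real system would of course do something messier and less explicit):

```python
# Cartoon of the two possible updates after the system notices "X implies Y".
# X is a candidate prediction, Y its downstream effect; numbers made up.
p_x, p_y = 0.6, 0.1   # prior beliefs: X likely, Y unlikely

# Correct (causal) update: X is upstream, so P(Y) should rise to at least P(X),
# while P(X) stays put.
p_x_good, p_y_good = p_x, max(p_y, p_x)        # -> 0.6, 0.6

# Mistaken (modus tollens) update: the low P(Y) gets treated as evidence
# against X, so P(X) falls to at most P(Y). Now X is disfavored *because of*
# its downstream effect; that's the manipulative failure mode.
p_x_bad, p_y_bad = min(p_x, p_y), p_y          # -> 0.1, 0.1
```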
Does that help?
Well, strategy 1 is "Keep it from thinking that it's in an interactive environment". Things like "don't adjust the weights of the network while we ask questions" are ways to prevent it from thinking that it's in an interactive environment based on first-hand experience—we're engineering the experience to not leave traces in its knowledge. But to succeed in strategy 1, we also need to make sure that it doesn't come to believe it's in an interactive environment by other means besides first-hand experience, namely by abstract reasoning. More details in this comment, but basically an AGI with introspective information and world-knowledge will naturally, over time, figure out that it's an AGI, figure out the sorts of environments that AGIs are typically in, and thus hypothesize the existence of interactions even if those interactions have never happened before, and were not intended by the designer (e.g. the "Help I'm trapped in a GPU!" type interactions).
Yeah, I think something like that would probably work for 1B, but 1B is the easy part. It’s 1C & 1D that are keeping me up at night...
Regarding “computational threshold”, my working assumption is that any given capability X is either (1) always and forever out of reach of a system by design, or (2) completely useless, or (3) very likely to be learned by a system, if the system has long-term real-world goals. Maybe it takes some computational time and effort to learn it, but AIs are not lazy (unless we program them to be). AIs are just systems that make good decisions in pursuit of a goal, and if “acquiring capability X” is instrumentally helpful towards achieving goals in the world, it will probably make that decision if it can (cf. “Instrumental convergence”).
If I have a life goal that is best accomplished by learning to use a forklift, I’ll learn to use a forklift, right? Maybe I won’t be very fluid at it, but fine, I’ll operate it more slowly and deliberately, or design a forklift autopilot subsystem, or whatever...
Right, I was using “output” in a broad sense of “any way that the system can causally impact the rest of the world”. We can divide that into “intended output channels” (text on a screen etc.) and “unintended output channels” (sending out radio signals using RAM etc.). I’m familiar with a small amount of work on avoiding unintended output channels (e.g. using homomorphic encryption or fancy vacuum-sealed Faraday cage boxes).
Usually the assumption is that a superintelligent AI will figure out what it is, and where it is, and how it works, and what all its output channels are (both intended and unintended), unless there is some strong reason to believe otherwise (example). I’m not sure this answers your question … I’m a bit confused at what you’re getting at.
AGIs will have a causal model of the world. If their own output is part of that model, and they work forward from there to the real-world consequences of their outputs, and they choose outputs partly based on those consequences, then it's an agent by (my) definition. The outputs are called "actions" and the consequences are called "goals". In all other cases, I'd call it a service, unless I'm forgetting about some edge cases.
A system whose only output is text on a screen can be either a service or an agent, depending on the computational process generating the text. A simple test: if there's a weird, non-obvious way to manipulate the people reading the text (according to the everyday, bad-connotation sense of "manipulate"), would the system take advantage of it? Agents would do so (by default, unless they had a complicated goal involving ethics etc.), services would not by default.
Nobody knows how to build a useful AI capable of world-modeling and formulating intelligent plans but which is not an agent, although I’m personally hopeful that it might be possible by self-supervised learning (cf. Self-Supervised Learning and AGI safety).
My own updates after I wrote that were:
Increased likelihood of self-supervised learning algorithms as either a big part or even the entirety of the technical path to AGI—insofar as self-supervised learning is the lion’s share of how the neocortex learning algorithm supposedly works. That’s why I’ve been writing posts like Self-Supervised Learning and AGI safety.
Shorter timelines and faster takeoff, insofar as we think the algorithm is not overwhelmingly complicated
Increased likelihood of “one algorithm to rule them all” over Comprehensive AI Services. This might be on the meta-level of one learning algorithm to rule them all, and we feed it biology books to get a superintelligent biologist, and separately we feed it psychology books and nonfiction TV to get a superintelligent psychological charismatic manipulator, etc. Or it might be on the base level of one trained model to rule them all, and we train it with all 50 million books and 100,000 years of YouTube and anything else we can find. The latter can ultimately be more capable (you understand biology papers better if you also understand statistics, etc. etc.), but on the other hand the former is more likely if there are scaling limits where memory access grinds to a halt after too many gigabytes get loaded into the world-model, or things like that. Either way, it would make it likelier for AGI (or at least the final missing ingredient of AGI) to be developed in one place, i.e. the search-engine model rather than the open-source software model.
People (and robots) model the world by starting with sensor data (vision, proprioception, etc.), then finding low-level (spatiotemporally-localized) patterns in that data, then higher-level patterns in the patterns, patterns in the patterns in the patterns, etc. I’m trying to understand how this relates to “abstraction” as you’re talking about it.
Sensor data, say the bits recorded by a video camera, is not a causal diagram, but it is already an “abstraction” in the sense that it has mutual information with the part of the world it’s looking at, but is many orders of magnitude less complicated. Do you see a video camera as an abstraction-creator / map-maker by itself?
What if the video camera has an MPEG converter? MPEG encoders can (I think) recognize that low-level pattern X tends to follow low-level pattern Y, and this is more-or-less the same low-level primitive out of which humans build their sophisticated causal understanding of the world (according to my current understanding of the human brain's world-modeling algorithms). So is a video camera with an MPEG converter an abstraction-creator / map-maker? What's your thinking?
(1) You might give some thought to trying to copy (or at least understand) the world model framework of the human brain. There's uncertainty in how that works, but a lot is known, and you'll at least be working towards something that we know for sure is capable of getting built up to a human-level world-model within a reasonable amount of time and computation. As best I can tell (and I'm working hard to understand it myself), and grossly oversimplifying, it's a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc...probably all of those are built out of the same basic "transformation machinery" with different contexts acting as metadata). All these concepts are sitting in the top layer of some kind of loose hierarchy, whose lowest layer consists of (higher-level-context-dependent) probability distributions over spatiotemporal sequences of sensory inputs. See my Jeff Hawkins post for one possible point of departure. I've found a couple other references that are indirectly helpful, and like I said, I'm still trying to figure it out. I'm still trying to understand the "sheaves" approach, so I won't comment on how these compare.
(2) “This conception will be the result of an optimizer, and so this should be in the optimization provenance”—this seems to be important and I don’t understand it. Better understanding the world consists (in part) of chunking sequences of events and actions, suppressing intermediate steps. Thus we say and think “I’ll put some milk in my coffee,” leaving out the steps like unscrewing the top of the jug. The process of “explore the world model, chunking sequences of events when appropriate” is (I suspect) essential to making the world-model usable and powerful, and needs to be repeated millions of times in every nook and cranny of the world model, and thus this is a process that an overseer would have little choice but to approve in general, I think. But this process can find and chunk manipulative causal pathways just as well as any other kind of pathway. And once manipulation is packaged up inside a chunk, you won’t need optimization per se to manipulate, it will just be an obvious step in the process of doing something, just like unscrewing the top of the jug is an obvious step in putting-milk-into-coffee. I’m not sure how you propose to stop that from happening.
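To be concrete about what I mean by chunking, here's a toy sketch (entirely my own formalism, not something from your post):

```python
# Toy sketch of "chunking": a frequently-used sequence of steps gets packaged
# as a single macro-step, suppressing the intermediate steps.
from dataclasses import dataclass

@dataclass
class Chunk:
    name: str
    steps: list   # sub-steps, hidden from the planner once chunked

put_milk_in_coffee = Chunk(
    "put some milk in my coffee",
    ["walk to fridge", "open fridge", "take out jug",
     "unscrew top of jug", "pour milk", "screw top back on"],
)

# A planner using the chunk no longer optimizes over the sub-steps; they just
# come along for the ride. The worry: a chunk could equally well package up a
# manipulative causal pathway, after which no further "optimization" is needed
# to deploy it; it's just an obvious step in the process of doing something.
plan = ["make coffee", put_milk_in_coffee.name, "drink coffee"]
```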
No, I would try to rule out stars based on “a-priori-specifiable consequence”—saying that “stars shine” would be painting a target around the arrow, i.e. reasoning about what the system would actually do and then saying that the system is going to do that. For example, expecting bacteria to take actions that maximize inclusive genetic fitness would certainly qualify as “a priori specifiable”. The other part is “more likely than chance”, which I suppose entails a range of possible actions/behaviors, with different actions/behaviors invoked in different possible universes, but leading to the same consequence regardless. (You can see how every step I make towards being specific here is also a step towards making my “theorem” completely trivial, X=X.)
Let’s say “agent-like behavior” is “taking actions that are more-likely-than-chance to create an a-priori-specifiable consequence” (this definition includes bacteria).
Then I’d say this requires “agent-like processes”, involving (at least) all 4 of: (1) having access to some information about the world (at least the local environment), including in particular (2) how one’s actions affect the world. This information can come either baked into the design (bacteria, giant lookup table), and/or from previous experience (RL), and/or via reasoning from input data. It also needs (3) an ability to use this information to choose actions that are likelier-than-chance to achieve the consequence in question (again, the outcome of this search process could be baked into the design like bacteria, or it could be calculated on-the-fly like human foresight), and of course (4) a tendency to actually execute those actions in question.
I feel like this is almost trivial, like I'm just restating the same thing in two different ways… I mean, if there's no mutual information between the agent and the world, its actions can be effective only insofar as the exact same action would be effective when executed in a random location of a random universe. (Does contracting your own muscle count as "accomplishing something without any world knowledge"?)
Anyway, where I’m really skeptical here is in the term “architecture”. “Architecture” in everyday usage usually implies software properties that are obvious parts of how a program is built, and probably put in on purpose. (Is there a more specific definition of “architecture” you had in mind?) I’m pretty doubtful that the ingredients 1-4 have to be part of the “architecture” in that sense. For example, I’ve been thinking a lot about self-supervised learning algorithms, which have ingredient (1) by design and have (3) sorta incidentally. The other two ingredients (2) and (4) are definitely not part of the “architecture” (in the sense above). But I’ve argued that they can both occur as unintended side-effects of its operation: See here, and also here for more details about (2). And thus I argue at that first link that this system can have agent-like behavior.
(And what's the "architecture" of a bacterium anyway? Not a rhetorical question.)
Sorry if this is all incorrect and/or not in the spirit of your question.
Thanks, that's helpful! I'll have to think about the "self-consistent probability distribution" issue more, and thanks for the links. (ETA: Meanwhile I also added an "Update 2" to the post, offering a different way to think about this, which might or might not be helpful.)
Let me try the gradient descent argument again (and note that I am sympathetic, and indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title “Why won’t it try to get more predictable data?”). My argument here is not assuming there’s a policy of trying to get more predictable data for its own sake, but rather that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent.
The ingredients are things like "Look for and learn patterns in all accessible data", which includes low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process ("After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter"). It includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about ("sneakers are a type of shoe", or more problematically, "my thought processes resemble the associative memory of an AGI"), and cataloging these transformations when they're found. Stuff like that.
So, "make smart hypotheses about one's own embodied situation" is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, "make smart hypotheses about one's own embodied situation" would just be something that happens naturally, unless we somehow prevent it (and I can't see how to prevent it). Likewise, "model one's own real-world causal effects on downstream data" is neither desired by us nor rewarded (as such) by gradient descent. But it can happen anyway, as a side-effect of the usually-locally-helpful rule of "search through the world-model for any patterns and relationships which may impact our beliefs about the upcoming data". Likewise, we have the generally-helpful rule "Hypothesize possible higher-level contexts that span an extended swathe of text surrounding the next word to be predicted, and pick one such context based on how surprising it would be given what it knows about the preceding text and the world-model, and then make a prediction conditional on that context". All these ingredients combine to get the pathological behavior of choosing "Help I'm trapped in a GPU". That's my argument, anyway...
RE "make the superintelligence assume that it is disembodied"—I've been thinking about this a lot recently (see The Self-Unaware AI Oracle) and agree with Viliam that knowledge-of-one's-embodiment should be the default assumption. My reasoning is: A good world-modeling AI should be able to recognize patterns and build conceptual transformations between any two things it knows about, and also should be able to do reasoning over extended periods of time. OK, so let's say it's trying to figure out something about biology, and it visualizes the shape of a tree. Now it (by default) has the introspective information "A tree has just appeared in my imagination!". Likewise, if it goes through any kind of reasoning process, and can self-reflect on that reasoning process, then it can learn (via the same pattern-recognizing algorithm it uses for the external world) how that reasoning process works, like "I seem to have some kind of associative memory, I seem to have a capacity for building hierarchical generative models, etc." Then it can recognize that these are the same ingredients present in those AGIs it read about in the newspaper. It also knows a higher-level pattern "When two things are built the same way, maybe they're of the same type." So now it has a hypothesis that it's an AGI running on a computer.
It may be possible to prevent this cascade of events, by somehow making sure that “I am imagining a tree” and similar things never get written into the world model. I have this vision of two data-types, “introspective information” and “world-model information”, and your static type-checker ensures that the two never co-mingle. And voila, AI Safety! That would be awesome. I hope somebody figures out how to do that, because I sure haven’t. (Admittedly, I have neither time nor relevant background knowledge to try properly.) I’m also slightly concerned that, even if you figure out a way to cut off introspective knowledge, it might incidentally prevent the system from doing good reasoning, but I currently lean optimistic on that.
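Just to illustrate that two-data-types vision, here's a toy Python/mypy sketch (not a real proposal; the hard part is everything this sketch leaves out):

```python
# Toy sketch: tag introspective facts and world-model facts with distinct types,
# so a static type-checker (e.g. mypy) rejects any code that writes one into
# the other. (At runtime these are just strings; only the checker enforces it.)
from typing import List, NewType

IntrospectiveFact = NewType("IntrospectiveFact", str)
WorldModelFact = NewType("WorldModelFact", str)

world_model: List[WorldModelFact] = []

def record_world_fact(fact: WorldModelFact) -> None:
    world_model.append(fact)

record_world_fact(WorldModelFact("trees have leaves"))         # fine
record_world_fact(IntrospectiveFact("I am imagining a tree"))  # mypy error
# Of course this says nothing about introspective information sneaking in
# through some untyped side channel, which is exactly where I get stuck.
```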
I think I have an example of “an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful”. I posted it a couple days ago: Self-supervised learning & manipulative predictions. A self-supervised learning system is an optimizer_1: It’s trying to predict masked bits in a fixed, pre-loaded set of data. This task does not entail interacting with the world, and we would presumably try hard to design it not to interact with the world.
However, if it was a powerful learning system with world-knowledge (via its input data) and introspective capabilities, it would eventually figure out that it’s an AGI and might hypothesize what environment it’s in, and then hypothesize that its operations could affect its data stream via unintended causal pathways, e.g. sending out radio signals. Then, if it used certain plausible types of heuristics as the basis for its predictions of masked bits, it could wind up making choices based on their downstream effects on itself via manipulating the environment. In other words, it starts acting like an optimizer_2.
I'm not super-confident about any of this and am open to criticism. (And I agree with you that this is a useful distinction regardless; indeed I was arguing a similar (but weaker) point recently, maybe not as elegantly, at this link.)
Thanks, that’s helpful!
The way I’m currently thinking about it, if we have an oracle that gives superintelligent and non-manipulative answers, things are looking pretty good for the future. When you ask it to design a new drug, you also ask some follow-up questions like “How does the drug work?” and “If we deploy this solution, how might this impact the life of a typical person in 20 years time?” Maybe it won’t always be able to give great answers, but as long as it’s not trying to be manipulative, it seems like we ought to be able to use such a system safely. (This would, incidentally, entail not letting idiots use the system.)
I agree that extracting information from a self-supervised learner is a hard and open problem. I don’t see any reason to think it’s impossible. The two general approaches would be:
Manipulate the self-supervised learning environment somehow. Basically, the system is going to know lots of different high-level contexts in which the statistics of low-level predictions are different—think about how GPT-2 can imitate both middle school essays and fan-fiction. We would need to teach it a context in which we expect the text to reflect profound truths about the world, beyond what any human knows. That's tricky because we don't have any such texts in our database. But maybe if we put a special token in the 50 clearest and most insightful journal articles ever written, and then stick that same token in our question prompt, then we'll get better answers (see the toy sketch below). That's just an example; maybe there are other ways.
Forget about text prediction, and build an entirely separate input-output interface into the world model. The world model (if it’s vaguely brain-like) is “just” a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc...probably all of those are built out of the same basic “transformation machinery”). All these concepts are sitting in the top layer of some kind of hierarchy, whose lowest layer consists of probability distributions over short snippets of text (for a language model, or more generally whatever the input is). So that’s the world model data structure. I have no idea how to build a new interface into this data structure, or what that interface would look like. But I can’t see why that should be impossible...
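Here's a toy sketch of the special-token idea from the first approach; the token string and the model interface are hypothetical, just to make it concrete:

```python
# Toy sketch of the "special token" idea. Everything here (the token string,
# the model's generate() method) is hypothetical.
INSIGHT_TOKEN = "<|lucid|>"

def tag_training_corpus(documents, insightful_ids):
    """Prepend the marker token to the handful of unusually clear, insightful texts."""
    return [
        (INSIGHT_TOKEN + " " + doc) if i in insightful_ids else doc
        for i, doc in enumerate(documents)
    ]

def ask(model, question):
    """Include the same marker at query time, so the model continues the text
    in the 'unusually clear and insightful' context it has (hopefully) learned."""
    prompt = f"{INSIGHT_TOKEN} Q: {question}\nA:"
    return model.generate(prompt)   # hypothetical interface
```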
I do think I understand that. I see E as a means to an end. It's a way to rank-order choices and thus make good choices. If I apply an affine transformation (with positive scale) to E, e.g. I'm way too optimistic about absolutely everything in a completely uniform way, then I still make the same choice, and the choice is what matters. I just want my AGI to do the right thing.
Here, I'll try to put what I'm thinking more starkly. Let's say I somehow design a comparative AGI. This is a system which can take a merit function U and two choices C_A and C_B, and predict which of the two would be better according to U, but it has no idea how good either of those two choices actually is on any absolute scale. It doesn't know whether C_A is wonderful while C_B is even better, or whether C_A is awful while C_B is merely so-so; both of those just return the same answer, "C_B is better". Assume it's not omniscient, so its comparisons are not always correct, but that it's still impressively superintelligent.
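To make the interface concrete, here's a minimal sketch (my own toy formalism; internally it might compute scores, but it only ever exposes pairwise comparisons):

```python
# Minimal sketch of the "comparative AGI" interface: it answers "which of these
# two is better according to U?", never "how good is this choice?".
import functools

def make_comparative_agi(U):
    def better(c_a, c_b):
        return c_a if U(c_a) >= U(c_b) else c_b
    return better

def choose(better, choices):
    # Pick the top choice using only pairwise comparisons; no absolute
    # "expected goodness" is ever reported, so there is nothing to be
    # systematically disappointed about.
    return functools.reduce(better, choices)

# Note: this picks the exact same action as an AGI that estimated U for every
# choice and took the argmax, which is the point of the store comparison below.
```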
A comparative AGI does not suffer the optimizer's curse, right? It never forms any beliefs about how good its choices will turn out, so it couldn't possibly be systematically disappointed. There's always noise and uncertainty, so there will be times when its second-highest-ranked choice would actually turn out better than its highest-ranked choice. But that happens less often than chance. There's no systematic problem: in expectation, the best thing to do (as measured by U) is always to take its top-ranked choice.
Now, it seems to me that, if I go to the AGIs-R-Us store, and I see a normal AGI and a comparative AGI side-by-side on the shelf, I would have no strong opinion about which one of them I should buy. If I ask either one to do something, they’ll take the same sequence of actions in the same order, and get the same result. They’ll invest my money in the same stocks, offer me the same advice, etc. etc. In particular, I would worry about Goodhart’s law (i.e. giving my AGI the wrong function U) with either of these AGIs to the exact same extent and for the exact same reason...even though one is subject to optimizer’s curse and the other isn’t.
I don't think it's related to mild optimization. Pick a target T that can be exceeded (wonderful future, even if it's not the absolute theoretically best possible future). Find the choice Cmax that is (as far as we can tell) the #1 very best by that metric. We expect Cmax to give value E, and it turns out to be V<E, but V is still likely to exceed T, or at least likelier than any other choice. (Insofar as that's not true, it's Goodhart.) The optimizer's curse, i.e. V<E, does not seem to be a problem, or even relevant, because I don't ultimately care about E. Maybe the AI doesn't even tell me what E is. Maybe the AI doesn't even bother guessing what E is; it only calculates that Cmax seems to be better than any other choice.
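Concretely, with made-up numbers:

```python
# Made-up numbers for the argument above.
T = 50    # target: "wonderful future", exceedable
E = 100   # our (optimistically biased) estimate for the top-ranked choice Cmax
V = 80    # how Cmax actually turns out: V < E (optimizer's curse)...
assert V < E and V > T   # ...but V still clears the target, and I never cared about E
# Goodhart would be the separate failure where U itself is the wrong metric,
# so that even V > T doesn't correspond to a future we actually want.
```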
Ah, thanks for clarifying.
The first entry on my “list of pathological things” wound up being a full blog post in length: See Self-supervised learning and manipulative predictions.
RE daemons, I wrote in that post (and have been assuming all along): “I’m assuming that we will not do a meta-level search for self-supervised learning algorithms… Instead, I am assuming that the self-supervised learning algorithm is known and fixed (e.g. “Transformer + gradient descent” or “whatever the brain does”), and that the predictive model it creates has a known framework, structure, and modification rules, and that only its specific contents are a hard-to-interpret complicated mess.” The contents of a world-model, as I imagine it, is a big data structure consisting of gajillions of “concepts” and “transformations between concepts”. It’s a passive data structure, therefore not a “daemon” in the usual sense. Then there’s a KANSI (Known Algorithm Non Self Improving) system that’s accessing and editing the world model. I also wouldn’t call that a “daemon”, instead I would say “This algorithm we wrote can have pathological behavior...”
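To be concrete about the kind of passive data structure I have in mind, here's a toy sketch (grossly oversimplified and entirely my own invention):

```python
# Toy sketch of a passive world-model data structure: discrete concepts plus
# typed transformations between them. A separate known-algorithm (KANSI) system
# reads and edits it; the structure itself never "runs", hence not a daemon.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concept:
    name: str

@dataclass
class Transformation:
    kind: str        # e.g. "cause-effect", "composition", "analogy"
    source: Concept
    target: Concept

@dataclass
class WorldModel:
    concepts: dict = field(default_factory=dict)
    transformations: list = field(default_factory=list)

    def add_concept(self, name):
        self.concepts[name] = Concept(name)
        return self.concepts[name]

    def add_link(self, kind, a, b):
        self.transformations.append(Transformation(kind, a, b))

wm = WorldModel()
sneaker, shoe = wm.add_concept("sneaker"), wm.add_concept("shoe")
wm.add_link("is-a", sneaker, shoe)   # "sneakers are a type of shoe"
```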