Vanessa Kosoy comments on AGI Ruin: A List of Lethalities

Vanessa Kosoy 6 Jun 2022 9:55 UTC
LW: 115 AF: 27
30
AF
First, some remarks about the meta-level:

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly—such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.

Actually, I don’t feel like I learned that much reading this list, compared to what I already knew. [EDIT: To be clear, this knowledge owes a lot to prior inputs from Yudkowsky and the surrounding intellectual circle, I am making no claim that I would derive it all independently in a world in which Yudkowsky and MIRI didn’t exit.] To be sure, it didn’t feel like a waste of time, and I liked some particular framings (e.g. in A.4 separating the difficulty into “unlimited time but 1 try” and “limited time with retries”), but I think I could write something that would be similar (in terms of content; it would be very likely much worse in terms of writing quality).

One reason I didn’t write such a list is, I don’t have the ability to write things comprehensibly. Empirically, everything of substance that I write is notoriously difficult for readers to understand. Another reason is, at some point I decided to write top-level posts only when I have substantial novel mathematical results, with rare exceptions. This is in part because I feel like the field has too much hand-waving and philosophizing and too little hard math (which rhymes with C.38). In part it is because, even if people can’t understand the informal component of my reasoning, they can at least understand there is math here and, given sufficient background, follow the definitions/theorems/proofs (although tbh few people follow).

There’s no plan

Actually, I do have a plan. It doesn’t have an amazing probability of success (my biggest concerns are (i) not enough remaining time and (ii) even if the theory is ready in time, the implementation can be bungled, in particular for reasons of operational adequacy), but it is also not practically useless. The last time I tried to communicate it was 4 years ago, since which time it obviously evolved. Maybe it’s about time to make another attempt, although I’m wary of spending a lot of effort on something which few people will understand.

Now, some technical remarks:

Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.

This is true, but it is notable that deep learning is not equivalent to evolution, and the differences are important. Consider for example a system that is designed to separately (i) learn a generative model of the environment and (ii) search for plans effective on this model (model-based RL). Then, module ii doesn’t inherently have the problem where the solution only optimizes the correct thing in the training environment. Because, this module is not bounded by available training data, but only by compute. The question is then, to 1st approximation, whether module i is able to correctly generalize from the training data (obviously there are theoretical bounds on how good such this generalization can be; but we want this generalization to be at least as good as human ability and without dangerous biases). I do not think current systems do such generalization correctly, although they do seem to have some ingredients right, in particular Occam’s razor / simplicity bias. But we can imagine some algorithm that does.

...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.

Also true, but there is nuance. The key problem is that we don’t know why deep learning works, or more specifically w.r.t. which prior does it satisfy good generalization bounds. If we knew what this prior is, then we could predict some inner properties. For example, if you know your algorithm follows Occam’s razor, for a reasonable formalization of “Occam’s razor”, and you trained it on the sun setting every day for a million days, then you can predict that the algorithm will not confidently predict the sun is going to fail to to set on any given future day. Moreover, our not knowing such generalization bounds for deep learning is a fact about our present state of mathematical ignorance, not a fact about the algorithms themselves.

...there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.

It is true that (AFAIK) nothing like this was accomplished in practice, but the distance to that might not be too great. For example, I can imagine training an ANN to implement a POMDP which simultaneously successfully predicts the environment and complies with some “ontological hypothesis” about how the environment needs to be structured in order for the-things-we-want-to-point-at to be well-defined (technically, this POMDP needs to be a refinement of some infra-POMPD that represents the ontological hypothesis).

The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.

There is a big chunk of what you’re trying to teach which not weird and complicated, namely: “find this other agent, and what their values are”. Because, “agents” and “values” are natural concepts, for reasons strongly related to “there’s a relatively simple core structure that explains why complicated cognitive machines work”. Admittedly, my rough proposal (PreDCA) does have some “weird and complicated” parts because of the acausal attack problem.

Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don’t know so that it can make plans we wouldn’t be able to make ourselves. It knows, at the least, the fact we didn’t previously know, that some action sequence results in the world we want. Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence.

This is inaccurate, because $P \neq N P$ . It is possible to imagine an AI that provides us with a plan for which we simultaneously (i) can understand why it works and (ii) wouldn’t think of it ourselves without thinking for a very long time that we don’t have. At the very least, the AI could suggest a way of building a more powerful aligned AI. Of course, in itself this doesn’t save us at all: instead of producing such a helpful plan, the AI can produce a deceitful plan instead. Or a plan that literally makes everyone who reads it go insane in very specific ways. Or the AI could just hack the hardware/software system inside which it’s embedded to produce a result which counts for it as a high reward but which for us wouldn’t look anything like “producing a plan the overseer rates high”. But, this direction might be not completely unsalvageable^[1].

Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

I agree that the process of inferring human thought from the surface artifacts of human thought require powerful non-human thought which is dangerous in itself. But this doesn’t necessarily mean that the idea of imitating human though doesn’t help at all. We can combine it with techniques such as counterfactual oracles and confidence thresholds to try to make sure the resulting agent is truly only optimizing for accurate imitation (which still leaves problems like attacks from counterfactuals and non-Cartesian daemons, and also not knowing which features of the data are important to imitate might be a big capability handicap).
1. ↩︎
  That said, I feel that PreDCA is more promising than AQD: it seems to require less fragile assumptions and deals more convincingly with non-Cartesian daemons. [EDIT: AQD also can’t defend from acausal attack if the malign hypothesis has massive advantage in prior probability mass, and it’s quite likely to have that. It does not work to solve this by combining AQD with IBP, at least not naively.]
What links here?
- Vanessa Kosoy's comment on A central AI alignment problem: capabilities generalization, and the sharp left turn by So8res (15 Jun 2022 18:24 UTC; 50 points)
- Rob Bensinger 6 Jun 2022 11:05 UTC
  LW: 22 AF: 4
  9
  AF Parent
  There is a big chunk of what you’re trying to teach which not weird and complicated, namely: “find this other agent, and what their values are”. Because, “agents” and “values” are natural concepts, for reasons strongly related to “there’s a relatively simple core structure that explains why complicated cognitive machines work”.
  This seems like it must be true to some degree, but “there is a big chunk” feels a bit too strong to me.
  Possibly we don’t disagree, and just have different notions of what a “big chunk” is. But some things that make the chunk feel smaller to me:
  - Humans are at least a little coherent, or we would never get anything done; but we aren’t very coherent, so the project of piecing together ‘what does the human brain as a whole “want”’ can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.
  - There are shards of planning and optimization and goal-oriented-ness in a cat’s brain, but ‘figure out what utopia would look like for a cat’ is a far harder problem than ‘identify all of the goal-encoding parts of the cat’s brain and “read off” those goals’. E.g., does ‘identifying utopia’ in this context involve uplifting or extrapolating the cat? Why, or why not? And if so, how does that process work?
  - Getting a natural concept into an agent’s goal is a lot harder than getting it into an agent’s beliefs. Indeed, in the context of goals I’m not sure ‘naturalness’ actually helps at all, except insofar as natural kinds tend to be simple and simple targets are easier to hit?
    An obvious way naturalness could help, over and above simplicity, is if we have some value-loading technique that leverages or depends on “this concept shows up in the AGI’s world-model”. More natural concepts can show up in AGI world-models more often than simpler-but-less-natural concepts, because the natural concept is more useful for making sense of sensory data.
  - Vanessa Kosoy 6 Jun 2022 12:37 UTC
    LW: 25 AF: 9
    13
    AF Parent
    Humans are at least a little coherent, or we would never get anything done; but we aren’t very coherent, so the project of piecing together ‘what does the human brain as a whole “want”’ can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.
    
    This is a point where I feel like I do have a substantial disagreement with the “conventional wisdom” of LessWrong.
    
    First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I think that a lot of presumed irrationality is actually rational but deceptive behavior (where the deception runs so deep that it’s part of even our inner monologue). There are exceptions, like hyperbolic discounting, but not that many.
    
    Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent. Therefore, if X is not entirely coherent then X’s preferences are only approximately defined, and hence we only need to infer them approximately. So, the added difficulty of inferring X’s preferences, resulting from the partial incoherence of these preference, is, to large extent, cancelled out by the reduction in the required precision of the answer. The way I expect this cache out is, when the agent has $g < \infty$ , the utility function is only approximately defined, and we can infer it within this approximation. As $g$ approaches infinity, the utility function becomes crisply defined^[1] and can be inferred crisply. See also additional nuance in my answer to the cat question below.
    
    This is not to say we shouldn’t investigate models like dynamically inconsistent preferences or “humans as systems of agents”, but that I expect the number of additional complications of this sort that are actually important to be not that great.
    
    There are shards of planning and optimization and goal-oriented-ness in a cat’s brain, but ‘figure out what utopia would look like for a cat’ is a far harder problem than ‘identify all of the goal-encoding parts of the cat’s brain and “read off” those goals’. E.g., does ‘identifying utopia’ in this context involve uplifting or extrapolating the cat? Why, or why not? And if so, how does that process work?
    
    I’m actually not sure that cats (as opposed to humans) are sufficiently “general” intelligence for the process to make sense. This is because I think humans are doing something like Turing RL (where consciousness plays the role of the “external computer”), and value learning is going to rely on that. The issue is, you don’t only need to infer the agent’s preferences but you also need to optimize them better than the agent itself. This might pose a difficulty, if, as I suggested above, imperfect agents have imperfectly defined preferences. While I can see several hypothetical solutions, the TRL model suggests a natural approach where the AI’s capability advantage is reduced to having a better external computer (and/or better interface with that computer). This might not apply to cats which (I’m guessing) don’t have this kind of consciousness^[2] because (I’m guessing) the evolution of consciousness was tied to language and social behavior.
    
    Getting a natural concept into an agent’s goal is a lot harder than getting it into an agent’s beliefs. Indeed, in the context of goals I’m not sure ‘naturalness’ actually helps at all, except insofar as natural kinds tend to be simple and simple targets are easier to hit?
    
    I’m not saying that the specific goals human have are natural: they are a complex accident of evolution. I’m saying that the general correspondence between agents and goals is natural.
    
    ↩︎
    Asymptotically crisply: some changes are too small to affect the optimal policy, but I’m guessing that they become negligible when considering longer and longer timescales.
    
    ↩︎
    This is not to say cat’s don’t have quasimoral value: I think they do.
    - Rob Bensinger 6 Jun 2022 13:43 UTC
      LW: 10 AF: 4
      7
      AF Parent
      Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
      I’m not sure this is true; or if it’s true, I’m not sure it’s relevant. But assuming it is true...
      Therefore, if X is not entirely coherent then X’s preferences are only approximately defined, and hence we only need to infer them approximately.
      … this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and complexity cashing out as ‘EU-maximizer-ish’ are:
      Maybe I sort-of contain a lot of subagents, and ‘my values’ are the conjunction of my sub-agents’ values (where they don’t conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
      Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.
      In both cases, the fact that my brain isn’t a single coherent EU maximizer seemingly makes things a lot harder and more finnicky, rather than making things easier. These are cases where you could say that my initial brain is ‘only approximately an agent’, and yet this comes with no implication that there’s any more room for error or imprecision than if I were an EU maximizer.
      I’m not saying that the specific goals human have are natural: they are a complex accident of evolution. I’m saying that the general correspondence between agents and goals is natural.
      Right, but this doesn’t on its own help get that specific relatively-natural concept into the AGI’s goals, except insofar as it suggests “the correspondence between agents and goals” is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.
      - Vanessa Kosoy 6 Jun 2022 14:08 UTC
        LW: 8 AF: 3
        3
        AF Parent
        
        Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
        
        I’m not sure this is true; or if it’s true, I’m not sure it’s relevant.
        
        If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions do we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like “this (intuitively compelling) assumption is false” unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum. Which is to say, I find it self-evident that “agents” are exactly the sort of beings that can “want” things, because agency is about pursuing objectives and wanting is about the objectives that you pursue. If you don’t believe this then I don’t know what these words even mean for you.
        
        Maybe I sort-of contain a lot of subagents, and ‘my values’ are the conjunction of my sub-agents’ values (where they don’t conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
        
        Maybe, and maybe this means we need to treat “composite agents” explicitly in our models. But, there is also a case to be made that groups of (super)rational agents effectively converge into a single utility function, and if this is true, then the resulting system can just as well be interpreted as a single agent having this effective utility function, which is a solution that should satisfy the system of agents according to their existing bargaining equilibrium.
        
        Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.
        
        If your agent converges to optimal behavior asymptotically, then I suspect it’s still going to have infinite $g$ and therefore an asymptotically-crisply-defined utility function.
        
        Right, but this doesn’t on its own help get that specific relatively-natural concept into the AGI’s goals, except insofar as it suggests “the correspondence between agents and goals” is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.
        
        Of course it doesn’t help on its own. What I mean is, we are going to find a precise mathematical formalization of this concept and then hard-code this formalization into our AGI design.
        Rob Bensinger 6 Jun 2022 20:22 UTC
        LW: 5 AF: 1
        3
        AF Parent
        If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions do we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like “this (intuitively compelling) assumption is false” unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum.
        Fair enough! I don’t think I agree in general, but I think ‘OK, but what’s your alternative to agency?’ is an especially good case for this heuristic.
        Which is to say, I find it self-evident that “agents” are exactly the sort of beings that can “want” things, because agency is about pursuing objectives and wanting is about the objectives that you pursue.
        The first counter-example that popped into my head was “a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less pleasurable states”. This is a mind we should be able to build, even if it would never evolve naturally.
        Possibly this still qualifies as an “agent” that “wants” and “pursues” things, as you conceive it, even though it doesn’t select actions?
        Vanessa Kosoy 7 Jun 2022 6:23 UTC
        LW: 9 AF: 1
        1
        AF Parent
        My 0th approximation answer is: you’re describing something logically incoherent, like a p-zombie.
        
        My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as “wants”, “experiences” et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the “relatively simple core structure that explains why complicated cognitive machines work”. The other referent is something in our specifically-human “ontological model” of the world (technically, I imagine that to be an infra-POMDP that all our hypotheses our refinements of). Since the latter is a “shard” of the former produced by evolution, the two referents are related, but might not be the same. (For example, I suspect that cats lack natural!consciousness but have human!consciousness.)
        
        The creature you describe does not natural!want anything. You postulated that it is “experiencing more pleasurable and less pleasurable states”, but there is no natural method that would label its states as such, or that would interpret them as any sort of “experience”. On the other hand, maybe if this creature is designed as a derivative of the human brain, then it does human!want something, because our shard of the concept of “wanting” mislabels (relatively to natural!want) weird states that wouldn’t occur in the ancestral environment.
        
        You can then ask, why should we design the AI to follow what we natural!want rather than what we human!want? To answer this, notice that, under ideal conditions, you converge to actions that maximize your natural!want, (more or less) according to definition of natural!want. In particular, under ideal conditions, you would build an AI that follows your natural!want. Hence, it makes sense to take a shortcut and “update now to the view you will predictably update to later”: namely, design the AI to follow your natural!want.