My AI Model Delta Compared To Yudkowsky

Preamble: Delta vs Crux

I don’t natively think in terms of cruxes. But there’s a similar concept that comes more naturally to me, which I’ll call a delta.

Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution—in other words, we might have very different beliefs about lots of stuff in the world.
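
To make that toy example concrete, here’s a minimal sketch in Python. The “world-model program”, the parameter name, and the downstream variables are all invented for illustration; the only point is that one small parameter difference propagates into many different beliefs.

```python
# Two copies of the same "world-model program" that differ in exactly one
# parameter, and therefore in many downstream "beliefs". All names and
# equations here are made up purely for illustration.

def world_model(rain_propensity):
    """One shared program; `rain_propensity` is the single disputed parameter."""
    cloudy_today = rain_propensity > 1.0    # belief about the weather
    umbrella_sales = 100 * rain_propensity  # belief about the local economy
    picnic_goes_ahead = not cloudy_today    # belief about my own plans
    return {
        "cloudy_today": cloudy_today,
        "umbrella_sales": umbrella_sales,
        "picnic_goes_ahead": picnic_goes_ahead,
    }

my_beliefs = world_model(rain_propensity=5)      # I think the parameter is 5
your_beliefs = world_model(rain_propensity=0.3)  # you think it's 0.3

# Same program, one differing parameter, very different downstream beliefs:
for variable in my_beliefs:
    print(variable, my_beliefs[variable], your_beliefs[variable])
```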

If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is to figure out where that one-parameter difference is.

That’s a delta: one or a few relatively “small”/local differences in belief, which, when propagated through our models, account for most of the differences in our beliefs.

For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
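
As a toy illustration of that framing, the sketch below writes a tiny structural model as plain Python (no causal-inference library; the structural equations and variable names are invented), and shows that a single do()-style override of the one disagreeing node is enough to make my model’s downstream values match yours.

```python
# A toy structural model where "do" just means overriding one node's value.
# The equations are invented; the point is that one intervention on the
# single disagreeing node makes the two models' outputs coincide.

def run_model(param, do=None):
    """Tiny structural model; `param` is the one node our models disagree on."""
    p = do if do is not None else param  # do(param := ...) overrides the node
    x = 2 * p                            # downstream variable
    y = x + 1                            # further-downstream variable
    return {"param": p, "x": x, "y": y}

def my_model(**kwargs):
    return run_model(param=5, **kwargs)    # I think the parameter is 5

def your_model(**kwargs):
    return run_model(param=0.3, **kwargs)  # you think it's 0.3

print(my_model() == your_model())        # False: our downstream beliefs differ
print(my_model(do=0.3) == your_model())  # True: one do() closes the whole gap
```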

This post is about my current best guesses at the delta between my AI models and Yudkowsky’s AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky’s as far as I can tell. That said, note that this is not an attempt to pass Eliezer’s Intellectual Turing Test; I’ll still be using my own usual frames.

This post might turn into a sequence if there’s interest; I already have another one written for Christiano, and people are welcome to suggest others they’d be interested in.

My AI Model Delta Compared To Yudkowsky

Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly:

33. The AI does not think like you do, the AI doesn’t have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien—nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified.

Here’s one oversimplified doom argument/story in a world where natural abstraction fails hard:

  1. Humanity is going to build superhuman goal-optimizing agents. (’Cause, like, obviously somebody’s going to do that; there’s no shortage of capabilities researchers loudly advertising that they’re aiming to do that exact thing.) These will be so vastly more powerful than humans that we have basically-zero bargaining power except insofar as AIs are aligned to our interests.

  2. We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all. (For instance, maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system.)

  3. Then:

    1. Obviously full value alignment is out.

    2. Robust and faithful instruction following or “do what I mean” is out; the meaning of human words/concepts can’t be robustly and faithfully represented in the system’s internal ontology at all.

    3. Corrigibility is out, unless (here lies one of Eliezer’s hopes) corrigibility turns out to be such a natural concept that it can faithfully and robustly translate even into the ontology of a very alien AI.

    4. Insofar as an AI cares-as-a-terminal-goal about keeping humans around, it will care about its own alien conception of “humans” which does not match ours, and will happily replace us with less resource-intensive (or otherwise preferable) things which we would not consider “human”.

    5. Interpretability yields, at best, some weak correlative heuristics which won’t generalize well; the lack of 99% reliability in mechinterp is not just because our current methods are primitive.

    6. Etc, etc. All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.

  4. It’s not like this gets any better over time; if anything, AIs’ internal ontologies just keep getting more alien as their power level ramps up.

  5. … so we die as soon as one of these superhuman goal-optimizing agents applies enough optimization pressure to the world and the faithfulness/robustness of the translation fails. (Actually, Eliezer expects we’re likely to die of easier problems before then; but even if our species’ competence were far higher than it currently seems, the translation problem would kill us.)

  6. As an added bonus, the AIs will know all this (‘cause, y’know, they’re smart), will therefore know that divergence between their goals and humans’ goals is inevitable (because their goals are in fundamentally alien ontologies and therefore will diverge out-of-distribution), and will therefore be incentivized to strategically hide their long-term intentions until it’s time for the humans to go.

Note that the “oversimplification” of the argument mostly happened at step 2; the actual expectation here would be that a faithful and robust translation of human concepts is long in the AI’s internal language, which means we would need very high precision in order to instill the translation. But that gets into a whole other long discussion.
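
As a very loose caricature of “the translation is long in the AI’s internal language” (my illustration, not the post’s actual model): if concepts are directions in some feature space, a concept that is a single feature in one basis can require a weighted combination of nearly every feature in a randomly rotated basis. The linear-features framing and the random rotation standing in for an “alien ontology” are assumptions of this sketch.

```python
# Toy sketch: a concept that is one feature in the "human" basis becomes a
# dense combination of ~all features in a randomly rotated "alien" basis.
# Linear features and a random rotation are stand-ins, not claims about
# real AI internals.
import numpy as np

rng = np.random.default_rng(0)
dim = 1000

# In the human basis, the concept is a single feature: one nonzero weight.
concept_human = np.zeros(dim)
concept_human[0] = 1.0

# The alien ontology uses a random orthonormal basis for the same space.
alien_basis, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# Re-express the same concept in the alien basis.
concept_alien = alien_basis.T @ concept_human

print(np.count_nonzero(np.abs(concept_human) > 1e-3))  # 1 coefficient
print(np.count_nonzero(np.abs(concept_alien) > 1e-3))  # ~1000 coefficients: the "translation" is long
```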

By contrast, in a world where natural abstraction basically works, the bulk of human concepts can be faithfully and robustly translated into the internal ontology of a strong AI (and the translation isn’t super-long). So, all those technical alignment possibilities are back on the table.

That hopefully gives a rough idea of how my models change when I flip the natural abstraction bit. It accounts for most of the currently-known-to-me places where my models diverge from Eliezer’s. I put nontrivial weight (maybe about 10-20%) on the hypothesis that Eliezer is basically correct on this delta, though it’s not my median expectation.

  1. ^

    particular time = particular point in the unrolled execution of the program