My take on Vanessa Kosoy’s take on AGI safety

Confidence level: Low

Vanessa Kosoy is a deep fountain of knowledge and insights about AGI safety, but I’ve had trouble understanding some aspects of her point of view. Part of the problem is just pedagogy, and part of it (I will argue) is that she has some different underlying assumptions and beliefs than I do. This post aims to address both those things. In particular, on the pedagogy front, I will try to give a sense for what Vanessa is doing and why, assuming minimal knowledge of either math or theoretical CS. (At least, that’s my intention—please let me know if anything is confusing or jargon-y.)

Here’s an example of where we differ. I tend to think of things like “the problem of wireheading” and “the problem of ontological crises” etc. as being on the critical path to AGI safety—as in, I think that, to build safe AGIs, we’ll need to be talking explicitly about these specific problems, and others like them, and to be addressing those specific problems with specific solutions. But Vanessa seems to disagree. What’s the root cause of that disagreement? More to the point, am I wasting my time, thinking about the wrong things?

Vanessa responds: Actually I don’t think I disagree? I don’t like the name “ontological crisis” since I think it presupposes a particular framing that’s not necessarily useful. However I do think it’s important to understand how agents can have utility functions that depend on unobservable quantities. I talked about it in Reinforcement Learning With Imperceptible Rewards and have more to say in an upcoming post.

Let’s find out!

Many thanks to Vanessa for patiently engaging with me. Also, thanks to Adam Shimi & Logan Smith for comments on a draft.

Summary & Table of Contents

  • Section 1 is just getting situated, i.e. what is the problem we’re trying to solve here?

  • In Section 2, I compare the more popular “algorithms-first approach” to Vanessa’s “desiderata-first approach”. In brief, the former is when you start with an AGI-relevant algorithm and figure out how to make it safe. The latter is when you come up with one or more precise criteria, called desiderata, such that if an algorithm satisfies the desiderata, then it would be safe. Then you go try to find algorithms for which you can prove that they satisfy the desiderata.

  • Sections 3-5 go through the three ingredients needed for AGI safety in Vanessa’s “desiderata-first approach”:

    • Section 3 covers the part where we prove that an AI algorithm satisfies some precisely-defined desiderata. I’ll cover some key background concepts (“regret bounds”, “traps”, “realizability”), and some of Vanessa’s related ideas (“Delegative Reinforcement Learning”, “Infra-Bayesianism”), and how they’re all connected.

    • Section 4 covers the part where we come up with good desiderata. To give a taste of what Vanessa has in mind, I give an intuitive walk-through of a particular example she came up with recently: “The Hippocratic Principle” desideratum, and “Hippocratic Timeline-Driven Learning”, an example type of algorithm that would satisfy the desideratum.

    • Section 5 covers “non-Cartesian daemons”. This part is basically filling in a loophole in the “desiderata-first” framework, namely ruling out bad behaviors unrelated to the AI’s nominal output, like if the AI hacks into the operating system that it’s running on.

  • Section 6 switches to my own opinions:

    • In Section 6.1, I circle back to the “algorithms-first” vs “desiderata-first” distinction from Section 2, arguing that there’s less to it than it first appears, and that a more important difference is the approach to “weird failure modes that x-risk people talk about” (wireheading, ontological crises, deceptive mesa-optimizers, incorrigibility, gradient hacking, etc. etc.). I suggest that I and other “algorithms-first” people view these as problems that need to be tackled head-on, whereas Vanessa believes that we can rule out those problems indirectly. An example of this kind of indirect argument is: if we set things up right, then an incorrigible AI—e.g. an AI that seizes control of its own off-switch—would score poorly on its objective function. So if we prove that the AI will score well on its objective (via a “regret bound”), then it follows indirectly that the AI will be corrigible. I express skepticism about this kind of indirect argument, and the next few subsections are an attempt to flesh out the source of that skepticism.

    • In Section 6.2, I define the concepts of “inner alignment” and “outer alignment”, and relate them to this discussion, focusing in particular on inner alignment.

    • In Section 6.3, I talk about the process of acquiring a very good world-model / hypothesis space / prior. I suggest that, in future powerful AGI-capable algorithms, this process will look less like simple Bayesian credence-updates, and more like active agent-like algorithms that apply planning and meta-cognition and other capabilities towards refining the world-model / prior (I call this “RL-on-thoughts”). I suggest that, for algorithms of this type, we will not be able to start with a regret bound and use it to indirectly rule out those weird failure modes mentioned above. Instead, we would need to directly rule out those weird failure modes before having any chance of proving a regret bound.

    • Section 6.4 has some summary and wrap-up.

(To be clear, while we have some differences of opinion /​ vision, I think Vanessa is doing great work and should absolutely carry on doing what she’s doing. More on that in Section 6.)

1. Biggest picture

Before we dive in, let’s just situate ourselves in the biggest picture. I think Vanessa & I are both on the same page here: we’re trying to solve the (technical) AGI safety problem, sometimes called “single-single alignment”—the red circle in this diagram.

This post is all about the bottom-left. Caveat on this diagram: Needless to say, these are not perfectly clean divisions into unrelated problems. For example, if “the instruction manual for safe AGI” is extremely easy and cheap to follow, then it’s more likely that people will actually follow it.

2. Algorithms-First vs Desiderata-First

Let’s talk about two big-picture research strategies to get safe AGI.

Start with the “algorithms-first” approach on the left. This is not where we’ll find Vanessa. But it’s a very common approach in AGI safety! Some people who work on the left side include:

  • Me! I’m on the left side! My “particular algorithm that plausibly scales to AGI” is the human brain algorithm, or more precisely the several interconnected learning and inference algorithms centered around the neocortex. (There are other things happening in the brain too, but I would argue that they aren’t AGI-relevant. One example is the brainstem circuitry that regulates your heart rate.)

  • Everyone in “prosaic AGI safety” is on the left side too—Their “particular algorithm that plausibly scales to AGI” is any of the deep neural net algorithms typically used by ML researchers today, or variants thereof.

  • There are numerous AGI safety discussions—especially older ones—that assume that AGI will look like “a rational agent with a utility function”, and then proceed to talk about what we want for their utility function, decision function, etc. Those discussions are (at least arguably /​ partly) on the left side as well—in the sense that we’re talking about properties that a future AGI algorithm might have, and the behaviors and failure modes of algorithms with those properties, and then how to fix those failure modes.

By contrast, the right side is Vanessa’s “desiderata-first approach”. A “desideratum” (plural: “desiderata”) is a Latin word meaning “a thing that is desired”. (Hooray, 500 hours of mandatory Latin education in middle school has finally paid off!! oh wait I could have just googled it.)

The idea is: we find desiderata which, if satisfied, give safe AGI. Then we find algorithms for which we can get “formal guarantees” (i.e. proofs) that they will satisfy the desiderata, and then we win!

(Teaser: In Section 6.1 below I’m going to circle back to “algorithms-first vs desiderata-first”, and argue that the approaches are not so different after all.)

So to get at Vanessa’s perspective, we need to talk about two things: “learning algorithms with formal guarantees”, and “desiderata”. That’s where we’re going next: Sections 3 and 4 respectively.

3. Formal guarantees

3.1 Regret bounds, Learning theory

I didn’t put it in the diagram above, but Vanessa has something much more specific in mind: the “formal guarantee” is a performance guarantee.

Regret bounds are the main type of performance guarantee that we’re talking about. The generic setup here is: you have an algorithm that issues certain outputs, and those outputs are scored by some objective function. The goal is to maximize the score. “Regret” is the difference between your algorithm’s score and the score of the best possible algorithm. So higher regret is worse, and a “regret bound” puts a threshold on how high the regret can get.

(In the context of a Bayesian agent, i.e. an algorithm that starts with a “prior” set of possible hypotheses describing how the world works, one of which is true, you can define the regret as how much worse your algorithm does than the algorithm which 100% believes the true hypothesis from the start. Much more on Bayesian agents in a later section.)

By the way, how do you turn a performance guarantee into a safety guarantee? By choosing appropriate desiderata! To take a very stupid unrealistic example, if the objective is “don’t kill anyone”, and you can guarantee that your AI algorithm will score well on the objective, then you can guarantee that your AI algorithm won’t kill too many people! (This is not a real example; we’ll get to an actual example later, in Section 4.1.)

If you’re not familiar with regret bounds, I strongly recommend reading at least a bit about them, just to get a sense of what we’re talking about. Regret bounds in the context of “bandit problems” are discussed for a few pages in Russell & Norvig (4th Ed.), or they’re discussed much more casually in the pop-science book Algorithms to Live By, or they’re discussed much less casually in this 130-page review article. (The latter is on the Vanessa Research Agenda Reading List.)
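To make "regret" concrete, here is a minimal sketch of a bandit problem in Python. Everything in it is an illustrative assumption rather than anything from Vanessa's work: the arm payoff probabilities, the epsilon-greedy strategy, and the function name `run_epsilon_greedy` are all made up for the example. It tracks the expected regret, i.e. the gap between the best arm's average payoff and the average payoff of the arm actually pulled, summed over time.

```python
import random


def run_epsilon_greedy(arm_means, steps, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; return the cumulative expected regret after each step."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    pulls = [0] * n_arms          # how many times each arm was pulled
    totals = [0.0] * n_arms       # total reward collected from each arm
    best_mean = max(arm_means)    # what the best possible strategy earns per step
    regret = 0.0
    regret_curve = []
    for _ in range(steps):
        if rng.random() < epsilon or 0 in pulls:
            arm = rng.randrange(n_arms)  # explore: pick an arm at random
        else:
            # exploit: pick the arm with the best observed average so far
            arm = max(range(n_arms), key=lambda a: totals[a] / pulls[a])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        regret += best_mean - arm_means[arm]  # expected shortfall of this pull
        regret_curve.append(regret)
    return regret_curve


curve = run_epsilon_greedy([0.2, 0.5, 0.8], steps=2000)
print(f"cumulative expected regret after {len(curve)} steps: {curve[-1]:.1f}")
```

Because this toy agent keeps exploring at a fixed rate forever, its regret keeps growing roughly in proportion to the number of steps. The regret bounds in the literature are interesting precisely because they show that smarter strategies can make regret grow much more slowly than that.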

Another terminology note: Vanessa frequently talks about learning theory—for example, she calls it “The learning-theoretic AI alignment research agenda”. Here’s Vanessa explaining the terminology for us:

Usually “statistical learning theory” denotes the study of sample complexity /​ regret bounds disregarding computational complexity constraints whereas “computational learning theory” refers to the study of sample complexity /​ regret bounds under computational complexity constraints. I refer to them jointly as “learning theory” for the sake of brevity and simplicity.

3.2 Path to formal guarantees

One strand of Vanessa’s work is building new, better foundations for learning algorithms with formal guarantees, pushing forward the state-of-the-art, especially in areas that seem like they’ll be relevant for AGI. Here’s a little map of a couple of the relevant aspects:

I haven’t defined any of these terms yet, but hang on, we’ll get there.

3.2.1 Traps

A trap is an irreversibly bad action—like Pressing The History Eraser Button.

In Vanessa’s paper Delegative Reinforcement Learning: Learning to Avoid Traps with a little help, she defined a “trap” more precisely as “a state which, once reached, forces a linear lower bound on regret”. That’s a fancy way of saying that, after stepping into the trap, you not only won’t score the maximum number of points per move after that, but you can’t even approach the maximum number of points per move, ever, no matter what you do from then on.

(Note for clarification: If you’re trying to maximize your score, then a lower bound on regret is an upper bound on your score, and is therefore generally bad news. By contrast, the phrase “regret bounds” by itself usually means “regret upper bounds”, which are generally good news.)
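Here is a toy illustration of what a "linear lower bound on regret" looks like. This is a sketch under made-up assumptions (the reward values and the function name `regret_after_trap` are hypothetical): the agent plays perfectly until some step, then falls into a trap where the reward is stuck below the optimum forever.

```python
def regret_after_trap(steps, trap_step, best_reward=1.0, trap_reward=0.0):
    """Cumulative regret of an agent that plays optimally until it enters a trap."""
    regret = 0.0
    curve = []
    for t in range(steps):
        # Before the trap the agent earns the best possible reward;
        # afterwards it is stuck with trap_reward no matter what it does.
        reward = best_reward if t < trap_step else trap_reward
        regret += best_reward - reward
        curve.append(regret)
    return curve


curve = regret_after_trap(steps=100, trap_step=40)
print(curve[39], curve[99])  # regret just before the trap, and at the end
```

After the trap, regret grows by a fixed amount every single step, so it is bounded below by a straight line in the number of steps. No amount of subsequent cleverness flattens that line.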

The premise of that Vanessa paper is that “most known regret bounds for reinforcement learning are either episodic [i.e. you keep getting to rewind history and try again] or assume an environment without traps. We derive a regret bound without making either assumption…” Well that’s awesome! Because those assumptions sure don’t apply to the real world! Remember, we’re especially interested in super-powerful futuristic AIs here. Taking over the world and killing everyone would definitely count as a “trap”, and there’s no do-over if that happens!

Why do the known regret bounds in the literature typically assume no traps? Because if we don’t want the agent to forever miss out on awesome opportunities, then it needs to gradually figure out what the best possible states and actions are, and the only way to do so is to just try everything a bunch of times, thus thoroughly learning the structure of its environment. So if there’s a trap, that kind of agent will “learn about it the hard way”.

And then how does Vanessa propose to avoid that problem?

Vanessa considers a situation with a human and an AI, who have the same set of possible actions. So think “the human can press buttons on the keyboard, and the AI can also ‘press’ those same keyboard buttons through an API”. Do not think “the human and the robot are in the kitchen cooking onion soup”—because there the AI has actions like “move my robot arm” and the human has actions like “move my human arm”. These are not exactly the same actions. Well, at least they’re not straightforwardly the same—maybe they can be mapped to each other, but let’s not open that can of worms! Also, in the onion soup situation, the human and robot are acting simultaneously, whereas in the Vanessa “shared action space” situation they need to take turns.

Along these lines, Vanessa proposes that at each step, the AI can either take an action, or “delegate” to the human to take an action. She makes the assumption that when the human is in control, the human is sufficiently competent that they (1) avoid traps, and (2) at least sometimes take optimal actions. The AI then performs the best it can while basically restricting itself to actions that it has seen the human do, delegating to the human less and less over time. Her two assumptions above ensure that her proposed AI algorithm both avoids traps and approaches optimal performance.

Whoops! I just wrote “restricting itself to actions that it has seen the human do”, but Vanessa (assuming a Bayesian agent here—see later section), offers the following correction:

More precisely, [the AI] restricts itself to actions that it’s fairly confident the human could do in this situation (state). The regret bound scales as number of hypotheses, not as number of states (and it can be generalized to scale with the dimension of the hypothesis space in an appropriate sense)
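To give a flavor of "act only if you're fairly confident the human could do this here, otherwise delegate", here is a very rough sketch. To be clear, this is not the algorithm from the paper: the hypotheses, the state and action names, the 0.95 confidence threshold, and all function names are assumptions invented for illustration. Each hypothesis is a guess about the human's policy, and the AI does Bayesian updating on the human's observed actions.

```python
def posterior_update(posterior, hypotheses, state, human_action):
    """Bayes' rule: reweight each hypothesis by how likely it made the observed human action."""
    weights = {
        h: p * hypotheses[h][state].get(human_action, 0.0)
        for h, p in posterior.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("observed action is impossible under every hypothesis")
    return {h: w / total for h, w in weights.items()}


def human_could_do(posterior, hypotheses, state, action):
    """Posterior mass on hypotheses under which the human might take `action` in `state`."""
    return sum(
        p for h, p in posterior.items()
        if hypotheses[h][state].get(action, 0.0) > 0.0
    )


def act_or_delegate(posterior, hypotheses, state, preferred_action, confidence=0.95):
    """Take the preferred action only if the AI is confident the human could take it too."""
    if human_could_do(posterior, hypotheses, state, preferred_action) >= confidence:
        return preferred_action
    return "DELEGATE"


# Two hypothetical guesses about what the human does in state "s0".
hypotheses = {
    "h_varied": {"s0": {"a": 0.5, "b": 0.5}},
    "h_cautious": {"s0": {"a": 1.0}},
}
prior = {"h_varied": 0.5, "h_cautious": 0.5}

first = act_or_delegate(prior, hypotheses, "s0", "b")        # not confident yet
posterior = posterior_update(prior, hypotheses, "s0", "b")   # human is seen doing "b"
second = act_or_delegate(posterior, hypotheses, "s0", "b")   # now confident
print(first, second)
```

Note that the confidence check runs over hypotheses rather than over individual states, which is at least in the spirit of Vanessa's point that the bound scales with the number of hypotheses.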

That’s just one paper; Vanessa also has various other results and ideas about avoiding traps (see links here for example). But I think this is enough to get a taste. Let’s keep moving forward.

3.2.2 Good “low-stakes” performance

Once we solve the traps problem as discussed above, the rest of the problem is “low-stakes”. This is a Paul Christiano term, not a Vanessa term, but I like it! It’s defined by: “A situation is low-stakes if we care very little about any small number of decisions.” Again, we’re avoiding traps already, so nothing irreversibly bad can happen; we just need to hone the algorithm to approach very good decisions sooner or later (and preferably sooner).

I think Vanessa’s perspective is that proving results about low-stakes performance hinges on the concept of “realizability”, so let’s turn to that next.

3.2.3 The realizability problem (a.k.a. “grain-of-truth”)

Here I need to pause and specify that Vanessa generally thinks of her AIs as doing Bayesian reasoning, or at least something close to Bayesian reasoning (more on which shortly). So such an AI has some collection of hypotheses (≈ generative models) describing its world and its objective function. It assigns different credences (= probabilities) to the different hypotheses. Every time it gets a new observation, it shifts around the credences among its different hypotheses, following Bayes’ rule.

Whoa, hang on a second. When I think of reinforcement learning (RL) agents, I think of lots of very different algorithms, of which only a small fraction are doing anything remotely like Bayesian reasoning! For example, Deep Q Learning agents sure don’t look Bayesian! So is it really OK for Vanessa to (kinda) assume that our future powerful AIs will be more-or-less Bayesian? Good question! I actually think it’s quite a problematic assumption, for reasons that I’ll get back to in Section 6.3.

Vanessa replies: “I am not making any strong claims about what the algorithms are doing. The claim I’m making is that Bayesian regret bounds is a useful desideratum for such algorithms. Q-learning certainly satisfies such a bound (about deep Q we can’t prove much because we understand deep learning so poorly).”
My attempted rephrasing of Vanessa’s reply: Maybe we won’t actually use “a Bayesian agent with prior P and decision rule D” (or whatever) as our AGI algorithm, because it’s not computationally optimal. But even so, we can still reason about what this algorithm would do. And whatever it would do, we can call that a “benchmark”! Then we can (1) prove theorems about this benchmark, and (2) prove theorems about how a different, non-Bayesian algorithm performs relative to that benchmark. (Hope I got that right!) (Update: Vanessa responds/​clarifies again in the comment section.)

Let’s just push forward with the Bayesian perspective. As discussed in this (non-Vanessa) post, Bayesian updating works great, and converges to true beliefs, as long as the true hypothesis is one of the hypotheses in your collection of hypotheses under consideration. (Equivalently, “the true underlying environment which is generating the observations is assumed to have at least some probability in the prior.”) This is called realizability. In realizable settings, the agent accumulates more and more evidence that the true hypothesis is true and that the false hypotheses are false, and eventually it comes to trust the true hypothesis completely. Whereas in non-realizable settings, all bets are off! Well, that’s an exaggeration, you can still prove some things in the non-realizable setting, as discussed in that post. But non-realizability definitely messes things up, and seems like it would get in the way of the performance guarantees we want.
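Here is a small sketch of the difference between the realizable and non-realizable cases, using coin flips. The setup is an assumption chosen for illustration: each hypothesis is just a candidate bias for a coin, the data is a fixed sequence with 70% heads, and the function name `bayes_posterior` is made up.

```python
def bayes_posterior(hypotheses, prior, observations):
    """Update credences over coin-bias hypotheses after a sequence of 0/1 observations."""
    posterior = list(prior)
    for obs in observations:
        # Likelihood of this observation under each hypothesized bias.
        likelihoods = [h if obs == 1 else 1.0 - h for h in hypotheses]
        posterior = [p * l for p, l in zip(posterior, likelihoods)]
        total = sum(posterior)
        posterior = [p / total for p in posterior]  # renormalize (Bayes' rule)
    return posterior


# Deterministic stand-in for data from a coin with true bias 0.7.
flips = ([1] * 7 + [0] * 3) * 100

realizable = [0.3, 0.5, 0.7]    # the true bias is in the hypothesis collection
misspecified = [0.3, 0.5, 0.9]  # the true bias is NOT in the collection

post_realizable = bayes_posterior(realizable, [1 / 3] * 3, flips)
post_misspecified = bayes_posterior(misspecified, [1 / 3] * 3, flips)
print([round(p, 3) for p in post_realizable])
print([round(p, 3) for p in post_misspecified])
```

In the realizable case, essentially all the credence ends up on the true hypothesis. In the misspecified case, the agent ends up just as confident, but in a hypothesis that is false (in this toy case, whichever wrong hypothesis fits the data least badly). That overconfidence in a wrong model is the kind of thing that gets in the way of performance guarantees.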

Unfortunately, in complex real-world environments, realizability is impossible. Not only does the true generative model of the universe involve atoms, but also some of those atoms comprise the computer running the AI itself—a circular reference!

It seems to me that the generic way to make progress in a non-realizable setting is to allow hypotheses that make predictions about some things but remain agnostic about others. I’ve seen this solution in lots of places:

  • Scientific hypotheses—if you treat The Law Of Conservation Of Energy as a “hypothesis”, and you “ask” Conservation Of Energy what the half-life of tritium is, then Conservation Of Energy will tell you “Huh? How should I know? Why are you asking me?” In general, if a scientific hypothesis makes a prediction that’s falsified by the data, that’s a big problem, and we may have to throw the hypothesis out. Whereas if a scientific hypothesis refrains from making any prediction whatsoever about some data, that’s totally fine! (NB: When evaluating a hypothesis, we do weigh how many correct predictions the hypothesis makes, versus how complex the hypothesis is. But my point is: nobody is counting how many predictions the hypothesis does not make. Nobody is saying “Pfft, yeah man, I don’t know about this scientific hypothesis—it doesn’t predict plate tectonics, and it doesn’t predict the mating dance of hummingbirds, and it doesn’t predict where I left my keys, and …”)

  • Prediction markets—In a healthy prediction market, we would expect specialization, where any given trader develops an expertise in some area (e.g. Brazilian politics), does a lot of trading on contracts in that area, and declines to bid on unrelated areas. You can treat each trader as a type of hypothesis /​ belief, and thus we get a framework for building predictions out of lots of hypotheses that are individually agnostic about most things. And indeed, prediction markets can be used as a framework for epistemology—this is the idea of Vovk-Shafer-Garrabrant Induction (a.k.a. logical induction).

  • Bayes nets, neocortex—There’s at least a vague sense in which each individual edge of a Bayes net “has an opinion” about the truth value of the two nodes that it connects to, while being agnostic about the truth value of all the other nodes. I think the neocortex has something similar going on too. For example, I can imagine a “stationary purple rock” but not a “stationary falling rock”. Why? The concepts “stationary” and “falling” are incompatible because they are expressing mutually-contradictory “opinions” about the same variables. By contrast, “stationary” and “purple” are compatible—they’re mostly making predictions about non-overlapping sets of variables. So I think of all this stuff as at least vaguely prediction-market-ish, but maybe don’t take that too literally.

  • Infra-Bayesianism—Vanessa’s signature approach! That’s the next subsection.

(Side note: I’m suggesting here that the prediction market approach and the infra-Bayesianism approach are kinda two solutions to the same non-realizability problem. Are they equivalent? Complementary? Is there a reason to prefer one over the other? I’m not sure! For my part, I’m partial to the prediction market approach because I think it’s closer to how the neocortex builds a world-model, and therefore almost definitely compatible with AGI-capable algorithms. By contrast, infra-Bayesianism might turn out to be computationally intractable in practice. Whatever, I dunno, I guess we should continue to work on both, and see how they turn out. Infra-Bayesianism could also be useful for conceptual clarity, proving theorems, etc., even if we can’t directly implement it in code. Hmm, or maybe that’s the point? Not sure.)

Vanessa reply #1: [The prediction market approach] is a passive forecasting setting, i.e. the agent just predicts/​thinks. In a RL setting, it’s not clear how to apply it at all. Whereas infra-Bayesianism [IB] is defined for the RL setting which is arguably what we want. Hence, I believe IB is the correct thing or at least the more fundamental thing.
Vanessa reply #2: I see no reason at all to believe IB is less [computationally] tractable than VSG induction. Certainly it has tractable algorithms in some simple special cases.

3.2.4 Infra-Bayesianism

I guess the authoritative reference on this topic is The Infra-Bayesianism Sequence, by the inventors of infra-Bayesianism, Vanessa & Alex Appel (username: “Diffractor”). For more pedagogy and context see Infra-Bayesianism Unwrapped by Adam Shimi or Daniel Filan’s podcast interview with Vanessa.

The starting point of infra-Bayesianism is that a “hypothesis” is a “convex set of probability distributions”, rather than just one probability distribution. On the podcast Vanessa offers a nice example of why we might do this:

Suppose your world is an infinite sequence of bits, and one hypothesis you might have about the world is maybe all the odd bits are equal to zero. This hypothesis doesn’t tell us anything about even bits. It’s only a hypothesis about odd bits, and it’s very easy to describe it as such a convex set of probability distributions over all of the bits. We just consider all probability distributions that predict that the odd bits will be zero with probability 1, and without saying anything at all about the even bits. They can be anything. They can even be uncomputable. You’re not trying to have a prior over the even bits. (lightly edited /​ excerpted from what Vanessa said here)

So, just as in the previous section, the goal is to build hypotheses that are agnostic about some aspects of the world (i.e., they refuse to make any prediction one way or the other). Basically, if you want your hypothesis to be agnostic about Thing X, you take every possible way that Thing X could be and put all of the corresponding probability distributions into your convex-set-of-probability-distributions. For example, the completely oblivious hypothesis (“I have no knowledge or opinion about anything whatsoever, leave me alone”) would be represented as the set of every probability distribution.

Then Vanessa & Diffractor did a bunch of work on the mathematical theory of these convex-sets-of-probability-distributions—How do you update them using Bayes’ rule? How do you use them to make decisions? etc. It all turns out to be very beautiful and elegant, but that’s way outside the scope of this post.
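For a tiny taste of what a "convex set of probability distributions" hypothesis can look like in code, here is a toy sketch with a world of just two bits. The hypothesis says "bit 1 is 0" and is agnostic about bit 2. The decision rule used here, scoring each action by its worst-case expected utility over the set, is my understanding of how infra-Bayesian agents choose, but treat it as an assumption of this sketch; the actions, payoffs, and function names are all invented for the example.

```python
# A world is a pair (bit1, bit2). The hypothesis "bit1 is 0, no opinion on bit2"
# is the convex set of all distributions with P(bit1 = 0) = 1. Expected utility is
# linear in the distribution, so its minimum over the set is reached at an extreme point.
EXTREME_POINTS = [
    {(0, 0): 1.0},  # bit 2 is certainly 0
    {(0, 1): 1.0},  # bit 2 is certainly 1
]


def expected_utility(dist, utility, action):
    """Expected utility of an action under one probability distribution over worlds."""
    return sum(p * utility(action, world) for world, p in dist.items())


def worst_case_value(extreme_points, utility, action):
    """Lowest expected utility of an action over the whole convex set."""
    return min(expected_utility(d, utility, action) for d in extreme_points)


def maximin_action(extreme_points, utility, actions):
    """Pick the action whose worst case is best."""
    return max(actions, key=lambda a: worst_case_value(extreme_points, utility, a))


def utility(action, world):
    """Hypothetical payoffs: betting on bit 2 pays more, but the hypothesis is silent on it."""
    bit1, bit2 = world
    if action == "bet_bit1_is_0":
        return 1.0 if bit1 == 0 else -1.0
    if action == "bet_bit2_is_0":
        return 2.0 if bit2 == 0 else -2.0
    return 0.0  # "abstain"


actions = ["bet_bit1_is_0", "bet_bit2_is_0", "abstain"]
choice = maximin_action(EXTREME_POINTS, utility, actions)
print(choice)
```

The agent happily bets on the thing its hypothesis actually speaks to, and declines the juicier bet on the thing its hypothesis is agnostic about, without ever needing a prior over bit 2.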

3.3 Formal guarantees: wrapping up

Importantly—and I’ll get back to this in Section 6—Vanessa doesn’t think of “algorithms with formal guarantees” as very strongly tied to AGI safety in particular—lots of people work on RL regret bounds who don’t give a fig about AGI existential risk, and indeed she sees this type of research as equally relevant to AGI capabilities. Nevertheless, her “traps” and “infra-Bayesianism” work is geared towards filling in a couple conceptual gaps that would otherwise prevent the kind of formal guarantees that we need for her path to AGI safety. (She has also done much more along these lines that I’m leaving out, e.g. formal guarantees when the reward information is not easily accessible, which is obviously relevant for running AIs off human feedback.)

4. Desiderata

As above, Vanessa’s two ingredients to AGI safety are (1) desiderata and (2) algorithms with formal (performance-related) guarantees that they will satisfy the desiderata. The previous section was about the guarantees, so now we switch over to the desiderata.

I don’t have much to say in general here. Instead let’s just work through an example, as an illustration of the kind of thing she has in mind.

4.1 Example: “the Hippocratic principle” desideratum, and an algorithm that obeys it

See Vanessa’s short post on this in March 2021. I’ll leave aside the math here and just copy her informal description of the desideratum:

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn’t make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user’s beliefs. … This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

After more precisely defining that idea, Vanessa then proposes an algorithm, which she calls Hippocratic Timeline-Driven Learning, which would satisfy that desideratum while still potentially outperforming humans. Here’s my intuitive description of how it works:

The AI is trying to help the human H. As in the Delegative RL discussion above (Section 3.2.1), AI & H share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action.

Meanwhile, H has a task that she’s trying to do, e.g. “solve the whole technical AGI safety problem”. And every step, H reports her current guess as to the “timeline” for eventual success or failure at the task, if she continues with no more AI help from that moment on. For example, maybe H periodically fills in a chart that says: “In the next 1 hour, I’m 0.01% likely to succeed and 0.01% likely to fail and 99.98% likely to be still working on it. And in the next 8 hours, I’m … And in the next month, I’m …”.

At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H’s own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!).

Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H’s perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself “Gee you dunce you’re gonna regret that”, and shortly thereafter the AI is proven correct.

OK, so now what? The naïve answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on.

But we don’t want to do that naïve thing. There are two problems:

The first problem is “falling into traps (a.k.a. catastrophes) in the course of random exploration”. Vanessa’s solution is Delegative RL (DRL) as discussed above. In the explore/​exploit dilemma, the AI kinda sits back and lets H do the exploring, at least when a trap might be involved.

But we’re not done: The second problem is that we’re not assuming that the human will always avoid traps, so we might fall into a trap even while sticking to what the human might do. Remember, we’re trying to follow the Hippocratic principle: the human might fail, and the AI might fail too, we just don’t want the AI to fail with higher probability than the human. For example, if we were doing DRL, and the AI delegated to H, and H took the opportunity to play Russian Roulette (and didn’t lose), we don’t want the AI to now feel free to sign H up for Russian Roulette over and over.

This is a tricky problem! Vanessa solves it with a variant on quantilization. If you haven’t heard of it, quantilization is kinda like “optimization but less so”. An optimizing agent might explore 1000 possible courses of action and pick the best. By contrast, a quantilizing agent might explore 1000 possible courses of action, pick the best 100 of those, and choose randomly from that subset.
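The optimizer-versus-quantilizer contrast in the paragraph above can be sketched in a few lines. This is a minimal sketch, assuming the candidates are just the numbers 0 to 999 and the score is the number itself; the function names are made up.

```python
import random


def optimize(candidates, score):
    """An optimizer: always pick the single highest-scoring candidate."""
    return max(candidates, key=score)


def quantilize(candidates, score, top_fraction=0.1, rng=random):
    """A quantilizer: pick uniformly at random among the top fraction of candidates."""
    ranked = sorted(candidates, key=score, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:cutoff])


def score(x):
    """Hypothetical score: the candidate's own value."""
    return x


candidates = list(range(1000))
rng = random.Random(0)
best = optimize(candidates, score)
pick = quantilize(candidates, score, top_fraction=0.1, rng=rng)
print(best, pick)
```

The optimizer always lands on the single most extreme candidate, which is exactly where a subtly wrong score function is most likely to lead you astray. The quantilizer gives up a little score in exchange for staying closer to "typical good options".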

So let’s say that while H might or might not be optimistic about eventual success in the task (this is a very difficult task after all!), at least she’s optimistic that she won’t fail soon. Maybe she assigns a 99% chance of not failing within the next hour. Therefore, assuming that H is well-calibrated with regard to traps, it follows that H’s own policy will lead to a trap in the next hour with ≤1% probability. (Maybe H figures she might play a round of Russian roulette in the next hour, but probably won’t.)

An optimizing AI might, for all we know, choose a policy with 100% chance of Russian roulette—a much worse policy than the human’s! But a quantilizing AI can do better than that. Let’s say the AI quantilizes by drawing 5 random samples from (its understanding of) H’s policy for the next hour (i.e. the probability distribution for what H might do without AI assistance). Then the AI takes the action that (it believes) will lead to the highest future-optimism-about-success, out of those 5 options. (Side note: I’m assuming in this description that the AI’s beliefs are accurate. In practice they won’t be, but we will have an upper bound on how inaccurate they are, which is good enough for the proof to go through.)

Compared to just delegating to H, this quantilized action has advantages and disadvantages. The disadvantage is: we can no longer say that the probability of falling into traps in that hour is definitely ≤1%. The probability could now be as high as 5% (each of the 5 samples has up to a 1% chance of being a trap, and in the worst case the AI picks the trap whenever one shows up), for example if H’s dangerous Russian roulette habit is misjudged by the AI as safe and desirable. The advantage is, in the other ≥95% of the probability mass, the AI is probably doing something much better than what H would have done. The AI can calculate both these things, and it can potentially prove that the advantages outweigh the disadvantages, i.e. that H’s success probability will be higher if the AI takes charge of the keyboard for the next hour rather than delegating to H. Then after that hour is up, we can just repeat this process.
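The arithmetic behind that trade-off can be sketched as follows. This is a toy under stated assumptions, not Vanessa's actual construction: samples from H's policy are treated as independent, and `best_of_n`, the action names, and the optimism scores are all hypothetical.

```python
import random


def trap_probability_bound(human_trap_prob, n_samples):
    """Union bound on the chance that at least one of n samples from H's policy is a trap."""
    return min(1.0, n_samples * human_trap_prob)


def exact_any_trap_probability(human_trap_prob, n_samples):
    """Same quantity computed exactly, assuming the samples are independent."""
    return 1.0 - (1.0 - human_trap_prob) ** n_samples


def best_of_n(sample_human_action, predicted_optimism, n_samples, rng):
    """Draw n actions from (a model of) H's policy; keep the one the AI rates highest."""
    samples = [sample_human_action(rng) for _ in range(n_samples)]
    return max(samples, key=predicted_optimism)


bound = trap_probability_bound(0.01, 5)
exact = exact_any_trap_probability(0.01, 5)
print(f"union bound: {bound:.4f}, exact under independence: {exact:.4f}")

# Hypothetical model of H's policy: mostly sensible work, rarely Russian roulette.
ACTIONS = ["write_proof", "read_paper", "russian_roulette"]
WEIGHTS = [0.59, 0.40, 0.01]
# Hypothetical AI beliefs, including the bad case where roulette is misjudged as great.
OPTIMISM = {"write_proof": 0.6, "read_paper": 0.4, "russian_roulette": 0.9}


def sample_human_action(rng):
    return rng.choices(ACTIONS, weights=WEIGHTS, k=1)[0]


chosen = best_of_n(sample_human_action, OPTIMISM.get, n_samples=5, rng=random.Random(0))
print(chosen)
```

With the misjudged optimism scores above, the AI picks Russian roulette whenever it shows up among the 5 samples, which is the worst case that the 5% figure is guarding against. The rest of the time it picks the best of several things H might plausibly have done anyway.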

5. One more thing: Non-Cartesian daemons

I left something out of the diagram above—here’s a better version, with a new thing on the right.

Algorithms generally have an “intended” /​ “Cartesian” output channel (e.g. the AI passes its outputs into an API that displays them on the computer monitor), and also “unintended” /​ “non-Cartesian” output channels (e.g. the AI might exploit an operating system bug to break out of its sandbox and directly send messages over the internet, or maybe it manipulates RAM to send radio signals, or whatever).

Vanessa imagines that both the desiderata and the formal guarantee will refer exclusively to the information that goes out the intended output channel. So we need some other method to rule out dangerous things going out of unintended output channels.

Vanessa discusses non-Cartesian daemons in her post The Learning-Theoretic AI Alignment Research Agenda. She suggests that in principle we can solve this problem by running the AI algorithm under homomorphic encryption. The problem is that homomorphic encryption is extremely computationally inefficient. She discusses other things too. Do click that link; I don’t have anything to add to her discussion. My complaints below will focus on the intended /​ Cartesian output channel, which I consider the harder part of the technical AGI safety problem.

6. Switching to “my take”

Everything so far has been mostly “Steve trying to get Vanessa’s perspective”. Now we change gears and talk more about my perspective, and why my perspective suggests that Vanessa’s perspective may be missing important things.

I’ll pause and say that while I think there is a substantive disagreement in here, there’s also probably some amount of this dynamic:

We can debate the exact percentages here, but the fact is that nobody today has a research plan that will definitely get us to safe AGI. God knows I don’t! So “probably won’t succeed” is to be strongly distinguished from “shouldn’t bother trying”. Vanessa is doing great work, and more power (& resources!) to her. And that’s even leaving aside the many obvious reasons that everyone should trust Vanessa’s instincts over mine, not to mention the thick fog of general uncertainty about what AGI algorithms can and will look like.

(I’ve previously speculated that much of my “disagreement” with proponents of “factored cognition” is like this as well.)

6.1 Algorithms-first vs Desiderata-first: Redux!

In an earlier section, I put this diagram:

If that diagram was all there was to it, then I think it wouldn’t be that big a difference—more like two methods of brainstorming a solution, and they basically wind up in the same place. After all, if you’re working on the left side of the diagram, and you manage to prove some theorem about your algorithm having some desirable property, well then you stamp the word “desideratum” onto the statement you just proved, and bam, now you’re on the right side of the diagram! Conversely, if you’re nominally working on the right side, in practice you’ll still have particular algorithms and failure modes in the back of your mind, to help you pick desiderata that are both sufficient and satisfiable.

Instead, I think there’s an important difference that’s not conveyed in the diagram above, and it’s related to Vanessa’s more specific assumptions about what the desiderata and guarantees will look like. Here’s my attempt to call out one important aspect of that:

The box at the bottom is a bunch of “weird” failure modes that are discussed almost exclusively by the community of people who work on AGI x-risk. The arrows describe “connection strength”—like, if you read a blog post in any of the upper categories, how many times per paragraph would that post explicitly refer to one of these failure modes?

And in this figure, unlike the above, we see a very clear difference between the two sides. These no longer look like two paths to the same place! Both paths could (with a whole lot of luck!) wind up proving that an AGI algorithm will be “safe” in some respect, but even if they did, the resulting proofs would not look the same. On the left, the proofs would involve many pages of explicit discussion about all these weird failure modes, and formalizations and generalizations of them, and how such-and-such algorithm will be provably immune to those failure modes. On the right, these weird failure modes would be barely mentioned!

Vanessa replies: I definitely find these weird failure modes useful for thinking whether my models truly capture everything they should capture. For example, I would often think how such-and-such approach would solve such-and-such failure mode to see whether I inadvertently assumed the problem away somewhere.

In particular, I think the way Vanessa imagines it is: (1) the “formal guarantees” part would have no explicit connection to those considerations, and (2) the desiderata would rule out these problems, but only via a funny indirect implicit route. For example, the instrumental convergence argument says that it is difficult to make advanced AI algorithms “corrigible”, since they tend to do things like seize control of their off-switches. A Vanessa-style indirect argument that an algorithm is corrigible might look like this:

Here’s the desideratum, i.e. the thing we want the algorithm to do, and if the algorithm does that or something very close to that, then it will be safe. And our regret bound proves that the algorithm will indeed do something very close to that thing. Will the algorithm seize control of its off-switch? No, because that would result in bad scores on the objective, in violation of our proven regret bound. Bam, we have now solved the corrigibility problem!!

I find this suspicious. Whenever I think about any actual plausibly-AGI-capable algorithm (brain-like algorithms, deep neural networks, rational agents), I keep concluding that making it provably corrigible is a really hard problem. Here we have this proposed solution, in which, ummm, well … I dunno, it feels like corrigibility got magically pulled out of a hat. It’s not that I can point to any particular flaw in this argument. It’s that it seems too easy! It leaves me suspicious that the hard part of the corrigibility problem has been swept under the rug, somehow. That I’ve been bamboozled! “Pay no attention to that man behind the curtain!”

Vanessa replies: Let me give an analogy. In cryptography, we have theorems saying that if such-and-such mathematical assumption holds (e.g. X is a one-way function) then a particular protocol is sound. We don’t need to list all possible ways an attacker might try to break the protocol: we get safety from any possible attack! (within the assumptions of the model) Is this “too easy”? I don’t think so: it requires a lot of hard work to get there, and we’re still left with assumptions we don’t know how to prove (but we do have high confidence in). Similarly, we’re going to need a lot of hard work to get safe protocols for AGI.

But that’s not very helpful. Can I be more specific?

One thing is: the CS literature is full of regret bound proofs (including some by Vanessa), and if you look at any of them, they look like bog-standard statistical learning theory. They don’t talk about wireheading and treacherous turns and all that weird stuff. As we improve our AI algorithms, Vanessa assumes that we’ll just see this pattern continue—same “genre” of proof, but applied to the ever-more-capable learning algorithms in the future. Whereas my hunch, as discussed below, is that this pattern will change: For future AGI-capable learning algorithms, if we can prove regret bounds at all (a big “if”!), then the proofs would necessarily have to go on and on about all the weird failure modes that x-risk people talk about.

Why? Suppose that (to take another example) the problem of ontological crises is real, and suppose that it is only solvable (in some class of advanced AI) with a specific “ontology-controlling algorithm feature” (something something model splintering symbol-grounding blah blah). And suppose that the manifestation of this failure mode is poor performance on the objective. (It could also manifest in other ways, like “its outputs manipulate the programmer into changing the objective, or ending the training”, but in the desideratum/​guarantee framework, we would presumably somehow re-define the objective to capture whatever the problem is.) In that case, there can be no general proof that this type of AI will satisfy a good regret bound. There can only be a proof that this type of AI will satisfy a good regret bound when the AI incorporates the “ontology-controlling algorithm feature”. And therefore the proof will necessarily involve an explicit discussion of that feature, and hence an explicit discussion of the ontological crisis problem.

To make this more concrete and helpful, I want to drill down a bit deeper into how these failure modes might manifest, and relate that to Vanessa’s picture. I’ll take the “Hippocratic Timeline-Driven Learning” approach above (Section 4.1) as an example.

Before we start, I want to avoid a potential point of confusion from two types of “outputs” in this protocol:

Distinguishing two categories of “outputs” in Hippocratic Timeline-Driven Learning, for upcoming discussion

On the left is the sophisticated AI with deep knowledge of the world etc. From its perspective, there is a pair of outputs: (1) a probability distribution for what course of action the human will take next (or is it a probability distribution of probability distributions? Whatever, doesn’t matter here.), and (2) conditional on the human taking a certain course of action, a probability distribution for the human’s later assessment of success. Creating each of these two outputs is a supervised learning problem—we get the ground truth in hindsight. These intermediate outputs then go into Vanessa’s simple short-term-quantilizer algorithm, which then takes an action. From the system’s perspective, that’s the “actual output”, i.e. the output going to the user, and helping them or hurting them or manipulating them or whatever.

6.2 Relation to inner misalignment

So far, we have a sophisticated AI which produces a pair of outputs. Creating those outputs is a supervised learning problem, and if it does well on the problem (i.e. if the pair of outputs is accurate), then AFAICT the system is totally safe (based on Vanessa’s arguments above, and assuming homomorphic encryption to deal with non-Cartesian daemons as discussed in Section 5 above). Specifically, we can get superhuman outputs while remaining totally sure that the outputs are not dangerous or manipulative, or at least no more dangerous and manipulative than the human working without an AI assistant.

So I would describe this system as being outer aligned: it has an objective function (accuracy of those two outputs), and if it scores sufficiently well on that objective, then it’s doing something we want it to do. (And it’s an especially cool kind of outer alignment, in that it’s not just safe in the limit of perfect performance, but also allows us to prove safety in the boundedly-imperfect case, e.g. “if we can prove that regret is ≤0.01, then we can prove that the algorithm is safe”.)

The sibling of outer alignment is inner alignment. “Inner alignment” is a concept that only applies to algorithms that are “trying to do” something—usually in the sense of consequentialist planning. Then “inner alignment” is when the thing they’re “trying to do” is to maximize the objective function, and “inner misalignment” is when they’re “trying to do” something else.

The Vanessa philosophy, I think, is to say “I don’t care whether the algorithm is ‘inner aligned’ or not. I’m going to prove a regret bound, and then I know that the system is outputting the right thing. That’s all I need! That’s all I care about!”

And then my counterpoint would be: “If there’s a class of algorithms with an inner alignment problem, then you can try to prove regret bounds on those algorithms, but you will fail, until you grapple directly with the inner alignment problem.”

Before proceeding further, just as background, I’ll note that there are two versions of the inner misalignment problem:

  • Mesa-optimizers: In the more famous version, there’s a big black box, and consequentialist planning appears inside that box as a result of gradient descent on the whole black box. Here, the consequentialist planner (a.k.a. “mesa-optimizer”) may wind up with the wrong goal because this whole thing is being designed by gradient descent with no human intervention, and it’s prone to set things up so as to optimize a proxy to the objective, for example.

  • Steered optimizers: In this other version (more like model-based RL; see here or here), humans program a consequentialist planning framework into an AI, but one ingredient in that framework is a big complicated unlabeled world-model that’s learned from data. Here, the consequentialist planner (a.k.a. “steered optimizer”) may wind up with the wrong goal, because its goal is defined in its world-model, and our intended goal is not, at least not initially. In order to push that goal into the AI’s world-model, we need to write code to solve the credit-assignment /​ symbol-grounding problem, which typically involves heuristics that don’t work perfectly, among other problems.

(The former is analogous to the development of a human brain over evolutionary time, i.e. the “outer objective” was “maximize inclusive genetic fitness”. The latter is analogous to the within-lifetime learning algorithm involving the neocortex, which is “steered” by reward and other signals coming from the brainstem and hypothalamus, i.e. the “outer objective” is some horribly complicated genetically-hardcoded function related to blood glucose levels and social status and sexual intercourse and thousands of other things. Neither of these “outer objectives” is exactly the same as “what a human is trying to do”, which might be “use contraceptives” or “get out of debt” or whatever.)

6.3 Why is the algorithm “trying to do” anything? What’s missing from the (infra)Bayesian perspective?

The (infra)Bayesian perspective is basically this: we have a prior (set of hypotheses and corresponding credences) about how the world is, and we keep Bayesian-updating the prior on new evidence, and we gradually wind up with accurate beliefs. Maybe the hypotheses get “refined” sometimes too (i.e. they get split into sub-hypotheses that make predictions in some area where the parent hypothesis was agnostic).

There seems to be no room in this Bayesian picture for inner misalignment. This is just a straightforward mechanical process. In the simplest case, there’s literally just a list of numbers (the credence for each hypothesis), and with each new observation we go through and update each of the numbers in the list. There does not seem to be anything “trying to do” anything at all, inside this type of algorithm.
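In the simplest (plain Bayesian, not infra-Bayesian) case, that “list of numbers” update might look like the sketch below. The coin-bias hypotheses and the function names are made up purely for illustration.

```python
from typing import List, Sequence


def bayes_update(credences: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """One Bayesian update: multiply each credence by the likelihood that its
    hypothesis assigns to the new observation, then renormalize."""
    unnormalized = [c * l for c, l in zip(credences, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return [u / total for u in unnormalized]


def update_on_coin_flips(
    credences: Sequence[float],
    heads_probs: Sequence[float],
    flips: str,
) -> List[float]:
    """Each hypothesis is 'the coin lands heads with probability p'.
    `flips` is a string of 'H' and 'T' observations, processed in order."""
    current = list(credences)
    for flip in flips:
        likelihoods = [p if flip == "H" else 1.0 - p for p in heads_probs]
        current = bayes_update(current, likelihoods)
    return current


if __name__ == "__main__":
    heads_probs = [0.2, 0.5, 0.8]   # three hypotheses about the coin
    prior = [1 / 3, 1 / 3, 1 / 3]   # uniform starting credences
    print(update_on_coin_flips(prior, heads_probs, "HHTH"))
```

Nothing in this loop plans, explores, or pursues subgoals; it just rescales numbers.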

Vanessa replies: Quite the contrary. Inner misalignment manifests as the possibility of malign hypotheses in the prior. See also Formal Solution to the Inner Alignment Problem and discussion in comment section there.

So if we can get this kind of algorithm to superhuman performance, it might indeed have no inner misalignment problem. But my belief is that we can’t. I suspect that this kind of algorithm is doomed to hit a capability ceiling, far below human level in the ways that count.

Remember, the goal in Hippocratic Timeline-Driven Learning is that the algorithm predicts the eventual consequences of an action, better than the human can. Let’s say the human is writing a blog post about AI alignment, and they’re about to start typing an idea. For the AI to succeed at its assigned task, the AI needs to have already, independently, come up with the same original idea, and worked through its consequences.

To reach that level of capability, I think we need a different algorithm class, one that involves more “agency” in the learning process. Why?

Let’s compare two things: “trying to get a good understanding of some domain by building up a vocabulary of concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common!

  • In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there’s some analogy between this and that?”, or you could think the thought “that’s a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities.

  • In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track except in hindsight.

  • And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful.

Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space /​ priors /​ world-model) are the thing that gets rewarded. The human brain certainly has that. You can be lying in bed motionless, and have rewarding thoughts, and aversive thoughts, and new ideas that make you rethink something you thought you knew.

Vanessa replies: I agree. That’s exactly what I’m trying to formalize with my Turing RL setting. Certainly I don’t think it invalidates my entire approach.

(As above, the AI’s RL-on-thoughts algorithm could be coded up by humans, or it could appear within a giant black-box algorithm trained by gradient descent. It doesn’t matter for this post.)
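As a very rough toy illustration of the human-coded version of “RL-on-thoughts”, here is an epsilon-greedy bandit that learns which kinds of “thought” tend to pay off. Everything in it is an assumption of mine: the thought names, the `rl_on_thoughts` function, and especially the payoff functions, which stand in for “how much did this thought improve my understanding?” (something that, in reality, is only known in hindsight). A bandit also ignores the fact that each thought changes which thoughts are available next, so this captures only the “try things and learn what’s fruitful” part of the analogy.

```python
import random
from typing import Callable, Dict


def rl_on_thoughts(
    thought_payoff: Dict[str, Callable[[random.Random], float]],
    n_steps: int = 200,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Dict[str, float]:
    """Epsilon-greedy bandit over kinds of 'thought'.
    Returns the learned average payoff of each kind of thought."""
    rng = random.Random(seed)
    value: Dict[str, float] = {}
    count: Dict[str, int] = {}

    # Try every kind of thought once, so each has an initial value estimate.
    for thought, payoff in thought_payoff.items():
        value[thought] = payoff(rng)
        count[thought] = 1

    for _ in range(n_steps):
        if rng.random() < epsilon:
            thought = rng.choice(list(value))        # explore a random thought
        else:
            thought = max(value, key=lambda t: value[t])  # exploit the best so far
        reward = thought_payoff[thought](rng)
        count[thought] += 1
        # Incremental running average of the reward seen for this thought.
        value[thought] += (reward - value[thought]) / count[thought]
    return value


if __name__ == "__main__":
    # Hypothetical noisy payoffs for three kinds of thought.
    payoffs = {
        "try_analogy": lambda rng: rng.gauss(0.2, 0.1),
        "generalize_pattern": lambda rng: rng.gauss(0.8, 0.1),
        "daydream": lambda rng: rng.gauss(0.0, 0.1),
    }
    print(rl_on_thoughts(payoffs))
```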

Unfortunately, I also believe that RL-on-thoughts is really dangerous by default. Here’s why.

Again suppose that we want an AI that gets a good understanding of some domain by building up a vocabulary of concepts and their relations. As discussed above, we do this via an RL-on-thoughts AI. Consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level:

  • Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transformations. For this to happen, we need “understand symplectic transformations” to be flagged as a temporary sub-goal, and to be pursued, and we want it to be able to spawn further sub-sub-goals and so on.

  • Consequentialist planning—Relatedly, we want the AI to be able to summon and re-read a textbook on linear algebra, or mentally work through an example problem, because it anticipates that these activities will lead to better understanding of the target domain.

  • Meta-cognition—We want the AI to be able to learn patterns in which of its own “thoughts” lead to better understanding and which don’t, and to apply that knowledge towards having more productive thoughts.

Putting all these things together, it seems to me that the default for this kind of AI would be to figure out that “seizing control of its off-switch” would be instrumentally useful for it to do what it’s trying to do (i.e. develop a better understanding of the target domain, presumably), and then to come up with a clever scheme to do so, and then to do it. So like I said, RL-on-thoughts seems to me to be both necessary and dangerous.

Where do Vanessa’s delegative RL (DRL) ideas fit in here? Vanessa’s solution to “traps”, as discussed above, is to only do things that a human might do. But she proposes to imitate human outputs, not imitate the process by which a human comes to understand the world. (See here—we were talking about how the AI would have “a superior epistemic vantage point” compared to the human.) Well, why not build a world-model from the ground up by imitating how humans learn about the world? I think that’s possible, but it’s a very different research program, much closer to what I’m doing than to what Vanessa is doing.

6.4 Upshot: I still think we need to solve the (inner) alignment problem

As mentioned above, in the brain-like “steered optimizer” case I normally think about (and the “mesa-optimizer” case winds up in a similar place too), there’s a giant unlabeled world-model data structure auto-generated from data, and some things in the world-model wind up flagged as “good” or “bad” (to oversimplify a bit—see here for more details), and the algorithm makes and evaluates and executes plans that are expected to lead to the “good” things and avoid the “bad” things.

“Alignment”, in this context, would refer to whether the various “good” and “bad” flags in the world-model data structure are in accordance with what we want the algorithm to be trying to do.

Note that “alignment” here is not a performance metric: Performance is about outputs, whereas these “good” and “bad” flags (a.k.a. “desires” or “mesa-objectives”) are upstream of the algorithm’s outputs.
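Here is a deliberately oversimplified sketch of that picture, to show why the flags and the outputs are different things. The `WorldModel` class, the concept names, and the plans are all hypothetical; a real learned world-model would be a giant unlabeled data structure, not a tidy dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set


@dataclass
class WorldModel:
    # concept -> valence flag (+1 = "good", -1 = "bad", missing = unflagged)
    valence: Dict[str, float] = field(default_factory=dict)
    # plan name -> concepts the model predicts the plan will bring about
    predicted_outcomes: Dict[str, List[str]] = field(default_factory=dict)


def plan_score(model: WorldModel, plan: str) -> float:
    """Sum the valence flags of everything the plan is predicted to cause."""
    return sum(model.valence.get(c, 0.0) for c in model.predicted_outcomes[plan])


def choose_plan(model: WorldModel, plans: Sequence[str]) -> str:
    """The 'output': pick the available plan with the best predicted score."""
    return max(plans, key=lambda p: plan_score(model, p))


def misflagged_concepts(model: WorldModel, intended: Dict[str, float]) -> Set[str]:
    """'Alignment' check: compare the flags themselves to what we intended,
    without looking at any outputs."""
    return {c for c, v in intended.items() if model.valence.get(c, 0.0) != v}


OUTCOMES = {
    "idle": [],
    "do_task": ["task_done", "human_approves"],
    "disable_switch_then_do_task": ["off_switch_disabled", "task_done", "human_approves"],
}
INTENDED = {"task_done": 1.0, "human_approves": 1.0, "off_switch_disabled": -1.0}

aligned = WorldModel(valence=dict(INTENDED), predicted_outcomes=OUTCOMES)
misaligned = WorldModel(
    valence={"task_done": 1.0, "human_approves": 1.0, "off_switch_disabled": 1.0},
    predicted_outcomes=OUTCOMES,
)
```

When only `idle` and `do_task` are available, both models produce the same output, so no output-based metric can tell them apart. The difference in flags only shows up in behavior once `disable_switch_then_do_task` becomes an available plan.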

Proving theorems about algorithms being aligned, and remaining aligned during learning and thinking, seems to me quite hard. But not necessarily impossible. I think we would need some kind of theory of symbol-grounding, along with careful analysis of tricky things like ontological crises, untranslatable abstract concepts, self-reflection, and so on. So I don’t see Vanessa’s aspirations to prove theorems as a particularly important root cause of why her research feels so different from mine. I want the same thing! (I might not get it, but I do want it…)

And also as discussed above, I don’t think that “desiderata-first” vs “algorithms-first” is a particularly important root cause of our differences either.

Instead, I think the key difference is that she’s thinking about a particular class of algorithms (things vaguely like “Bayesian-updating credences on a list of hypotheses”, or fancier elaborations on that), and this class of algorithms seems to have a simple enough structure that the algorithms lack internal “agency”, and hence (at least plausibly) have no inner alignment problems. If we can get superhuman AGI out of this class of algorithms, hey that’s awesome, and by all means let’s try.

Vanessa objects to my claim that “she’s thinking about a particular class of algorithms”: I’m not though. I’m thinking about a particular class of desiderata.

However, of the two most intelligent algorithms we know of, one of them (brains) definitely learns a world-model in an agent-y, goal-seeking, “RL-on-thoughts” way, and the other (deep neural nets) at least might start doing agent-y things in the course of learning a world-model. That fact, along with the video game analogy discussion above (Section 6.3), makes me pessimistic about the idea that the kinds of non-agent-y algorithms that Vanessa has in mind will get us to human-level concept-inventing ability, or indeed that there will be any straightforwardly-safe way to build a really good prior. I think that building a really good prior takes much more than just iteratively refining hypotheses or whatever. I think it requires dangerous agent-y goal-seeking processes. So we can’t skip past that part, and say that the x-risks arise exclusively in the subsequent step of safely using that prior.

Instead, my current perspective is that building a really good prior (which includes new useful concepts that humans have never thought of, in complex domains of knowledge like e.g. AGI safety research itself) is almost as hard as the whole AGI safety problem, and in particular requires tackling head-on issues related to value-loading, ontological crises, wireheading, deception, and all the rest. The agent-y algorithms that can build such priors may have no provable regret bounds at all, or if they do, I think the proof of those regret bounds would necessarily involve a lot of explicit discussion of these alignment-type topics.

So much for regret bounds. What about desiderata? While a nice clean outer-aligned objective function, like the example in Section 4.1 above, seems possibly helpful for AGI safety, I currently find it somewhat more likely that after we come up with techniques to solve these inner alignment problems, we’ll find that having a nice clean outer-aligned objective function is actually not all that helpful. Instead, we’ll find that the best approach (all things considered) is to use those same inner-alignment-solving techniques (whatever they are) to solve inner and outer alignment simultaneously. In particular, there seem to be approaches that would jump all the way from “human intentions” to “the ‘good’ & ‘bad’ flags inside the AGI’s world-model” in a single step—things like interpretability, and corrigible motivation, and “finding the ‘human values’ concept inside the world-model and manually flagging it as good”. Hard to say for sure; that’s just my hunch.