[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes, 11 May 2022 13:17 UTC

(Last revised: January 2026. See changelog at the bottom.)

14.1 Post summary / Table of contents

Part of the “Intro to brain-like-AGI safety” post series.

Post #12 suggested two paths forward for solving “the alignment problem” for brain-like AGI, which I called “Social-instinct AGI” and “Controlled AGI”. Then Post #13 went into more detail about (one aspect of) “Social-instinct AGI”. And now, in this post, we’re switching over to “Controlled AGI”.

If you haven’t read Post #12, don’t worry, the “Controlled AGI” research path is nothing fancy—it’s merely the idea of solving the alignment problem in the most obvious way possible:

The “Controlled AGI” research path:

Step 1 (out-of-scope for this series): We decide what we want our AGI’s motivation to be. For example, that might be:
- “Invent a better solar cell without causing catastrophe” (task-directed AGI),
- “Be a helpful assistant to the human supervisor” (corrigible AGI assistants),
- “Fulfill the human supervisor’s deepest life goals” (ambitious value learning),
- “Maximize coherent extrapolated volition”,
- or whatever else we choose.
Step 2 (subject of this post): We make an AGI with that motivation.

This post is about Step 2, whereas Step 1 is out-of-scope for this series. Honestly, I’d be ecstatic if we figured out how to reliably set the AGI’s motivation to any of those things I mentioned under Step 1.

Unfortunately, I don’t know any good plan for Step 2, and (I claim) nobody else does either. But I do have some vague thoughts and ideas, and I will share them here, in the spirit of brainstorming.

If you’re in a hurry and want to read a shorter and self-contained version of my least-bad proposed plan for Step 2, check out my separate post: Plan for mediocre alignment of brain-like [model-based RL] AGI (2023), which basically puts together the most obvious ideas mentioned in §14.2 and §14.3 into an end-to-end framework. I think that plan passes the low bar of “as far as I know, it might turn out OK”. (Well, I think I’m mildly skeptical, but I go back and forth, and I’m not sure how to pin it down with more confidence.) But obviously, we should be aiming higher than that! With stakes so high, we should really be starting from “there’s a strong reason to expect the plan to work, if carefully implemented”. And then we can start worrying about what can go wrong in the implementation. So we clearly still have work to do.

This post is not meant to be a comprehensive overview of the whole problem, just what I see as the most urgent missing ingredients.

Out of all the posts in the series, this post is the hands-down winner for “most lightly-held opinions”.

Table of contents:

§14.2 discusses what we might use as “Thought Assessors” in an AGI. If you’re just tuning in, Thought Assessors were defined in Posts #5–#6 and have been discussed throughout the series. If you have a Reinforcement Learning background, think of Thought Assessors as the components of a multi-dimensional value function. If you have a “being a human” background, think of Thought Assessors as learned functions that trigger visceral reactions (aversion, cortisol-release, etc.) based on the thought that you’re consciously thinking right now. In the case of brain-like AGIs, we get to pick whatever Thought Assessors we want, and I propose three categories for consideration: Thought Assessors oriented towards safety (e.g. “this thought / plan involves me being honest”), Thought Assessors oriented towards accomplishing a task (e.g. “this thought / plan will lead to better solar cell designs”), and Thought Assessors oriented purely towards interpretability (e.g. “this thought / plan has something to do with dogs”).
§14.3 discusses how we might generate supervisory signals to train those Thought Assessors. Part of this topic is what I call the “first-person problem”, namely the open question of whether it’s possible to take third-person labeled data (e.g. a YouTube video where Alice deceives Bob), and transmute it into a first-person preference (an AGI’s desire to not, itself, be deceptive).
§14.4 discusses the problem that the AGI will encounter “edge cases” in its preferences—plans or places where its preferences become ill-defined or self-contradictory. I’m cautiously optimistic that we can build a system that monitors the AGI’s thoughts and detects when it encounters an edge case. However, I don’t have any good idea about what to do when that happens. I’ll discuss a few possible solutions, including “conservatism”, and a couple different strategies for what Stuart Armstrong calls Concept Extrapolation.
§14.5 discusses the open question of whether we can rigorously prove anything about an AGI’s motivations. Doing so would seem to require diving into the AGI’s predictive world-model (which would probably be a multi-gigabyte, unlabeled (§2.7) data structure), and proving things about what the components of the world-model “mean”. I’m rather pessimistic about our prospects here, but I’ll mention possible paths forward, including John Wentworth’s “Natural Abstraction Hypothesis” research program (most recent update here).
§14.6 concludes with my overall thoughts about our prospects for “Controlled AGIs”. I’m currently a bit stumped and pessimistic about our prospects for coming up with a good plan, but hope I’m wrong and intend to keep thinking about it. I also note that a mediocre, unprincipled approach to “Controlled AGIs” (as in my “plan for mediocre alignment of brain-like AGI” post) would not necessarily cause a world-ending catastrophe—I think it’s hard to say.

14.2 Three categories of AGI Thought Assessors

As background, here’s our usual diagram of motivation in the human brain, from Post #6:

And here’s the modification for AGI, from Post #8:

On the center-right side of the diagram, I crossed out the words “cortisol”, “sugar”, “goosebumps”, etc. These correspond to the set of human innate visceral reactions which can be involuntarily triggered by thoughts (see Post #5).

(In machine learning terms, think of these as like the components of a multidimensional value function, as in multi-objective / multi-criteria reinforcement learning; or they can also be akin to the “pseudo” / “general” (non-reward-related) value functions of “Horde” (Sutton et al. 2011) and related algorithms.)

Clearly, things like cortisol, sugar, and goosebumps are the wrong Thought Assessors for our future AGIs. But what are the right ones? Well, we’re the programmers! We get to decide!

I have in mind three categories to pick from. I’ll talk about how they might be trained (i.e., supervised) in §14.3 below.

14.2.1 Safety & corrigibility Thought Assessors

Example thought assessors in this category:

1. This thought / plan involves me being helpful.
2. This thought / plan does not involve manipulating my own learning process, code, or motivation systems.
3. This thought / plan does not involve deceiving or manipulating anyone.
4. This thought / plan does not involve anyone getting hurt.
5. This thought / plan involves following human norms, or more generally, doing things that an ethical human would plausibly do.
6. This thought / plan is “low impact” (according to human common sense).
…

Arguably (cf. this Paul Christiano post), #1 is enough, and subsumes the rest. But I dunno, I figure it would be nice to have information broken down on all these counts, allowing us to change the relative weights in real time (§9.7), and perhaps giving an additional measure of safety.

Items #2–#3 are there because those are especially probable and dangerous types of thoughts—see discussion of Instrumental Convergence in §10.3.2.

Item #5 is a bit of a catch-all for the AGI finding weird out-of-the-box solutions to problems, i.e. it’s my feeble attempt to mitigate the so-called “Nearest Unblocked Strategy problem”. Why might it mitigate the problem? Because pattern-matching to “things that an ethical human would plausibly do” is a bit more like a whitelist than a blacklist. I still don’t think that would work on its own, don’t get me wrong, but maybe it would work in conjunction with the various other ideas in this post.

Before you jump into loophole-finding mode (“lol an ethical human would plausibly turn the world into paperclips if they’re under the influence of alien mind-control rays”), remember (1) these are meant to be implemented via pattern-matching to previously-seen examples (§14.3 below), not literal-genie-style following the exact words of the text; (2) we would hopefully also have some kind of out-of-distribution detection system (§14.4 below) to prevent the AGI from finding and exploiting weird edge-cases in that pattern-matching process. That said, as we’ll see, I don’t quite know how to do either of those two things, and even if we figure it out, I don’t have an airtight argument that it would be sufficient to get the intended safe behavior.

14.2.2 Task-related Thought Assessors

Example thought assessors in this category:

This thought / plan will lead to a reduction in global warming
This thought / plan will lead to a better solar panel design
This thought / plan will lead to my supervisor becoming fabulously rich
…

This kind of thing is why we built the AGI—what we actually want it to do. (Assuming task-directed AGI for simplicity.)

Basing a motivation system on these kinds of assessments by themselves would be obviously catastrophic. But maybe if we use these as motivations, in conjunction with the previous category, it will be OK. For example, imagine the AGI can only think thoughts that pattern-match to “I am being helpful” AND pattern-match to “there will be less global warming”.

That said, I’m not sure we want this category at all. Maybe the “I am being helpful” Thought Assessor by itself is sufficient. After all, if the human supervisor is trying to reduce global warming, then a helpful AGI would produce a plan to reduce global warming. That’s kinda the approach advocated by Paul Christiano (2017), I think.

14.2.3 “Ersatz interpretability” Thought Assessors

(See §9.6 for what I mean by “Ersatz interpretability”.)

As discussed in Posts #4–#5, each thought assessor is a model trained by supervised learning. Certainly, the more Thought Assessors we put into the AGI, the more computationally expensive it will be. But how much more? It depends. For example, I think the “valence” Thought Assessor in the human brain involves orders of magnitude more neurons than the “salivation” Thought Assessor. On the other hand, I think the “valence” Thought Assessor is far more accurate as a result. Anyway, as far as I know, it’s not impossible that we can put in $10^{7}$ Thought Assessors, and they’ll work well enough, and this will only add 1% to the total compute required by the AGI. I don’t know. So I’ll hope for the best and take the More Dakka approach: let’s put in 30,000 Thought Assessors, one for every word in the dictionary:

This thought / plan has something to do with AARDVARK
This thought / plan has something to do with ABACUS
This thought / plan has something to do with ABANDON
… … …
This thought / plan has something to do with ZOOPLANKTON

I expect that ML-savvy readers will be able to immediately suggest much-improved versions of this scheme—including versions with even more dakka—that involve things like contextual word embeddings and language models and so on. As one example, if we buy out and open-source Cyc (more on which below), we could use its hundreds of thousands of human-labeled concepts.

14.2.4 Combining Thought Assessors into a reward function

For an AGI to judge a thought / plan as being good, we’d like all the safety & corrigibility Thought Assessors from §14.2.1 to have as high a value as possible, and we’d like the task-related Thought Assessor from §14.2.2 (if we’re using one) to have as high a value as possible.

(The outputs of the interpretability Thought Assessors from §14.2.3 are not inputs to the AGI’s reward function, or indeed used at all in the AGI, I presume. I was figuring that they’d be silently spit out to help the programmers do debugging, testing, monitoring, etc.)

So the question is: how do we combine this array of numbers into a single overall score that can guide what the AGI decides to do?

A probably-bad answer is “add them up”. We don’t want the AGI going with a plan that performs catastrophically badly on all but one of the safety-related Thought Assessors, but so astronomically well on the last one that it makes up for it.

Instead, I imagine we’ll want to apply some kind of nonlinear function with strongly diminishing returns, and/or maybe even acceptability thresholds, before adding up the Thought Assessors into an overall score.

I don’t have much knowledge or opinion about the details. But there is some related literature on “scalarization” of multi-dimensional value functions—see here for some references.
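
To make the contrast with “add them up” concrete, here is a toy Python sketch of one possible scalarization. Everything in it is an assumption for illustration: the function names `squash` and `combine`, the choice of `tanh` as the diminishing-returns function, and the particular threshold are all made up, and I am not claiming this is the right way to do it.

```python
import math


def squash(score, scale=1.0):
    """Diminishing returns: map any real-valued score into (-1, 1).

    tanh is an arbitrary illustrative choice; any saturating function would do.
    """
    return math.tanh(score / scale)


def combine(safety_scores, task_score=None, threshold=0.0):
    """Combine Thought Assessor outputs into a single overall score.

    Hypothetical scheme: any safety score below `threshold` vetoes the plan
    outright (acceptability threshold); otherwise each score is squashed
    before summing, so no single assessor can dominate the total.
    """
    if any(s < threshold for s in safety_scores):
        return float("-inf")
    total = sum(squash(s) for s in safety_scores)
    if task_score is not None:
        total += squash(task_score)
    return total


# A plan that is mildly good on every safety assessor...
balanced_plan = [1.0, 1.0, 1.0]
# ...versus one that is terrible on two but astronomically good on the third.
lopsided_plan = [-5.0, -5.0, 1000.0]

naive_prefers_lopsided = sum(lopsided_plan) > sum(balanced_plan)
combined_prefers_balanced = combine(balanced_plan) > combine(lopsided_plan)
```

In this toy example, plain addition prefers the lopsided plan, whereas the thresholded and squashed version prefers the balanced one. Even with the threshold disabled, the squashing alone caps how much the one astronomically high score can contribute.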

14.3 Supervising the Thought Assessors, and the “first-person problem”

Recall from Posts #4–#6 that the Thought Assessors are trained by supervised learning. So we need a supervisory signal—what I labeled “ground truth in hindsight” in the diagram at the top.

I’ve talked about how the brain generates ground truth in numerous places, e.g. §3.2.1, Posts #7 & #13. How do we generate it for the AGI?

Well, one obvious possibility is to have the AGI watch YouTube, with lots of labels throughout the video for when we think the various Thought Assessors ought to be active. Then when we’re ready to send the AGI off into the world to solve problems, we turn off the labeled YouTube videos, and simultaneously freeze the Thought Assessors (= set the error signals to zero) in their current state. Well, I’m not sure if that would work; maybe the AGI has to go back and watch more labeled YouTube videos from time to time, to help the Thought Assessors keep up as the AGI’s world-model grows and changes.
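
As a minimal sketch of what “freeze the Thought Assessors (= set the error signals to zero)” could mean mechanically, here is a toy linear assessor trained with a simple delta rule. The class name `ThoughtAssessor`, the linear form, and the `features` input (standing in for whatever context the assessor reads from the Thought Generator) are all illustrative assumptions, not a claim about how a real brain-like AGI would implement this.

```python
class ThoughtAssessor:
    """Toy supervised learner: predicts a label from a feature vector."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.learning_rate = learning_rate
        self.frozen = False  # set True at deployment to stop all learning

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, label):
        """One supervised step; `label` plays the role of ground truth in hindsight."""
        # Freezing is implemented by forcing the error signal to zero,
        # which leaves the weights exactly where they were.
        error = 0.0 if self.frozen else label - self.predict(features)
        self.weights = [
            w + self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        return error


# Training phase: labeled examples (here a single repeated one) shape the assessor.
assessor = ThoughtAssessor(n_features=2)
for _ in range(50):
    assessor.update([1.0, 0.0], label=1.0)

# Deployment phase: labels are gone, and the assessor no longer changes.
assessor.frozen = True
```

Going back to “watch more labeled videos from time to time” would then amount to flipping `frozen` off again for a while.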

One potential shortcoming of this approach is related to first-person versus third-person concepts. We want the AGI to have strong preferences about aspects of first-person plans—hopefully, the AGI will see “I will lie and deceive” as bad, and “I will be helpful” as good. But we can’t straightforwardly get that kind of preference from the AGI watching labeled YouTube videos. The AGI will see YouTube character Alice deceiving YouTube character Bob, but that’s different from the AGI itself being deceptive. And it’s a very important difference! Consider:

If you tell me “my AGI dislikes being deceptive”, I’ll say “good for you!”.
If you tell me “my AGI dislikes it when people are deceptive”, I’ll say “for god’s sake you better shut that thing off before it escapes human control and kills everyone”!!!

It sure would be great if there were a way to transform third-person data (e.g. a labeled YouTube video of Alice deceiving Bob) into an AGI’s first-person preferences (“I don’t want to be deceptive”). I call this the first-person problem.

How do we solve the first-person problem? I’m not entirely sure. I wrote my “Intuitive Self-Models” series (2024) partly as a giant rabbit hole trying to figure it out, and now at least have a vague idea (see §8.6.1 of that series), but little hope that it would actually work.

If the first-person problem is not solvable, we need to instead use the scary method of allowing the AGI to take actions, and putting labels on those actions. Why is that scary? First, because those actions might be dangerous. Second, because it doesn’t give us any good way to distinguish (for example) “the AGI said something dishonest” from “the AGI got caught saying something dishonest”. Conservatism and/or concept extrapolation (§14.4 below) could help with that “getting caught” problem—maybe we could manage to get our AGI both motivated to be honest and motivated to not get caught, and that could be good enough—but it still seems fraught for various reasons.

14.3.1 Side note: do we want first-person preferences?

I suspect that “the first-person problem” is intuitive for most readers. But I bet a subset of readers feel tempted to say that the first-person problem is not in fact a problem at all. After all, in the realm of human affairs, there’s a good argument that we could use a lot fewer first-person preferences!

The opposite of first-person preferences would be “impersonal consequentialist preferences”, wherein there’s a future situation that we want to bring about (e.g. “awesome post-AGI utopia”), and we make decisions to try to bring that about, without particular concern over what I-in-particular am doing. Indeed, too much first-person thinking leads to lots of things that I personally dislike in the world—e.g. jockeying for credit, blame avoidance, the act / omission distinction, social signaling, and so on.

Nevertheless, I still think giving AGIs first-person preferences is the right move for safety. Until we can establish super-reliable 12th-generation AGIs, I’d like them to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. Humans have this notion, after all, and it seems at least relatively robust—for example, if I build a bank-robbing robot, and then it robs the bank, and then I protest “Hey I didn’t do anything wrong; it was the robot!”, I wouldn’t be fooling anybody, much less myself. An AGI with such a preference scheme would presumably be cautious and conservative when deciding what to do, and would default to inaction when in doubt. That seems generally good, which brings us to our next topic:

14.4 Conservatism and concept-extrapolation

14.4.1 Why not just relentlessly optimize the right abstract concept?

Let’s take a step back.

Suppose we build an AGI such that it has positive valence on the abstract concept “there will be lots of human flourishing”, and consequently makes plans and takes actions to make that concept happen.

I actually find it pretty plausible that we’ll be able to do that, from a technical perspective. Just as above, we can use labeled YouTube videos and so on to make a Thought Assessor for “this thought / plan will lead to human flourishing”, and then base the reward function purely on that one Thought Assessor (cf. Post #7).

And then we set the AGI loose on an unsuspecting world, to go do whatever it thinks is best to do.

What could go wrong?

The problem is that the concept of “human flourishing” is an abstract concept in the AGI’s world-model—really, it’s just a fuzzy bundle of learned associations. It’s hard to know what actions a desire for “human flourishing” will induce, especially as the world itself changes, and the AGI’s understanding of the world changes even more. In other words, there is no future world that will perfectly pattern-match to the AGI’s current notion of “human flourishing”, and if an extremely powerful AGI optimized the world for the best possible pattern-match, we might wind up with something weird, even catastrophic. (Or maybe not! It’s pretty hard to say, more on which in §14.6.)

As some random examples of what might go wrong: maybe the AGI would take over the world and prevent humans and human society from changing or evolving forevermore, because those changes would reduce the pattern-match quality. Or maybe the least-bad pattern-match would be the AGI wiping out actual humans in favor of an endless modded game of The Sims. Not that The Sims is a perfect pattern-match to “human flourishing”—it’s probably pretty bad! But maybe it’s less bad a pattern-match than anything the AGI could feasibly do with actual real-world humans. Or maybe as the AGI learns more and more, its world-model gradually drifts and changes, such that the frozen Thought Assessor winds up pointing at something totally random and crazy, and then the AGI wipes out humans to tile the galaxy with paperclips. I don’t know!

So anyway, relentlessly optimizing a fixed, frozen abstract concept like “human flourishing” seems maybe problematic. Can we do better?

Well, it would be nice if we could also continually refine that concept, especially as the world itself, and the AGI’s understanding of the world, evolves. This idea is what Stuart Armstrong calls Concept Extrapolation, if I understand correctly.

Concept extrapolation is easier said than done—there’s no obvious ground truth for the question of “what is ‘human flourishing’, really?” For example, what would “human flourishing” mean in a future of transhuman brain-computer hybrid people and superintelligent evolved octopuses and god-only-knows-what-else?

Anyway, we can consider two steps to concept extrapolation. First (the easier part), we need to detect edge-cases in the AGI’s preferences. Second (the harder part), we need to figure out what the AGI should do when it comes across such an edge-case. Let’s talk about those in order.

14.4.2 The easier part of concept extrapolation: Detecting edge-cases in the AGI’s preferences

I’m cautiously optimistic about the feasibility of making a simple monitoring algorithm that can watch an AGI’s thoughts and detect that it’s in an edge-case situation—i.e., an out-of-distribution situation where its learned preferences and concepts are breaking down.

(Understanding the contents of the edge-case seems much harder, as discussed shortly, but here I’m just talking about recognizing the occurrence of an edge-case.)

To pick a few examples of possible telltale signs that an AGI is at an edge-case:

The learned probability distributions for Thought Assessors (see Post #4 footnote) could have a wide variance, indicating uncertainty.
The different Thought Assessors of §14.2 could diverge in new and unexpected ways.
The AGI’s valence could flip back and forth between positive and negative in a way that indicates “feeling torn” while paying attention to different aspects of the same possible plan.
The AGI’s generative world-model could settle into a state with very low prior probability, indicating confusion.
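
Here is a rough Python sketch of what such a “simple monitoring algorithm” might look like for three of the signs above. The data structure `ThoughtSnapshot`, the field names, and all the numerical thresholds are hypothetical placeholders. I left out the “Thought Assessors diverge in new and unexpected ways” sign, because it would need some baseline of how the assessors normally relate to each other, and I don’t want to pretend to know what that looks like.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import List, Sequence


@dataclass
class ThoughtSnapshot:
    """Hypothetical readout of the AGI's state while it considers one plan."""
    assessor_samples: Sequence[Sequence[float]]  # samples from each assessor's learned distribution
    valence_history: Sequence[float]             # valence over the course of considering the plan
    world_model_log_prior: float                 # log prior probability of the world-model state


def count_sign_flips(values: Sequence[float]) -> int:
    """Count how often a sequence switches between positive and negative."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def edge_case_signs(
    snap: ThoughtSnapshot,
    max_variance: float = 0.25,    # placeholder threshold
    max_flips: int = 3,            # placeholder threshold
    min_log_prior: float = -50.0,  # placeholder threshold
) -> List[str]:
    """Return the names of the telltale signs present in this snapshot."""
    signs = []
    if any(pvariance(s) > max_variance for s in snap.assessor_samples):
        signs.append("high_assessor_variance")  # wide distribution = uncertainty
    if count_sign_flips(snap.valence_history) > max_flips:
        signs.append("valence_flipping")        # "feeling torn"
    if snap.world_model_log_prior < min_log_prior:
        signs.append("low_prior_world_model_state")  # confusion
    return signs


ordinary = ThoughtSnapshot(
    assessor_samples=[[0.50, 0.52, 0.49], [0.80, 0.79, 0.81]],
    valence_history=[0.3, 0.4, 0.35],
    world_model_log_prior=-5.0,
)
weird = ThoughtSnapshot(
    assessor_samples=[[-1.0, 1.0, 0.0]],
    valence_history=[0.5, -0.5, 0.4, -0.3, 0.2],
    world_model_log_prior=-80.0,
)
```

Note that this only says that something is off, not what is off, which is the distinction drawn in the parenthetical above.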

14.4.3 The harder part of concept extrapolation: What to do at an edge case

I don’t know of any good answer. Here are some options.

14.4.3.1 Option A: Conservatism—When in doubt, just don’t do it!

A straightforward approach would be that if the AGI’s edge-case-detector fires, it forces the valence signal negative—so that whatever thought the AGI was thinking is taken to be a bad thought / plan. This would loosely correspond to a “conservative” AGI.

(Side note: I think there may be many knobs we can turn in order to make a brain-like AGI more or less “conservative”, in different respects. The above is just one example. But they all seem to have the same issues.)
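
In code, the simplest version of this override is almost trivial. The sketch below assumes a hypothetical edge-case detector that outputs a score between 0 and 1; the function name, the `fire_threshold` knob, and the `penalty` value are all made up for illustration.

```python
def conservative_valence(raw_valence, edge_score, fire_threshold=0.1, penalty=-1.0):
    """Return the valence actually used to evaluate a thought / plan.

    `edge_score` is the output of a hypothetical edge-case detector in [0, 1].
    `fire_threshold` is one possible "conservatism knob": the lower it is,
    the more readily the detector fires and the more conservative the AGI.
    """
    if edge_score > fire_threshold:
        # Force the valence negative, without making an already-bad
        # thought look any better than it did before.
        return min(raw_valence, penalty)
    return raw_valence
```

For example, `conservative_valence(0.8, edge_score=0.5)` returns `-1.0`, so an appealing-looking plan gets rejected because the detector fired, whereas `conservative_valence(0.8, edge_score=0.05)` leaves the valence at `0.8`.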

A failure mode of a conservative AGI is that the AGI just sits there, not doing anything, paralyzed by indecision, because every possible plan seems too uncertain or risky.

An “AGI paralyzed by indecision” is a failure mode, but it’s not a dangerous failure mode. Well, not unless we were foolish enough to put this AGI in charge of a burning airplane plummeting towards the ground. But that’s fine—in general, I think it’s OK to have first-generation AGIs that can sometimes get paralyzed by indecision, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.

However, if the AGI is always paralyzed by indecision—such that it can’t get anything done—now we have a big problem. Presumably, in such a situation, future AGI programmers would just dial the “conservatism” knob down lower and lower, until the AGI started doing useful things. And at that point, it’s unclear if the remaining conservatism would be sufficient to buy us safety.

I think it would be much better to have a way for the AGI to iteratively gain information to reduce uncertainty, while remaining highly conservative in the face of whatever uncertainty still remains. So how can we do that?

14.4.3.2 Option B: Dumb algorithm to seek clarification in edge-cases

Here’s a slightly-silly illustrative example of what I have in mind. As above, we could have a simple monitoring algorithm that watches the AGI’s thoughts, and detects when it’s in an edge-case situation. As soon as it is, the monitoring algorithm shuts down the AGI entirely, and prints out the AGI’s current neural net activations (and corresponding Thought Assessor outputs). The programmers use interpretability tools to figure out what the AGI is thinking about, and manually assign a valence / value / reward, overriding the AGI’s previous uncertainty with a highly-confident ground-truth.

That particular story seems unrealistic, mainly because I’m skeptical that we’ll have the speed, manpower, and interpretability tools to keep up with how often I expect this situation to trigger. But maybe there’s a better approach than just printing out billions of neural activations and corresponding Thought Assessors?

The tricky part is that AGI-human communication is fundamentally a hard problem. It’s unclear to me whether it will be possible to solve that problem via a dumb algorithm. The situation here is very different from, say, an image classifier, where we can find an edge-case picture and just show it to the human. The AGI’s thoughts may be much more inscrutable than that.

By analogy, human-human communication is possible, but not by any dumb algorithm. We do it by leveraging the full power of our intellect—modeling what our conversation partner is thinking, strategically choosing words that will best convey a desired message, and learning through experience to communicate more and more effectively. So what if we try that approach?

14.4.3.3 Option C: The AGI wants to seek clarification in edge-cases

If I’m trying to help someone, I don’t need any special monitoring algorithm to prod me to seek clarification at edge-cases. Seeking clarification at edge-cases is just what I want to do, as a self-aware properly-motivated agent.

So what if we make our AGIs like that?

At first glance, this approach would seem to solve all the problems mentioned above. Not only that, but the AGI can use its full powers to make everything work better. In particular, it can learn its own increasingly-sophisticated metacognitive heuristics to flag edge-cases, and it can learn and apply the human’s meta-preferences about how and when the AGI should ask for clarification.

But there’s a catch. I was hoping for a conservatism / concept extrapolation system that would help protect us from misdirected motivations. If we implement conservatism / concept extrapolation via the motivation system itself, then we lose that protection.

More specifically: if we go up a level, the AGI still has a motivation (“seek clarification in edge-cases”), and that motivation is still an abstract concept that we have to extrapolate into out-of-distribution edge cases (“What if my supervisor is drunk, or dead, or confused? What if I ask a leading question?”). And for that concept extrapolation problem, we’re plowing ahead without a safety net.

Is that a problem? Bit of a long story:

Side-debate: Will “helpfulness”-type preferences “extrapolate” safely just by recursively applying to themselves?

In fact, a longstanding debate in AGI safety is whether these kinds of helpful / corrigible AGI preferences (e.g. an AGI’s desire to understand and follow a human’s preferences and meta-preferences) will “extrapolate” in a desirable way without any “safety net”—i.e., without any independent ground-truth mechanism pushing the AGI’s preferences in the right direction.

In the optimistic camp is Paul Christiano, who argued in “Corrigibility” (2017) that there would be “a broad basin of attraction towards acceptable outcomes”, based on, for example, the idea that an AGI’s preference to be helpful will result in the AGI having a self-reflective desire to continually edit its own preferences in a direction humans would like. But I don’t really buy that argument for reasons in my 2020 post—basically, I think there are bound to be sensitive areas like “what does it mean for people to want something” and “what are human communication norms” and “inclination to self-monitor”, and if the AGI’s preferences drift along any of those axes (or all of them simultaneously), I don’t think those preferences would self-correct.

Meanwhile, in the strongly-pessimistic camp is Eliezer Yudkowsky, I think mainly because of an argument (e.g. this post, final section) that we should expect powerful AGIs to have consequentialist preferences, and that consequentialist preferences seem incompatible with corrigibility. But I don’t really buy that argument either, for reasons in my 2021 “Consequentialism & Corrigibility” post—basically, I think there are possible preferences that are reflectively-stable, and that include consequentialist preferences (and thus are compatible with powerful capabilities), but are not purely consequentialist (and thus are compatible with corrigibility). A “preference to be helpful” seems like it could plausibly develop into that kind of hybrid preference scheme.

Anyway, I’m uncertain but leaning pessimistic. For more on the topic, see also Wei Dai’s recent post, and RogerDearnaley’s, and the comment sections of all of the posts linked above.

14.4.3.4 Option D: Something else?

I dunno.

14.5 Getting a handle on the world-model itself

The elephant in the room is the giant unlabeled generative world-model that lives inside the Thought Generator. The Thought Assessors provide a window into this world-model, but I’m concerned that it may be a rather small, foggy, and distorted window. Can we do better?

Ideally, we’d like to prove things about the AGI’s motivation. We’d like to say “Given the state of the AGI’s world-model and Thought Assessors, the AGI is definitely motivated to do X” (where X=be helpful, be honest, not hurt people, etc.) Wouldn’t that be great?

But we immediately slam into a brick wall: How do we prove anything whatsoever about the “meaning” of things in the world-model, and thus about the AGI’s motivation? The world is complicated, and therefore the world-model is complicated. The things we care about are fuzzy abstractions like “honesty” and “helpfulness”—see the Pointers Problem. The world-model keeps changing as the AGI learns more, and as it makes plans that would entail taking the world wildly out-of-distribution (e.g. planning the deployment of a new technology). How can we possibly prove anything here?

I still think the most likely answer is “We can’t”. But here are two possible paths anyway. For some related discussion, see Eliciting Latent Knowledge, and especially Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (Dalrymple et al., 2024).

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well. (Update: John disagrees with this characterization, see his comment.)

I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

Proof strategy #2 would start with a human-legible “reference world-model” (e.g. Cyc). This reference world-model wouldn’t be constrained to be built out of localized objects in a 3D world, so unlike the above, it could and probably would contain things like “honesty” and “solar cell efficiency” and “daytime”.

Then we try to directly match up things in the “reference world-model” with things in the AGI’s world-model.

Will they match up? No, of course not. Probably the best we can hope for is a fuzzy, many-to-many match, with various holes on both sides.

It’s hard for me to see a path to rigorously proving anything about the AGI’s motivations using this approach. Nevertheless, I continue to be amazed that unsupervised machine translation is possible at all, and I take that as an indirect hint that if pieces of two world-models match up with each other in their internal structure, then those pieces are probably describing the same real-world thing. So maybe I have the faintest glimmer of hope.

I’m unaware of work in this direction, possibly because it’s stupid and doomed, and also possibly because, as far as I can tell, we don’t currently have any really great open-source human-legible world-models to run experiments on. The latter seems like it should be a fixable problem, so someone should fix it. I’ve mused about trying to open-source Cyc, but to be clear, that’s probably just one of many ways to develop a rich, accurate, and (most importantly) human-legible open-source world-model.

(See also some helpful discussion in Towards Guaranteed Safe AI about how to build an open-source human-legible world-model, although they have in mind a different end-use for it than I do. Indeed, there are lots of different reasons to want an awesome open-source human-legible world-model! All the more reason to make one!)
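
To make the structural-matching intuition behind proof strategy #2 a bit more concrete, here is a toy sketch. It is purely illustrative and rests on loud assumptions: both “world-models” are tiny hand-written sets of directed relations, the concept names and the `best_alignment` helper are hypothetical, and the two graphs happen to match exactly. A real case would be the fuzzy, many-to-many match with holes described above, and brute-force search would be hopeless at scale. The only point is that node labels are never compared; the alignment is recovered from internal structure alone.

```python
from itertools import permutations

# Hypothetical human-legible "reference world-model": (source, target) relations.
reference_edges = {
    ("sun", "daytime"),
    ("daytime", "solar_output"),
    ("sun", "solar_output"),
    ("solar_output", "efficiency"),
}

# Hypothetical learned world-model with opaque node names.
# By construction it has the same relational structure as the reference.
agi_edges = {
    ("n3", "n0"),
    ("n0", "n2"),
    ("n3", "n2"),
    ("n2", "n1"),
}


def nodes(edges):
    """Return the sorted list of node names appearing in a set of edges."""
    return sorted({name for edge in edges for name in edge})


def best_alignment(ref_edges, other_edges):
    """Brute-force search for the injective node mapping that preserves
    the most reference relations. Labels are never compared.

    Assumes the other model has at least as many nodes as the reference.
    """
    ref_nodes = nodes(ref_edges)
    other_nodes = nodes(other_edges)
    best_score, best_map = -1, None
    for perm in permutations(other_nodes, len(ref_nodes)):
        mapping = dict(zip(ref_nodes, perm))
        # Count reference relations that also hold between the mapped nodes.
        score = sum((mapping[a], mapping[b]) in other_edges for a, b in ref_edges)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_map, best_score


mapping, score = best_alignment(reference_edges, agi_edges)
print(mapping, score)
```

In this toy case every node has a distinct pattern of incoming and outgoing relations, so exactly one mapping preserves all four of them. With richer, noisier models one would expect many near-ties instead, which is part of why it is hard to see a path from this kind of matching to a rigorous proof.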

14.6 Conclusion: mild pessimism about finding a good solution, uncertainty about the consequences of a lousy solution

I think we have our work cut out for us in figuring out how to solve the alignment problem via the “Controlled AGI” route (as defined in Post #12). There are a bunch of open problems, and I’m currently pretty stumped. We should absolutely keep looking for good solutions, but right now I’m also open to the possibility that we won’t find any. That’s why I continue to put a lot of my mental energy into the “Social-instinct AGI” path (Posts #12–#13), which seems somewhat less doomed to me, despite its various problems.

I note, however, that my pessimism is not universally shared—for example, as mentioned, Stuart Armstrong at AlignedAI appears optimistic about solving the open problem in §14.4, and John Wentworth and the Guaranteed Safe AI people appear optimistic about solving the open problem in §14.5. Let’s hope they’re right, wish them luck, and try to help!

To be clear, the thing I’m feeling pessimistic about is finding a good solution to “Controlled AGI”, i.e., a solution that we can feel extremely confident in a priori. A different question is: Suppose we try to make “Controlled AGI” via a lousy solution, like the §14.4.1 example (encapsulated in my post Plan for mediocre alignment of brain-like [model-based RL] AGI) where we imbue a super-powerful AGI with an all-consuming desire for the abstract concept of “human flourishing”, and the AGI then extrapolates that abstract concept arbitrarily far out of distribution in a totally-uncontrolled, totally-unprincipled way. Just how bad a future would such an AGI bring about? I’m very uncertain. Would such an AGI engage in mass torture? Umm, I guess I’m cautiously optimistic that it wouldn’t, absent a sign error from cosmic rays or whatever. Would it wipe out humanity? I think it’s possible!—see discussion in §14.4.1. But it might not! Hey, maybe it would even bring about a pretty awesome future! I just really don’t know, and I’m not even sure how to reduce my uncertainty.

In the next post, I will wrap up the series with my wish-list of open problems, and advice on how to get into the field and help solve them!

Changelog

July 2024: Since the initial version, I’ve made only minor changes. Mostly I added links to more recent content, particularly my own Plan for mediocre alignment of brain-like [model-based RL] AGI (which is basically a simpler self-contained version of part of this post), and Dalrymple et al.’s Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, which is relevant to §14.5.

January 2026: Various minor edits and updates.
