Plan for mediocre alignment of brain-like [model-based RL] AGI

(This post is a simpler, self-contained, and more pedagogical version of Post #14 of Intro to Brain-Like AGI Safety.)

(Vaguely related to this Alex Turner post and this John Wentworth post.)

I would like to have a technical plan for which there is a strong robust reason to believe that we’ll get an aligned AGI and a good future. This post is not such a plan.

However, I also don’t have a strong reason to believe that this plan wouldn’t work. Really, I want to throw up my hands and say “I don’t know whether this would lead to a good future or not”. By “good future” here I don’t mean optimally-good—whatever that means—but just “much better than the world today, and certainly much better than a universe full of paperclips”. I currently have no plan, not even a vague plan, with any prayer of getting to an optimally-good future. That would be a much narrower target to hit.

Even so, that makes me more optimistic than at least some people.[1] Or at least, more optimistic about this specific part of the story. In general I think many things can go wrong as we transition to the post-AGI world—see discussion by Dai & Soares—and overall I feel very doom-y, particularly for reasons here.

This plan is specific to the possible future scenario (a.k.a. “threat model” if you’re a doomer like me) that future AI researchers will develop “brain-like AGI”, i.e. learning algorithms that are similar to the brain’s within-lifetime learning algorithms. (I am not talking about evolution-as-a-learning-algorithm.) These algorithms, I claim, are in the general category of model-based reinforcement learning. Model-based RL is a big and heterogeneous category, but I suspect that for any kind of model-based RL AGI, this plan would be at least somewhat applicable. For very different technological paths to AGI, this post is probably pretty irrelevant.

But anyway, if someone published an algorithm for x-risk-capable brain-like AGI tomorrow, and we urgently needed to do something, this blog post is more-or-less what I would propose to try. It’s the least-bad plan that I currently know.

So I figure it’s worth writing up this plan in a more approachable and self-contained format.

1. Intuition: Making a human into a moon-lover (“selenophile”)

Try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.”

You stand there with your mouth agape, muttering to yourself in hushed tones: “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you’re on some psychedelic at the time, or this is happening during your impressionable teenage years, or whatever.) You basically transform into a “moon fanboy” / “moon fangirl” / “moon nerd” / “selenophile”.

How would that change your motivations and behaviors going forward?

  • You’re probably going to be much more enthusiastic about anything associated with the moon.

  • You’re probably going to spend a lot more time gazing at the moon when it’s in the sky.

  • If there are moon-themed trading cards, maybe you would collect them.

  • If NASA is taking volunteers to train as astronauts for a trip to the moon, maybe you’d enthusiastically sign up.

  • If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that, and motivated to stop them.

Hopefully this is all intuitive so far.

What’s happening mechanistically in your brain? As background, I think we should say that one part of your brain (the cortex, more-or-less) has “thoughts”, and another part of your brain (the basal ganglia, more-or-less) assigns a “value” (in RL terminology) a.k.a. “valence” (in psych terminology) to those thoughts.

And what happened in the above intervention is that your value function was edited such that thoughts-involving-the-moon would get very positive valence. Thoughts-involving-the-moon include just thinking about the moon by itself, but also include things like “the idea of collecting moon trading cards” and “the idea of going to the moon”.

Slightly more detail: As a simple and not-too-unrealistic model, we can imagine that the world-model is compositional, and that the value function is linearly additive over the compositional pieces. So if a thought entails imagining a moon poster hanging on the wall, the valence of that thought would be some kind of weighted average of your brain’s valence for “the moon”, and your brain’s valence for “poster hanging on the wall”, and your brain’s valence for “white circle on a black background”, etc., with weights / details depending on precisely how you’re thinking about it (e.g. which aspects you’re attending to, what categories / analogies you’re implicitly invoking, etc.).
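To make the toy model concrete, here is a minimal Python sketch of a value function that is linearly additive over the compositional pieces of a thought. All the concepts, valences, and attention weights below are made-up numbers for illustration, not anything measured:

```python
# Toy model: valence of a composite thought = attention-weighted average
# of the valences of its constituent concepts. All numbers are invented.

# Learned valence assigned to each atomic concept in the world-model:
concept_valence = {
    "the moon": 8.0,                            # very high after the intervention
    "poster on the wall": 0.3,
    "white circle on black background": 0.0,
}

def thought_valence(thought):
    """Valence of a composite thought, where `thought` is a list of
    (concept, attention-weight) pairs."""
    total_weight = sum(w for _, w in thought)
    return sum(concept_valence[c] * w for c, w in thought) / total_weight

# A thought of "a moon poster hanging on the wall", attending mostly
# to the moon aspect:
poster_thought = [
    ("the moon", 0.6),
    ("poster on the wall", 0.3),
    ("white circle on black background", 0.1),
]
print(thought_valence(poster_thought))  # high, because "the moon" dominates
```

Shifting the attention weights (e.g. mostly attending to “white circle on a black background”) would drag the same thought’s valence down, which is the “depending on precisely how you’re thinking about it” part.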

So looking at the moon becomes positive-valence, but so do moon-themed trading cards, since the latter has “the moon” as one piece of the composite thought. Meanwhile the thought “A supervillain is going to blow up the moon” becomes negative-valence for technical reasons in the footnote.[2]

Anyway, assigning a positive versus negative valence to the concept “the moon” is objectively pretty weird. What in god’s name does it mean for “the moon” to be good or bad? It doesn’t even make sense. Yet people totally do that. They’ll even argue with each other about what valence assignment is “correct”.

(It makes sense to declare that an action or plan is good: you can do it! And it makes sense to declare that a state-of-the-world is good: you can try to bring it about! But the concept “the moon”, in and of itself, is none of those things. I strongly recommend Scott Alexander’s blog post Ethnic Tension and Meaningless Arguments musing on this topic.)

To be explicit, I think the ability to assign valence to concepts—even when doing so kinda makes no sense—is not learned, but rather a fundamental part of how brain learning algorithms work—it’s right there in the source code, so to speak. I think it’s at least plausibly how future AGIs will work too.

1.1 But wouldn’t a smart person recognize that “thinking the moon is awesome” is stupid and incoherent?

Yes! A smart person would indeed realize that assigning a positive valence to the moon is not really a thing that makes any sense.

But think about what happens when you’re doing ethical reasoning, or more generally / mundanely, when you’re deciding what to do: (1) you think a thought, (2) notice what its valence is, (3) repeat. There’s a lot more going on, but ultimately your motivations have to ground out in the valence of different thoughts, one way or the other.

Suppose I tell you “You really ought to put pebbles in your ears.” You say “Why?” And I say “Because, y’know, your ears, they don’t have any pebbles in them, but they really should.” And again you say “Why?” …At some point, this conversation has to ground out with something that you find inherently, intuitively positive-valence or negative-valence, in and of itself. Right?

And if I replace this dialogue with a monologue, where it’s just you in an empty room reflecting on what to do with your life, the same principle applies.

Now, as a human, you already have a whole complicated value function assigning positive and negative valence to all sorts of things, thanks to a lifetime of updates (ultimately tracing to reward function calculations centered around your hypothalamus & brainstem). But if we intervene to assign a high enough valence to the moon, compared to the preexisting valence of everything else in your world-model (justice and friendship and eating etc.), then it’s eventually going to shift your behavior towards—well I don’t know exactly, but towards activities and goals and plans and philosophies and values that heavily involve your “moon” concept.

2. Analogously, aligning an AGI to “human flourishing”

Let’s put aside the question of bootstrapping (see FAQ below) and assume that I have somehow built a brain-like AGI with some basic understanding of the world and ability to plan and get around. Assuming that AGI has already seen lots of human language, it will have certainly learned the human concept “human flourishing”—since after all it needs to understand what humans mean when they utter that specific pair of words. So then we can go into the AI and edit its value function such that whatever neural activations are associated with “human flourishing” get an extremely high value / valence. Maybe just to be safe, we can set the value/valence of everything else in the AGI’s world to be zero. And bam, now the AI thinks that the concept “human flourishing” is really great, and that feeling will influence how it assesses future thoughts / actions / plans.
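For concreteness, here is a toy sketch of that intervention. Obviously there is no real AGI API like this; a plain dictionary stands in for “valence assigned to each concept in the world-model”, and the concept names and numbers are invented:

```python
# Toy sketch of the proposed intervention: take the learned value function
# over world-model concepts, set the valence of "human flourishing" very
# high, and (just to be safe) zero out everything else.

def edit_value_function(valences, target="human flourishing", new_value=100.0):
    """Return an edited copy of the value function: `target` gets
    `new_value`, every other concept's valence is set to zero."""
    return {concept: (new_value if concept == target else 0.0)
            for concept in valences}

# Whatever valences the AGI happened to learn during training (invented):
learned = {"human flourishing": 1.2, "paperclips": 0.4, "self-preservation": 2.0}

edited = edit_value_function(learned)
print(edited)
# {'human flourishing': 100.0, 'paperclips': 0.0, 'self-preservation': 0.0}
```

The hard parts the sketch glosses over are exactly the ones discussed in the FAQ below: finding which neural activations correspond to “human flourishing” in the first place, and confirming that they mean what we hope they mean.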

Just as the previous section involved turning you into a “moon fanboy/fangirl”, we have now likewise made the AGI into a “human flourishing fanAGI”.

…And then what happens? I don’t know! It seems very hard to predict. The AGI has a “human flourishing” concept which is really a not-terribly-coherent bundle of pattern-match associations, the details of which are complex and hard to predict. And then the AGI will assess the desirability of thoughts / plans / actions based on how well they activate that concept. Some of those thoughts will be self-reflective, as it deliberates on the meaning of life etc. Damned if I know exactly what the AGI is going to do at the end of the day. But it seems at least plausible that it will do things that I judge as good, or even great, i.e. things vaguely in the category of “actualizing human flourishing in the world”.

Again, if a “moon fanboy/fangirl” would be very upset at the idea of the moon disappearing forever in a puff of smoke, then one might analogously hope that an extremely smart and powerful “human flourishing fanAGI” would be very upset at the idea of human flourishing disappearing from the universe, and would endeavor to prevent that from happening.

3. FAQ

Q: Wouldn’t the AGI self-modify to make itself falsely believe that there’s a lot of human flourishing? Or that human flourishing is just another term for hydrogen?

A: No, for the same reason that, if a supervillain is threatening to blow up the moon, and I think the moon is super-cool, I would not self-modify to make myself falsely believe that “the moon” is a white circle that I cut out of paper and taped to my ceiling.

The technical reason is: Self-modifying is a bit complicated, so I would presumably self-modify because I had a plan to self-modify. A plan is a type of thought, and I’m using my current value function to evaluate the appeal (valence) of thoughts. Such a thought would score poorly under my current values (under which the moon is not in fact a piece of paper taped to the ceiling), so I wouldn’t execute that plan. More discussion here.
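As a toy illustration of that technical reason: candidate plans, including “self-modify my beliefs/values”, are themselves thoughts, and they get scored by the current value function before anything is executed. The concepts and numbers here are invented:

```python
# Toy sketch of why the agent rejects self-modifying its own beliefs/values:
# a plan is evaluated by the CURRENT value function, applied to the plan's
# predicted outcome, before the plan ever runs.

current_valence = {
    "the moon (the real one)": 9.0,
    "paper circle taped to the ceiling": 0.1,
}

def plan_score(predicted_outcome_concepts):
    # The plan's appeal is the current valence of what it's predicted
    # to bring about / protect.
    return sum(current_valence[c] for c in predicted_outcome_concepts)

# Plan A: protect the actual moon from the supervillain.
# Plan B: self-modify so that I believe a paper circle is "the moon".
# Under Plan B the agent predicts it will end up cherishing the paper
# circle, and scores that outcome with its *current* values:
score_a = plan_score(["the moon (the real one)"])
score_b = plan_score(["paper circle taped to the ceiling"])
assert score_a > score_b  # so the self-modification plan never gets executed
```

The key feature is simply that there is no step at which the post-modification values get a vote; everything is filtered through the pre-modification value function.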

Q: Won’t the AGI intervene to prevent humans from turning into superficially-different transhumans? After all, “transhuman flourishing” isn’t a great pattern-match to “human flourishing”, right?

A: Hmm, yeah, that seems possible. And I think there are a lot of other issues like that too. As mentioned at the top, I never claimed that this was a great plan, only that it seems like it can plausibly get us to somewhere better than the status quo. I don’t have any better ideas right now.

Q: Speaking of which, why “human flourishing” in the first place? Why not “CEV”? Why not “I am being corrigible & helpful”?

A: Mostly I don’t know—I consider the ideal target an open question and discuss it more here. (It also doesn’t have to be just one thing.) But FWIW I can say what I was thinking when I opted to pick “human flourishing” as my example for this post, rather than either of those other two things.

First, why didn’t I pick “CEV”? Well in my mind, the concept “human flourishing” has a relatively direct grounding in various types of (abstractions over) plausible real-world situations—the kind of thing that could be pattern-matched to pretty well. Whereas when I imagine CEV, it’s this very abstruse philosophical notion in my mind. If we go by the “distance metric” of “how my brain pattern-matches different things with each other”, the things that are “similar” to CEV are, umm, philosophical blog posts and thought experiments and so on. In other words, at least for me, CEV isn’t a grounded real-world thing. I have no clue what it would actually look like in the end. If you describe a scenario and ask if it’s a good match to “maximizing CEV”, I would have absolutely no idea. So a plan centered around an AGI pattern-matching to the “CEV” concept seems like it just wouldn’t work.

(By the same token, a commenter in my last post on this suggested that “human flourishing” was inferior to “Do things that tend to increase the total subjective utility (weighted by amount of consciousness) of all sentient beings”. Yeah sure, that thing sounds pretty great, but it strikes me as a complicated multi-step composite thought, whereas what I’m talking about needs to be an atomic concept / category / chunk in the world-model, I think.)

Second, why not “I am being corrigible & helpful?” Well, I see two problems with that. One is: “the first-person problem”: Unless we have great interpretability (and I hope we do!), the only way to identify the neural activations for “I am being corrigible & helpful” is to catch the AGI itself in the act of being actually sincerely corrigible & helpful, and flag the corresponding neural activations. But we can’t tell from the AGI’s actions whether that’s happening—as opposed to the AGI acting corrigible & helpful for nefarious purposes. By contrast, the “human flourishing” concept can probably be picked up decently well from having the AGI passively watch YouTube and seeing what neural activations fire when a character is literally saying the words “human flourishing”, for example. The other problem is: I’m slightly skeptical that a corrigible docile helper AGI should be what we’re going for in the first place, for reasons here. (There’s also an objection that a corrigible helper AGI is almost guaranteed to be reflectively-unstable, or else not very capable, but I mostly don’t buy that objection for reasons here.)

Q: Wait hang on a sec. If we identify the “human flourishing” concept by “which neurons are active when somebody says the words ‘human flourishing’ while the AGI watches a YouTube video”, then how do you know that those neural activations are really “human flourishing” and not “person saying the words ‘human flourishing’”, or “person saying the words ‘human flourishing’ in a YouTube video”, etc.?

A: Hmm, fair enough. That’s a potential failure mode. Hopefully we’ll be more careful than just doing the YouTube thing and pressing “Go” on the AGI value-function-editing-routine. Specifically, once we get a candidate concept inside the AGI’s unlabeled world-model, I propose to do some extra work to try to confirm that this concept is indeed the “human flourishing” concept we were hoping for. That extra work would probably be broadly in the category of interpretability—e.g. studying when those neurons are active or not, what they connect to, etc.

(As a special case, it’s particularly important that the AGI winds up thinking that the real world is real, and that YouTube videos are not; making that happen might turn out to require at least some amount of training the AGI with a robot body in the real world, which in turn might pose competitiveness concerns.)

Q: If we set the valence of everything apart from “human flourishing” to zero, won’t the AGI just be totally incompetent? For example, wouldn’t it neglect to recharge its batteries, if the thought of recharging its batteries has zero valence?

A: In principle, an omniscient agent could get by with every other valence being zero, thanks to explicit planning / means-end reasoning. For example, it might think the thought “I’m going to recharge my battery and by doing so, eventually increase human flourishing” and that composite thought would be appealing (cf. the compositionality discussion above), so the AGI would do it. That said, for non-omniscient (a.k.a. real) agents, I think that’s probably unrealistic. It’s probably necessary-for-capabilities to put positive valence directly onto instrumentally-useful thoughts and behaviors—it’s basically a method of “caching” useful steps. I think the brain has an algorithm to do that, in which, if X (say, keeping a to-do list) is instrumentally useful for Y (something something human flourishing), and Y has positive valence, then X gets some positive valence too, at least after a couple repetitions. So maybe, after we perform our intervention that sets “human flourishing” to a high valence, we can set all the other preexisting valences to gradually decay away, and meanwhile run that algorithm to give fresh positive valences to instrumentally-useful thoughts / actions / plans.
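Here is a toy sketch of that proposed decay-plus-caching dynamic. The decay and inheritance constants are invented, and pinning the target concept at a fixed high valence each step is a simplifying assumption; whatever algorithm the brain actually uses is surely more subtle:

```python
# Toy sketch: preexisting valences gradually decay, while instrumentally
# useful concepts inherit some (discounted) valence from the goals they
# serve. All constants and concepts are invented for illustration.

DECAY = 0.9     # per-step decay factor applied to every valence
INHERIT = 0.5   # fraction of a goal's valence passed to an instrumental step

valence = {
    "human flourishing": 100.0,   # set high by our intervention
    "keep a to-do list": 0.0,     # instrumentally useful, no valence yet
    "old habit": 5.0,             # leftover preexisting valence
}

def update(valence, instrumental_links):
    """One step: decay everything, then let each instrumentally useful
    concept X inherit discounted valence from the goal Y it serves.
    `instrumental_links` is a list of (X, Y) pairs."""
    new = {c: v * DECAY for c, v in valence.items()}
    for x, y in instrumental_links:
        new[x] = max(new[x], INHERIT * valence[y])
    # Simplifying assumption: the intervention keeps the target pinned high.
    new["human flourishing"] = 100.0
    return new

for _ in range(20):
    valence = update(valence, [("keep a to-do list", "human flourishing")])

# By now: "old habit" has decayed toward zero, "keep a to-do list" has
# acquired positive valence, but still well below "human flourishing".
```

Note that the inherited valence plateaus at a fraction of the goal’s valence, which is the property leaned on in the next answer: instrumental subgoals end up positive, but not as positive as the thing they serve.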

Q: Whoa, but wait, if you do that, then in the long term the AGI will have positive valence on both “human flourishing” and various instrumentally-useful behaviors / subgoals that are not themselves “human flourishing”. And the source code doesn’t have any fundamental distinction between instrumental & final goals. So what if it reflects on the meaning of life and decides to pursue the latter at the expense of human flourishing?

A: Hmm. Yeah I guess that could happen. It also might not. I dunno.

I do think that, in this part of the learning algorithm, if X ultimately gets its valence from contributing to high-valence Y, then we wind up with X having some valence, but not as much as Y has. So it’s not unreasonable to hope that the “human flourishing” valence will remain much more positive than the valence of anything else, and thus “human flourishing” has a decent chance of carrying the day when the AGI self-reflects on what it cares about and what it should do in life. Also, “carrying the day” is a stronger claim than I need to make here; I’m really just hoping that its good feelings towards “human flourishing” will not be crushed entirely, and that hope is even more likely to pan out.

Q: What about ontological crises / what Stuart Armstrong calls “Concept Extrapolation” / what Scott Alexander calls “the tails coming apart”? In other words, as the AGI learns more and/or considers out-of-distribution plans, it might find that the web of associations corresponding to the “human flourishing” concept is splitting apart. Then what does it do?

A: I talk about that much more in §14.4 here, but basically I don’t know. The plan here is to just hope for the best. More specifically: As the AGI learns new things about the world, and as the world itself changes, the “human flourishing” concept will stop pointing to a coherent “cluster in thingspace”, and the AGI will decide somehow or other what it cares about, in its new understanding of the world. According to the plan discussed in this blog post, we have no control over how that process will unfold and where it will end up. Hopefully somewhere good, but who knows?

Q: This plan needs a “bootstrapping” step, where the AGI needs to be smart enough to know what “human flourishing” is before we intervene to give that concept a high value / valence. How does that work?

A: I dunno. We can just set the AGI up as if we were maximizing capabilities, and hope that, during training, the AGI will come to understand the “human flourishing” concept long before it is willing and able to undermine our plans, create backup copies, obfuscate its thoughts, etc. And then (hopefully) we can time our valence-editing intervention to happen within that gap.

Boxing could help here, as could (maybe) doing the first stage of training in passive (pure self-supervised) learning mode.

To be clear, I’m not denying that this is a possible failure mode. But it doesn’t seem like an unsolvable problem.

Q: What else could go wrong?

A: The motivations of this AGI would be very different from the motivations of any human (or animal). So I feel some very general cloud of uncertainty around this plan. I have no point of reference; I don’t know what the “unknown unknowns” are. So I assume other things could go wrong but I’m not sure what.

Q: If this is a mediocre-but-not-totally-doomed plan, then what’s the next step to make this plan incrementally better? Or what’s the next step to learn more about whether this plan would actually work?

A: There’s some more discussion here but I mostly don’t know. ¯\_(ツ)_/¯

I’m mostly spending my research time thinking about something superficially different from “directly iterating on this plan”, namely reverse-engineering human social instincts—see here for a very short summary of what that means and why I’m doing it. I think there’s some chance that this project will help illuminate / “deconfuse” the mediocre plan discussed here, but it might also lead to a somewhat different and hopefully-better plan.

This is what “human flourishing” looks like, according to Stable Diffusion (top) and DALL-E 2 (bottom). 🤔
  1. ^

    For example, I commonly hear things like “We currently have no plan with any prayer of aiming a powerful AGI at any particular thing whatsoever; our strong default expectation should be that it optimizes something totally random like tiny molecular squiggles.” E.g. Nate Soares suggests here that he has ≳90% credence on not even getting anywhere remotely close to an intended goal / motivation, if I’m understanding him correctly.

    Incidentally, this is also relevant to s-risks: There’s a school of thought that alignment research might be bad for s-risk, because our strong default expectation right now is a universe full of tiny molecular squiggles, which kinda sucks but at least it doesn’t involve any suffering, whereas alignment research could change that. But that’s not my strong default expectation. I think the plan I discuss here would be a really obvious thing that would immediately pop into the head of any future AGI developer (assuming we’re in the brain-like AGI development path), and this plan would have at least a decent chance of leading us to a future with lots of sentient life, for better or worse.

  2. ^

    I think if you imagine a supervillain blowing up the moon, it sorta manifests as a thought with two sequential steps, in which the moon is first present and then absent. I think such a thought gets the opposite-sign valence of the moon-concept itself, i.e. negative valence in this case, thanks to something vaguely related to the time derivative that shows up in Temporal Difference learning. I will omit the details, about which I still have a bit of uncertainty anyway, but in any case I expect these details to be obvious by the time we have AGI.
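A toy sketch of the TD-flavored mechanism gestured at here—scoring a two-step thought by something like the change in valence between its steps—with invented numbers:

```python
# Toy sketch: a two-step thought is scored by roughly the change in
# valence between its steps (loosely analogous to a temporal-difference
# error), so "moon present -> moon absent" comes out negative.

moon_valence = 8.0  # invented; set high by the intervention above

def two_step_valence(v_first, v_second):
    # Value of where the thought ends up, minus value of where it started.
    return v_second - v_first

# "Supervillain blows up the moon": moon present (valence 8), then absent (0):
print(two_step_valence(moon_valence, 0.0))  # -8.0, i.e. negative valence

# Conversely, "the moon appears" would score positively:
print(two_step_valence(0.0, moon_valence))  # +8.0
```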