Getting from an unaligned AGI to an aligned AGI?

Summary / Preamble

In AGI Ruin: A List of Lethalities, Eliezer writes “A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.”

I have larger error-bars than Eliezer on some AI-safety-related beliefs, but I share many of his concerns (thanks in large part to being influenced by his writings).

In this series I will try to explore if we might:

Start out with a superintelligent AGI that may be unaligned (but seems superficially aligned)
Only use the AGI in ways where its channels of causal influence are minimized (and where great steps are taken to make it hard for the AGI to hack itself out of the “box” it’s in)
Work quickly but step-by-step towards an AGI-system that probably is aligned, enabling us to use it in more and more extensive ways (as we get more assurances that it’s aligned)

From the AGI-system we may (directly or indirectly) obtain programs that are interpretable and verifiable. These specialized programs could give us new capabilities, and we may trust these capabilities to be aligned and safe (even if we don’t trust the AGI to be so). We may use these capabilities to help us with verification, widening the scope of programs we are able to verify (and maybe helping us to make the AGI-system safer to interact with). This could perhaps be a positive feedback-loop of sorts, where we get more and more aligned capabilities, and the AGI-system becomes safer and safer to interact with.

The reasons for exploring these kinds of strategies are two-fold:

Maybe we won’t solve alignment prior to getting superintelligence (even though it would be better if we did!)
Even if we think we have solved alignment prior to superintelligence, some of the techniques and strategies outlined here could be encouraged as best practice, so that we get additional layers of alignment-assurance.

The strategy as a whole involves many iterative and contingency-dependent steps working together. I don’t claim to have a 100% watertight and crystalized plan that would get us from A to B. Maybe some readers could be inspired to build upon some of the ideas or analyze them more comprehensively.

Are any of the ideas in this series new? See here for a discussion of that.

Me: I have some ideas about how to make use of an unaligned AGI-system to make an aligned AGI-system.

Imaginary friend: My system 1 is predicting that a lot of confused and misguided ideas are about to come out of your mouth.

Me: I guess we’ll see. Maybe I’m missing the mark somehow. But do hear me out.

Imaginary friend: Ok.

Me: First off, do we agree that a superintelligence would be able to understand what you want when asking for something, presuming that it is given enough information?

Imaginary friend: Well, kind of. Often there isn’t really a clear answer to what you want.

Me: Sure. But it would probably be good at predicting what looks to me like good answers. Even if it isn’t properly aligned, it would probably be extremely good at pretending to give me what I want. Right?

Imaginary friend: Agreed.

Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful, with no easy way for me to see a difference between what I’m provided and what I would be provided if it was aligned. Right?

Imaginary friend: I guess I sort of agree. Like, if it answered your request it would probably look really convincing. But maybe it doesn’t answer your question. It could find a security vulnerability in the OS, and hack itself onto the internet somehow—that would be game over before you even got to ask it any questions. Or maybe you didn’t even try to box it in the first place, since you didn’t realize how capable your AI-system was getting, and it was hiding its capabilities from you.

Or maybe it socially manipulated you in some really clever way, or “hacked” your neural circuitry somehow through sensory input, or figured out some way it could affect the physical world from within the digital realm (e.g. generating radio waves by “thinking”, or some thing we don’t even know is physically possible).

When we are dealing with a system that may prefer to destroy us (for instrumentally convergent reasons), and that system may be orders of magnitude smarter than ourselves, it’s better to be too careful than not paranoid enough.

Me: I agree with all that. But it’s hard to cover all the branches of things that should be considered in one conversation-path. So for the time being, let’s assume a hypothetical situation where the AI is “boxed” in. And let’s assume that we know it’s extremely capable, and that it can’t “hack” itself out of the box in some direct way (like exploiting a security flaw in the operating system). Ok?

Imaginary friend: Ok.

Me: I presume you agree that there are more and less safe ways to use a superintelligent AGI-system? To give an exaggerated example: There is a big difference between “letting it onto the internet” and “having it boxed in, only giving it multiplication questions, and only letting it answer yes or no”.

Imaginary friend: Obviously. But even if you only give it multiplication-questions, some other team will sooner or later develop AGI and be less scrupulous.

Me: Sure. But still, we agree that there are more and less safe ways to try to use an AGI? There is a “scale” of sorts?

Imaginary friend: Of course.

Me: Would you also agree that there is a “scale” for how hard it is for an oracle/genie to “trick” you into falsely believing that it has provided you with what you want? For example, if I ask it to prove a mathematical conjecture, that is much harder to “pretend” to do the way I want it without actually doing it (compared to most things)?

Imaginary friend: Sure.

Me: What I want to talk about are ways of asking an AGI genie/oracle for things in ways where it’s hard for it to “pretend” that it’s giving us what we want without doing it. And ways we might leverage that to eventually end up with an aligned AGI-system, while trying to keep the total risk (of all the steps we take) low.

Imaginary friend: My system 1 suspects I am about to hear some half-baked ideas.

Me: And your system 1 may have a point. I don’t claim to have detailed and watertight arguments showing exactly how we get from A to B. What I have is an outline of how we might get there, while minimizing risk along the way (not to 0%, but to way less than 50% if suggestions are executed faithfully).

Believe me, I don’t have full knowledge and understanding of everything that has been explored in the AI alignment community…

Imaginary friend: I believe you.

Me: …but it seems to me that the kinds of techniques and strategies I’ll be outlining are under-discussed.

Imaginary friend: It sounds like you are looking for ways to avoid having to solve the alignment problem, so that you can focus on other things instead of solving it beforehand.

Me: I guess it can be interpreted that way. I don’t want it to come across that way though! If we find reliable methods to align superintelligent AGIs before we get superintelligent AGIs, then that would improve our odds of success!

But maybe we don’t. If we don’t, then it’s better to have some plans ready, so that less improvisation is necessary. AFAIK there is no fire alarm for AGI, and it certainly doesn’t seem too early to start analyzing these kinds of scenarios in a more comprehensive way.

And also, even if we think we have solved alignment in a reliable way prior to superintelligence, some of the strategies and principles I’ll discuss can be used as additional layers of alignment-assurance. Hence me thinking that it may be a good idea to develop these ideas further, and to encourage them as best practice.

Restrictions in expressivity of AI

Me: The first part of this discussion will focus on things that we can do while the AGI stays in the digital realm. We will get to the question of how to do complicated things in the non-digital realm (and on the internet), but that’s for later.

Imaginary friend: You understand that any use of an AGI enables it to affect the non-digital realm, right? If you let it write or modify code, that is a way for the AGI to communicate with humans, and communicating with humans is a way to affect the non-digital realm.

Me: Yes, I agree with that. But I would point out that it is up to us how much “expressivity” we give the AI-system. For example, if we get help from it to point out potential mistakes in some piece of code, maybe we don’t let it write text to explain itself (limiting instead to things such as pointing out areas of code to take an extra look at, highlighting examples of input that may produce unintended behavior, etc). And if we do let it write text, then it is up to us how we limit the expressivity of what it can write in text.
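As a minimal sketch of what limiting expressivity could look like for the code-review case, assume the reviewing AI may only return a short list of line ranges to re-inspect. The names `LineRange` and `parse_restricted_report`, and the limit of five ranges, are illustrative assumptions and not part of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineRange:
    """A span of source lines to re-inspect (1-indexed, inclusive)."""
    start: int
    end: int


def parse_restricted_report(raw, total_lines, max_ranges=5):
    """Accept only a short list of [start, end] integer pairs.

    Anything else (strings, floats, extra fields, too many entries) is
    rejected outright, so free text never reaches the human reviewer.
    """
    if not isinstance(raw, list) or len(raw) > max_ranges:
        raise ValueError("report must be a list of at most max_ranges entries")
    ranges = []
    for item in raw:
        if not (isinstance(item, list) and len(item) == 2):
            raise ValueError("each entry must be a [start, end] pair")
        start, end = item
        # type(...) is int also rejects bool, which is a subclass of int.
        if type(start) is not int or type(end) is not int:
            raise ValueError("line numbers must be plain integers")
        if not (1 <= start <= end <= total_lines):
            raise ValueError("range lies outside the file")
        ranges.append(LineRange(start, end))
    return ranges
```

A schema like this narrows the channel rather than closing it: the choice of which lines to flag still carries information.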

Imaginary friend: Ok, did you by the way notice that you spoke about not letting it write text—but letting it write example input? Well, example input can be text! Those kinds of things can be easy to miss, and often you may not be aware of what you’re missing.

Me: I may miss such things sometimes, but often I’ll be aware and leave them out so as not to get into tangents. I did notice the one you pointed out now (as well as several more in the example I just gave). Partly thanks to you helping me to get better at noticing those sorts of things.

Imaginary friend: That’s what imaginary friends are for.

Keep in mind that any way to point out bugs or potential problems is a way for the AI to affect how you think and what you think about. Take this test of selective attention for example. Or look at how magicians focus and misdirect people’s attention to trick them. There are lots of things like this.

Me: I’m certainly not claiming that the techniques I’ll describe are risk-free. At the same time, it’s important to consider how different techniques can come together to reduce risk. For example, one thing that will be touched upon again and again is testing the convergence of the results from different methods and architectures and pieces of code. Preferably, if an AI-system writes the code for some module, then the overall system should ensure that we also have that module be written by other AI-systems, and also by humans if possible.

With regard to e.g. AIs pointing out potential issues, the system as a whole (a system of systems of systems…) would preferably keep track of the changes made due to such suggestions, and also try to keep track of what the output of the system would be if those changes hadn’t been made.
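The two ideas above (checking whether independently written versions of a module converge, and tracking what a suggested change actually altered) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the function names, the three stand-in implementations, and the deliberately buggy “human” version are all hypothetical.

```python
def check_convergence(implementations, test_inputs):
    """Run independently written versions of one module on shared inputs.

    Returns the inputs on which the versions disagree, together with
    each version's output, so a human can inspect the disagreement.
    """
    disagreements = []
    for x in test_inputs:
        outputs = {name: impl(x) for name, impl in implementations.items()}
        values = list(outputs.values())
        if any(v != values[0] for v in values[1:]):
            disagreements.append((x, outputs))
    return disagreements


def counterfactual_diff(original, patched, test_inputs):
    """List inputs whose output changed after an AI-suggested change."""
    return [x for x in test_inputs if original(x) != patched(x)]


# Hypothetical stand-ins for one "absolute value" module written three times.
implementations = {
    "ai_system_a": lambda n: abs(n),
    "ai_system_b": lambda n: n if n >= 0 else -n,
    "human_team": lambda n: n,  # buggy on purpose: ignores negative input
}

disagreements = check_convergence(implementations, [-3, 0, 4])
changed = counterfactual_diff(
    implementations["human_team"], implementations["ai_system_a"], [-3, 0, 4]
)
```

Agreement between versions is evidence, not proof: versions can share a blind spot, and the test inputs only cover what someone thought to try.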

Imaginary friend: Ok. But just because something makes you more safe, and you can handwave to a lot of techniques where each thing helps a bit with safety—this does not show that the process as a whole will be safe. And I am a bit worried about you proposing ideas that are hard to criticize because they’re complicated and vague, instead of getting to the “core” of the problem. Also, adding on lots of complications to increase security can in some cases be counter-productive.

Me: Some good points there. Although I do feel uncertain about the degree to which there is a “core” of the problem. I want this series to outline a path towards a more “complete” solution, but I want to start by focusing more on techniques and principles that are “building-blocks”.

The best scenario is if the first superintelligent AGI is robustly aligned from the start. I think we should try hard to make that be the case, and I hope this series isn’t much of a distraction from that.

I am reminded of how I am in support of trying to reduce CO2 emissions today, but also support research on geoengineering.

Topics I’ll cover in this series

Imaginary friend: So where do you see this discussion going?

Me: I guess we’ll see. But I do have some plans.

One thing I want to discuss is the concept of “solution-generators” (and maybe also “solution-generators-generators”, etc). When asking an oracle/genie for something, it may often make sense to not ask the question directly, but to instead ask for a “solution-generator”—that is to say, a function that generates solutions within the domain in question. These “generators” should be optimized for being narrow/specialized/modular/verifiable (techniques for trying to ensure that the system is faithfully optimizing the “generator” for these properties is a topic by itself).

For example, if we want to do a physics-simulation, then we might want to use a “simulation-generator” or “simulation-generator-generator”. One merit of this is that the code can be easier to inspect and understand than the code of a superintelligent AGI. Another merit is that the “generator” can be tested against existing data.

But of course, you can also ask it to answer the question directly (in addition to asking for the “solution-generator”). And then you can see if the output is the same when you obtain it indirectly vs when you obtain it directly.

Imaginary friend: Isn’t it better not to ask it the specific question, so that it’s harder for it to guess what you’re after?

Me: I will mostly not be mentioning this sort of thing explicitly, but yes I agree, you shouldn’t ask both questions of the same instance of the AGI-system. The instance that is asked for A should not know that you also have asked it for B. Though of course, you could have additional instances that are asked both questions, and check if results/output converge.
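To make the generator idea a bit more concrete, here is a toy sketch. It assumes the “simulation-generator” we received is a small function for free-fall distance, tests it against data we already trust, and then compares its answer with one obtained directly from a separate instance. All names and numbers are illustrative assumptions; the observations are computed from the textbook formula and are not real measurements.

```python
import math


def make_fall_simulator(gravity):
    """Toy 'simulation-generator': returns a function predicting fall distance."""
    def simulate(seconds):
        return 0.5 * gravity * seconds ** 2
    return simulate


def matches_existing_data(simulator, observations, rel_tol=0.01):
    """Check the generated simulator against data we already trust."""
    return all(
        math.isclose(simulator(t), distance, rel_tol=rel_tol)
        for t, distance in observations
    )


def agrees_with_direct_answer(simulator, question, direct_answer, rel_tol=0.01):
    """Compare the indirect (generated) answer with one obtained directly."""
    return math.isclose(simulator(question), direct_answer, rel_tol=rel_tol)


simulator = make_fall_simulator(gravity=9.81)

# Illustrative (seconds, metres) pairs standing in for existing data.
observations = [(1.0, 4.905), (2.0, 19.62)]
passes_data_check = matches_existing_data(simulator, observations)

# A direct answer from a separate instance that was not told about the generator.
direct_matches = agrees_with_direct_answer(simulator, 3.0, 44.145)
```

The point of the design is that the small generator can be read and tested by humans, while the direct answer serves only as a cross-check.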

Another topic I want to speculate a bit about is the topic of brain emulations, and also other “generators” that try to predict the behavior and thinking of humans. I’m interested in the question of “Can we bootstrap from AGI to either brain emulations and/or something else that can predict human behavior, while more or less staying within the digital realm?”. Obviously, there are lots of ethical considerations here, and it’s very important to avoid suffering subroutines insofar as possible!

Me: I also want to discuss techniques involving computable argument-trees/“proofs”, where every inference-step is made explicit. I also want to outline possible techniques to have the concepts/propositions of such proofs represent more or less any thought that is sufficiently “clear/crisp/explicit” (blurring the distinction between “mathematical” proofs and any other argument about anything). Included in the discussion will be outlines of ideas for how to deal with “vagueness” and cluster-like concepts within such argument-trees/“proofs”.

And I’ll be outlining thoughts about capabilities that I think will help with verifying that instructions for doing things in the real world will work as intended, such as copying a strawberry at the molecular level without unintended consequences.

The more smart and thoughtful people who see things differently from you, the more reason for self-doubt about your own judgment. And this is for me a significant source of uncertainty about my ideas in regards to alignment (and AI more generally). But it seems best to me to try to describe my perspective as well as I can, and then people can do with that what seems best to them.

Imaginary friend: Talk to you later then.

Any feedback or comments (be that positive or negative or neither) would be received with interest.
