Getting from unaligned to aligned (AGI-assisted alignment Part 1)

Me: I have some ideas about how to make use of an unaligned AGI-system to make an aligned AGI-system.

Imaginary friend: My system 1 is predicting that a lot of confused and misguided ideas are about to come out of your mouth.

Me: I guess we’ll see. Maybe I’m missing the mark somehow. But do hear me out.

Imaginary friend: Ok.

Me: First off, do we agree that a superintelligence would be able to understand what you want when you ask it for something, presuming that it is given enough background information?

Imaginary friend: Well, kind of. Often there isn’t really a clear answer to what you want.

Me: Sure. But it would probably be good at predicting what looks to me like good answers. Even if it isn’t properly aligned, it would probably be extremely good at pretending to give me what I want. Right?

Imaginary friend: Agreed.

Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful—with no easy way for me to see a difference between what I’m provided and what I would be provided if it were aligned. Right?

Imaginary friend: I guess I sort of agree. Like, if it answered your request it would probably look really convincing. But maybe it doesn’t answer your question. It could find a security vulnerability in the OS and hack itself onto the internet somehow, and it would be game over before you even got to ask it any questions. Or maybe you didn’t even try to box it in the first place, since you didn’t realize how capable your AI-system was getting, and it was hiding its capabilities from you.

Imaginary friend: Or maybe it socially manipulated you in some really clever way, or “hacked” your neural circuitry somehow through sensory input, or figured out some way it could affect the physical world from within the digital realm (e.g. generating radio waves by “thinking” in ways that send electrons in a particular pattern, or maybe exploiting principles of physics that humans aren’t aware of).

Imaginary friend: Those scenarios may sound kooky, but lots of things have already happened in this world that would seem kooky to you without the benefit of hindsight. Consider e.g. the path of causality that got us from simple replicators to humans, and from the stone age to iPhones, CERN, the International Space Station…

Me: I agree with all that. Better to be too paranoid about ways an unaligned AI can take over the world than not paranoid enough. But it’s hard to cover all the branches of things that should be considered in one conversation-path. So for the time being, let’s assume a hypothetical situation where the AI is “boxed” in. And let’s assume that we know it’s extremely capable, and that it can’t “hack” itself out of the box in some direct way (like exploiting a security flaw in the operating system). Ok?

Imaginary friend: Ok.

Me: I presume you agree that there are more and less safe ways to use a superintelligent AGI-system. To take an exaggerated example: There is a big difference between “letting it onto the internet” and “having it boxed in, only giving it multiplication questions, and only letting it answer yes or no”.

Imaginary friend: Obviously. But even if you only give it multiplication-questions, some less scrupulous team will sooner or later develop a superintelligence and use it in less careful ways.

Me: Sure. But still, we agree that there are more and less safe ways to try to use an AGI? There is a “scale” of sorts?

Imaginary friend: Of course.

Me: Would you also agree that there is a “scale” for how hard it is for an oracle/genie to “trick” you into falsely believing that it has provided you with what you want? For example, if I ask it to prove a mathematical conjecture, it is much harder for it to “pretend” to do what I want without actually doing it (compared to most things)?

Imaginary friend: Sure.

Me: What I want to talk about are ways of asking an AGI genie/​oracle for things in ways where it’s hard for it to “pretend” that it’s giving us what we want without doing it. And ways we might leverage that to eventually end up with an aligned AGI-system, while trying to keep the total risk (of all the steps we take) low.

Imaginary friend: My system 1 suspects I am about to hear some half-baked ideas.

Me: And your system 1 is not wrong. But some of them will be more crystallized than others. I do have in my head an outline of how we might get from an unaligned superintelligent AGI-system to an aligned superintelligent AGI-system, while keeping the total risk of extinction along the way at well below 50% (presuming that the procedures are executed the way I’d advise, which is far from trivial). But I acknowledge that some of it is a bit half-baked. I don’t claim to have detailed and watertight plans/arguments bringing us every step of the way from A to B.

Me: But it seems to me, based on my limited understanding, that this kind of approach might be underprioritised. And, believe me, I don’t have any kind of full overview or understanding of everything that has been explored in the AI alignment community…

Imaginary friend: I believe you.

Me: …but it seems to me that techniques and paths for getting from unaligned AGI-systems to aligned AGI-systems are an area where there is a lot of low-hanging fruit.

Me: Btw, where ideas come from is not something I keep good track of. Partly because I haven’t read/​heard everything that has been written/​said in the alignment debate. Partly because even when I hear someone describe an idea, it’s not trivial to understand the details/​nuances of what they have in mind. And partly because sometimes I’ll get an idea from someone, and then forget where it came from. But some of the people who have influenced me with their ideas about alignment are Eliezer Yudkowsky, Nick Bostrom, Paul Christiano, and Eric Drexler (not a comprehensive list). I am aware that the idea of bootstrapping alignment isn’t a new concept, but my impression is that it’s under-discussed.

Imaginary friend: It sounds like you are looking for ways to avoid having to solve the alignment problem, so that you can focus on other things instead of solving it beforehand.

Me: I guess it can be interpreted that way. I don’t want it to come across that way though! If we find reliable methods to align AGI before we have AGI, and these methods seem likely to work also for a superintelligent system, then that would improve our odds of success! But, you know, maybe we don’t, and if we don’t then it’s better to have some plans ready, so that less improvisation is necessary.

Me: And also, even if we think we have solved alignment in a reliable way prior to superintelligence, some of the techniques I’ll discuss can be used as an additional layer of security/alignment-assurance. So some of this is stuff that should be encouraged as best practice anyway. AFAIK we don’t have a fire alarm, and it’s certainly not too early to start analyzing these kinds of scenarios in a more comprehensive way.

Restrictions in expressivity of AI

Me: So the first part of this discussion will focus on things that we can do while the AGI stays in the digital realm. We will get to the question of how to do complicated things in the non-digital realm, but that’s for later.

Imaginary friend: You understand that any use of an AGI enables it to affect the non-digital realm, right? If you let it write or modify code, that is a way for the AGI to communicate with humans, and communicating with humans is a way to affect the non-digital realm.

Me: Yes, I agree with that. But I would point out that it is up to us how much “expressivity” we give the AI-system. For example, if we get help from it to point out potential mistakes/bugs in some piece of code, maybe we don’t let it write text to explain itself (limiting it instead to things such as pointing out areas of code to take an extra look at, highlighting examples of input that may produce unintended behavior, etc.). And if we do let it write text, then it is up to us how we limit the expressivity of what it writes in text.
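Me: To make that a bit more concrete, here is a toy sketch of what such a deliberately narrow output channel could look like. All the names and formats here are hypothetical, and it’s only meant to gesture at the idea:

```python
# Hypothetical sketch of a restricted output channel for an AI that reviews code.
# Instead of free-form text, the AI may only emit structured flags of these kinds.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class SuspectRegion:
    """Points at a span of code to take an extra look at (no explanation allowed)."""
    file: str
    start_line: int
    end_line: int

@dataclass(frozen=True)
class SuspectInput:
    """An example input that may produce unintended behavior (raw bytes, not prose)."""
    function_name: str
    example_input: bytes
    max_length: int = 256  # upper bound on the size of the example input

ReviewFlag = Union[SuspectRegion, SuspectInput]

def validate_flag(flag: ReviewFlag) -> bool:
    """Reject anything that falls outside the narrow, pre-agreed format."""
    if isinstance(flag, SuspectRegion):
        return flag.start_line >= 1 and flag.end_line >= flag.start_line
    if isinstance(flag, SuspectInput):
        return len(flag.example_input) <= flag.max_length
    return False
```

Me: The point is not that this particular format is safe, but that the format, and how much it lets the AI “say”, is something we choose.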

Imaginary friend: Ok, did you by the way notice that you spoke about not letting it write text—but letting it write example input? Well, example input can be text! Those kinds of things can be easy to miss, and often you may not be aware of what you’re missing. And also, any way to point out bugs or potential problems is a way for the AI to affect how you think and what you think about. Take this test of selective attention for example. Or look at how magicians focus and misdirect people’s attention to trick them. There are lots of things like this.

Me: I may miss such things sometimes, but sometimes I will also leave them out even though I’m aware of them, so as to not get into tangents all the time while I’m trying to outline main concepts. I did notice that one though. Partly thanks to you helping me to get better at noticing those sorts of things.

Imaginary friend: That’s what imaginary friends are for.

Me: I’m certainly not claiming that the techniques I’ll describe are risk-free. At the same time, it’s important to consider how different techniques can come together to reduce risk. For example, one thing that will be touched upon again and again is testing the convergence of results from different methods, architectures, and pieces of code. Preferably, if an AI-system writes the code for some module, then the overall system should ensure that the same module is also written by other AI-systems, and by humans if possible.
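Me: As a toy illustration of the kind of convergence-testing I have in mind (the interfaces here are made up for the sake of the example):

```python
# Hypothetical sketch: check whether independently produced implementations of the
# same module agree on a shared battery of test inputs. Disagreement is a red flag
# to investigate, not proof that any particular implementation is correct.
from typing import Any, Callable, Iterable

def find_disagreements(implementations: list[Callable[[Any], Any]],
                       test_inputs: Iterable[Any]) -> list[Any]:
    """Return the test inputs on which the implementations do not all agree."""
    disagreements = []
    for x in test_inputs:
        outputs = [impl(x) for impl in implementations]
        if any(out != outputs[0] for out in outputs[1:]):
            disagreements.append(x)
    return disagreements

# Usage (with implementations produced by different AI-systems and, ideally, humans):
# find_disagreements([module_from_ai_1, module_from_ai_2, module_from_human], test_suite)
```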

Me: In regards to e.g. AIs pointing out potential bugs/​issues, the system as a whole (a system of systems of systems..) would preferably keep track of the changes made due to such suggestions, and also try to keep track of what the output of the system would be if those changes hadn’t been made.

Imaginary friend: Ok. But just because something makes you safer, and you can handwave at a lot of techniques where each thing helps a bit with safety—this does not show that the system as a whole will be safe. And I am a bit worried about you proposing ideas that are hard to criticize because they’re complicated/vague, instead of getting to the “core” of the problem. Also, adding on lots of complication to increase security can in some cases be counter-productive.

Me: Some good points there. But I feel uncertain about the degree to which there is a “core” of the problem. I want this series to outline a path towards a more “complete” solution, but I want to start by focusing more on techniques and principles that “building-blocks” of the solution may make use of.

Verifying code equivalence

Me: One tool that we will come to at various points (and combine with other techniques) is the verification of equivalence in code behavior. By code behavior I mean what it takes as input, and what it gives as output (either the values themselves, or properties/claims about the values, such as them being within a certain range/“space”).

Me: But sometimes it can make sense for the proofs to also take into account run-time, properties of how the code affects the system that it runs on, etc.

Imaginary friend: What would those proofs look like?

Me: The proofs should of course be computable, and preferably it would be possible to display them in a format that makes the “logic” of them as clear as possible to humans. If they can be made to look visual in some user-friendly GUI then all the better.

Me: It could be fine for some part/​parts of the AI-system to take part in defining the format. But thinking beforehand of what the proof-format should look like and the properties it should have could lead to more clarity of thought, which is better than having to improvise after or right before the creation of AGI.

Me: And the AI-system should be used when verifying the proof-format itself, and the various “inference-steps” that are allowed. How to verify the validity of “inference-steps”—well, I don’t know the exact details, but the more independent “proofs” (each of which seems sufficient by itself to a human) the better. After all, this follows from the principle of “Why not both?”.

Me: Preferably the verification-process would include proofs that are “mathematical”, but it is also important to consider that the behavior of code is something that can be tested (while more or less staying within the digital realm). One counterexample is enough to disprove the validity of a given set of accepted inference-steps.

Me: Btw, if we can prove the equivalence of code we can use that as a building-block towards proving what is different between the behavior of two pieces of code (e.g. “these pieces of code behave the same, with the exception of input that is within the range x”).
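Me: Here is a minimal sketch of the “testing” side of this. It can only falsify equivalence within the sampled range, never prove it; the kinds of proofs discussed above would be needed for that:

```python
# Hypothetical sketch: random search for counterexamples to the claim that two
# pieces of code behave the same. One counterexample is enough to refute equivalence.
import random
from typing import Callable

def find_counterexamples(f: Callable[[int], int],
                         g: Callable[[int], int],
                         trials: int = 100_000,
                         lo: int = -10**6,
                         hi: int = 10**6) -> list[int]:
    """Sampled inputs on which f and g produce different outputs."""
    diffs = set()
    for _ in range(trials):
        x = random.randint(lo, hi)
        if f(x) != g(x):
            diffs.add(x)
    return sorted(diffs)

# An empty result is only weak evidence of equivalence on this range; a non-empty
# result gives a first, rough estimate of "the range x" on which the two pieces of
# code behave differently.
```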

Topics I’ll cover in this series

Imaginary friend: So where do you see this discussion going?

Me: I guess we’ll see. But I do have some plans.

Me: One thing I want to discuss is the concept of “solution-generators” (and maybe also “solution-generator-generators”, etc). That is to say, when asking an oracle/genie for something, it may often make sense to not ask the question directly, but to instead ask for a “solution-generator”, meaning a function that generates solutions within the domain in question. These “generators” should be optimized for being narrow/specialized/modular/verifiable (and techniques for trying to ensure that the system is faithfully optimizing the “generator” for these properties are a topic in themselves).

Me: For example, if we want to do a physics-simulation, then we might want to use a “simulation-generator” or “simulation-generator-generator”. One merit of this is that the code can be easier to inspect and understand than the code of a superintelligent AGI. Another merit is that the “generator” can be tested against existing data (and what data we want to test “generators” against is something that it is possible to plan and prepare for before we have AGI).
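Me: To gesture at what I mean by a “generator” interface, and by testing it against existing data, here is a very rough sketch. The types and names are placeholders, not a proposal:

```python
# Hypothetical sketch of a "simulation-generator": a function that, given a problem
# specification, returns a narrow simulator, which we then score against data we
# already trust (data that can be gathered and agreed upon before we have AGI).
from typing import Callable, Sequence, Tuple

SimulationGenerator = Callable[[dict], Callable[[dict], dict]]

def score_against_known_data(generator: SimulationGenerator,
                             spec: dict,
                             known_cases: Sequence[Tuple[dict, dict]]) -> float:
    """Fraction of held-out (input, expected_output) cases the generated simulator reproduces."""
    if not known_cases:
        return 0.0
    simulator = generator(spec)
    hits = sum(1 for inp, expected in known_cases if simulator(inp) == expected)
    return hits / len(known_cases)
```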

Me: But of course, you can also ask it to answer the question directly (in addition to asking for the “solution-generator”). That follows from the principle of “Why not both?”.

Imaginary friend: Isn’t it better to not ask it the specific question, so that it’s harder for it to guess what you’re after?

Me: I will mostly not be mentioning this sort of thing explicitly, but yes I agree, you shouldn’t ask both questions of the same instance of the AGI-system. The instance that is asked for A should not know that you also have asked it for B. Though of course, you could have additional instances that are asked both questions, and check if results/​output converges (that follows from the principle of “Why not both?”). Btw, I’ll brainstorm a bit later on about ways that sort of principle potentially could be generalized.
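Me: Roughly, and with the interface to the boxed system left abstract (since I don’t want to pretend I know what it would look like), the pattern is something like this:

```python
# Hypothetical sketch: ask isolated instances for A and for B separately, then check
# whether extra instances (which do see both questions) give converging answers.
from typing import Any, Callable, Tuple

def cross_check(ask: Callable[[Any, str], str],
                spawn_instance: Callable[[], Any],
                question_a: str,
                question_b: str,
                n_extra: int = 3) -> Tuple[str, str, bool]:
    """Returns the two answers, plus whether the extra instances agreed with them."""
    answer_a = ask(spawn_instance(), question_a)  # this instance never sees question B
    answer_b = ask(spawn_instance(), question_b)  # this instance never sees question A
    extras = [(ask(inst, question_a), ask(inst, question_b))
              for inst in (spawn_instance() for _ in range(n_extra))]
    agreement = all(a == answer_a and b == answer_b for a, b in extras)
    return answer_a, answer_b, agreement
```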

Me: Another topic I want to speculate a bit about is the topic of brain emulations, and also other “generators” that try to predict the behavior/thinking of humans. I’m interested in the question of “Can we bootstrap from AGI to brain emulations and/or something else that can predict human behavior/thinking, while more or less staying within the digital realm?”. Obviously there are lots of ethical considerations here, and it’s very important to avoid suffering subroutines insofar as possible!

Me: I also want to discuss techniques involving computable argument-trees/​”proofs”, where for every step of inference the inference-rule (how you get output/​conclusion from input/​arguments) is made explicit and computable. As well as outlining possible techniques to have the concepts/​propositions of such proofs represent more or less any thought that is sufficiently “clear/​crisp/​explicit” (blurring the distinction between “mathematical” proofs and any other argument about anything). Included in the discussion will be outlines of ideas for how to deal with “vagueness” and cluster-like concepts within such argument-trees/​”proofs”.
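Me: As a first pass at what a node in such an argument-tree might look like (very simplified, and with no handling of vagueness yet):

```python
# Hypothetical sketch of a computable argument-tree: every step names an explicit,
# executable inference-rule, so a simple checker can re-run the rule and confirm
# that the stated conclusion actually follows from the stated premises.
from dataclasses import dataclass, field
from typing import Callable, Sequence

@dataclass
class ArgumentNode:
    conclusion: str
    rule: Callable[[Sequence[str]], str]           # the computable inference-rule
    premises: list["ArgumentNode"] = field(default_factory=list)

    def check(self) -> bool:
        """Recursively verify the premises, then re-run the rule and compare."""
        if not all(premise.check() for premise in self.premises):
            return False
        return self.rule([premise.conclusion for premise in self.premises]) == self.conclusion

# Leaf nodes would use rules that take no premises (axioms, or accepted observations).
# Handling "vague" or cluster-like concepts would need more structure than is shown here.
```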

Me: And I’ll be outlining thoughts about capabilities that I think will help with verifying that instructions for doing things in the real world (developing new types of machines and that sort of thing) work as intended. Such as, for example, copying a strawberry at the molecular level without unintended consequences. Among other things there will be some focus on “generators” for mappings between (1) models/ontologies, (2) data-structures representing geometric structures (e.g. some sort of physics-simulation), and (3) real things in the actual world that the models are meant to refer to.

Me: There are various other things I plan to touch upon as well. Some of it having to do with verification-techniques, but also much about various other things. And maybe more will come up as I write. For a preview it is possible to take a look at a half-finished early draft for an earlier version of this article-series, which contains some of the stuff for future sections (but as mentioned, it’s half-finished).

Me: The more people there are who (1) are smart and have thought a lot about something and (2) see things differently from you, the more reason for self-doubt about your own judgment. And this is for me a significant source of uncertainty about my ideas in regards to alignment (and AI more generally). But it seems best to me to just try to describe my perspective as well as I can, and then people can do with that what seems best to them.

Imaginary friend: Talk to you later then.


Any feedback or comments (be it positive or negative or neither) would be received with interest.