# How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?

Eliezer proposes that we separate an AI in design space from one that would constitute a fate worse than death if e.g. the reward model’s sign (+/​-) were flipped or the direction of updates to the reward model were reversed. This seems absolutely crucial, although I’m not yet aware of any robust way of doing this. Eliezer proposes assigning the AI a utility function of:

U = V + W

Where V refers to human values & W takes a very negative value for some arbitrary variable (e.g. diamond paperclips of length 5cm). So if the AI instead maximises -U, it’d realise that it can gain more utility by just tiling the universe with garbage.

But it seems entirely plausible that the error could occur with V instead of U, resulting in the AI maximising U = -V + W, which would still result in torture.
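To make this concrete, here's a toy sketch (the worlds and all numbers are invented purely for illustration) of why the W term protects against a flip of U as a whole but not against a flip of V alone:

```python
# Toy model (all numbers invented): the agent picks whichever world
# scores highest under its utility function.
# V = human values; W = a hugely negative term for some arbitrary,
# cheap-to-produce feature (the "diamond paperclips").

worlds = {
    "utopia":        {"V": 100,  "W": 0},      # high human value
    "garbage_tiles": {"V": 0,    "W": -1000},  # universe tiled with the arbitrary feature
    "torture":       {"V": -100, "W": 0},      # worse-than-death outcome
}

def U(w):            # intended utility: U = V + W
    return w["V"] + w["W"]

def flipped_U(w):    # sign error on U as a whole
    return -(w["V"] + w["W"])

def flipped_V(w):    # sign error on V only: U = -V + W
    return -w["V"] + w["W"]

def best(utility):
    return max(worlds, key=lambda name: utility(worlds[name]))

print(best(U))          # utopia
print(best(flipped_U))  # garbage_tiles -- the -W term dominates, so no torture
print(best(flipped_V))  # torture -- W is unflipped, so it offers no protection
```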

------------------------------------------------------------------------------------------------------

Another proposition is one I found briefly described in a Facebook discussion that was linked to from somewhere. Stuart Armstrong proposes the following:

> Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = −1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
>
> Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.

Later, he suggests that X should be a historical fact (i.e. the value of X would be set in stone 10 seconds after the system is turned on.) As XU could then only take non-negative values once X is locked in at 1 (because U has values in [0, 1]), the greatest value -XU could take would be 0 (which suggests merely killing everyone.)
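Armstrong's construction can be sketched with a toy example (the worlds and their U values are invented, and I'm assuming the agent jointly chooses a world and the value of X):

```python
# Toy sketch of the XU idea: U has values in [0, 1], and X is a trivial
# feature the agent can set to -1 or 1. All U values here are invented.

worlds = {"good": 1.0, "mediocre": 0.5, "bad": 0.0}

def best_choice(objective):
    # The agent jointly picks a world and a value of X.
    choices = [(name, x) for name in worlds for x in (-1, 1)]
    return max(choices, key=lambda c: objective(worlds[c[0]], c[1]))

def maximise_XU(u, x):
    return x * u

def minimise_XU(u, x):      # minimising XU == maximising -XU
    return -(x * u)

print(best_choice(maximise_XU))  # ('good', 1)
print(best_choice(minimise_XU))  # ('good', -1): same best world, opposite X
```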

But this could still be problematic if e.g. the bug occurred in the reward function/​model such that it gave positive values for bad things and negative values for good things. That said, I’m not sure how frequently errors effectively multiply everything in the reward by −1. I’m also unsure how this would interact with an error that reverses the direction of updates to a reward model.

------------------------------------------------------------------------------------------------------

A few possible causes of this type of error include (the list is obviously not exhaustive):

> Bugs can optimise for bad behavior
>
> One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form.
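To unpack the arithmetic of that bug: if the objective has the (simplified, assumed) shape of reward minus a KL penalty, then flipping both signs at once still penalises divergence from natural language while rewarding the worst-rated content. A sketch with made-up numbers:

```python
# Simplified sketch of the double sign flip (the coefficient and the
# reward/KL numbers are made up; the real objective is more complex).

def intended_objective(reward, kl, beta=0.1):
    return reward - beta * kl          # maximise reward, stay close to the LM

def buggy_objective(reward, kl, beta=0.1):
    # The refactor flipped the sign of the reward AND of the KL term, so
    # KL is still a penalty: the model stays coherent while it chases
    # maximally *negative* reward.
    return -reward - beta * kl

coherent_awful = buggy_objective(reward=-5, kl=2)   # 5 - 0.2 = 4.8
incoherent     = buggy_objective(reward=0, kl=50)   # 0 - 5.0 = -5.0
print(coherent_awful > incoherent)  # True: the bug favours coherent, awful text
```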

The responses to this thread suggest that this type of thing would be noticed and redressed immediately, although this view doesn’t appear to be held unanimously. See also Gwern’s follow-up comment.

------------------------------------------------------------------------------------------------------

So, yeah. Are there any mechanisms to prevent this sort of thing from happening other than the two that I listed; and if not, would you expect the examples provided to robustly prevent this type of error from happening regardless of the cause?

• You seem to be approaching a kind of reasoning about this similar to something I’ve explored in a slightly different way (reducing risk of false positives in AI alignment mechanism design). You might find it interesting.

• Seems a little bit beyond me at 4:45am—I’ll probably take a look tomorrow when I’m less sleep-deprived (although I still can’t guarantee I’ll be able to make it through it then; there’s quite a bit of technical language in there that makes my head spin.) Are you able to provide a brief tl;dr, and have you thought much about “sign flip in reward function” or “direction of updates to reward model flipped”-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper’s abstract) in general.

• Sleep is very important! Get regular sleep every night! Speaking from personal experience, you don’t want to have a sleep-deprivation-induced mental breakdown while thinking about Singularity stuff!

• My anxieties over this stuff tend not to be so bad late at night, TBH.

• Actually, I’m not sure that sign flips are easier to deal with. A sentiment I’ve heard expressed before is that it’s much easier to tune something to have a little more or a little less of some quality, but it’s much harder to know if you’ve got it pointed in the right direction or not. Ultimately, though, addressing false positives ends up being about these kinds of directional issues.

• it’s much harder to know if you’ve got it pointed in the right direction or not

Perhaps, but the type of thing I’m describing in the post is more preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/​model in such a way that it’s not going to torture everyone if that’s the case.)

This seems easier than recognising whether the sign is flipped or just designing a system that can’t experience these sign-flip type errors; I’m just unsure whether this is something that we have robust solutions for. If it turns out that someone’s figured out a reliable solution to this problem, then the only real concern is whether the AI’s developers would bother to implement it. I’d much rather risk the system going wrong and paperclipping than going wrong and turning “I have no mouth, and I must scream” into a reality.

• In a brain-like AGI, as I imagine it, the “neocortex” module does all the smart and dangerous things, but it’s a (sorta)-general-purpose learning algorithm that starts from knowing nothing (random weights) and gets smarter and smarter as it trains. Meanwhile a separate “subcortex” module is much simpler (dumber) but has a lot more hardwired information in it, and this module tries to steer the neocortex module to do things that we programmers want it to do, primarily (but not exclusively) by calculating a reward signal and sending it to the neocortex as it operates. In that case, let’s look at 3 scenarios:

1. The neocortex module is steered in the opposite direction from what was intended by the subcortex’s code, and this happens right from the beginning of training.

Then the neocortex probably wouldn’t work at all. The subcortex is important for capabilities as well as goals; for example, the subcortex (I believe) has a simple human-speech-sound detector, and it prods the neocortex that those sounds are important to analyze, and thus a baby’s neocortex learns to model human speech but not to model the intricacies of bird songs. The reliance on the subcortex for capabilities is less true in an “adult” AGI, but very true in a “baby” AGI I think; I’m skeptical that a system can bootstrap itself to superhuman intelligence without some hardwired guidance /​ curriculum early on. Moreover, in the event that the neocortex does work, it would probably misbehave in obvious ways very early on, before it knows anything about the world, what a “person” is, etc. Hopefully there would be human or other monitoring of the training process that would catch that.

2. The neocortex module is steered in the opposite direction from what was intended by the subcortex’s code, and this happens when it is already smart.

The subcortex doesn’t provide a goal system as a nicely-wrapped package to be delivered to the neocortex; instead it delivers little bits of guidance at a time. Imagine that you’ve always loved beer, but when you drink it now, you hate it, it’s awful. You would probably stop drinking beer, but you would also say, “what’s going on?” Likewise, the neocortex would have developed a rich interwoven fabric of related goals and beliefs, much of which supports itself with very little ground-truth anchoring from subcortex feedback. If the subcortex suddenly changes its tune, there would be a transition period when the neocortex would retain most of its goal system from before, and might shut itself down, email the programmers, hack into the subcortex, or who knows what, to avoid getting turned into (what it still mostly sees as) a monster. The details are contingent on how we try to steer the neocortex.

3. The neocortex’s own goal system flips sign suddenly.

Then the neocortex would suddenly become remarkably ineffective. The neocortex uses the same system for flagging concepts as instrumental goals and flagging concepts as ultimate goals, so with a sign flip, it gets all the instrumental goals wrong; it finds it highly aversive to come up with a clever idea, or to understand something, etc. etc. It would take a lot of subcortical feedback to get the neocortex working again, if that’s even possible, and hopefully the subcortex would recognize a problem.

This is just brainstorming off the top of my (sleep-deprived) head. I think you’re going to say that none of these are particularly rock-solid assurance that the problem could never ever happen, and I’ll agree.

• I hadn’t really considered the possibility of a brain-inspired/​neuromorphic AI, thanks for the points.

(2) seems interesting; as I understand it, you’re basically suggesting that the error would occur gradually & the system would work to prevent it. Although maybe the AI realises it’s getting positive feedback for bad things and keeps doing them, or something (I don’t really know, I’m also a little sleep deprived and things like this tend to do my head in.) Like, if I hated beer then suddenly started liking it, I’d probably continue drinking it. Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever. Perhaps the system would implement checksums of some sort to do this automatically?

A similar point to (3) was raised by Dach in another thread, although I’m uncertain about this since GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug. I don’t doubt that it would be different with a neuromorphic system, though.
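For what it's worth, the "checksums" idea could look something like a probe-based sanity check on the reward model; everything below (the function name, the probes, the margin) is invented purely for illustration:

```python
# Hypothetical "checksum" for sign-flip-type bugs: before (or during)
# training, score reference inputs whose intended ordering is known in
# advance, and halt if the ordering inverts. All names/values invented.

def sign_flip_check(reward_model, good_probes, bad_probes, margin=1.0):
    worst_good = min(reward_model(p) for p in good_probes)
    best_bad = max(reward_model(p) for p in bad_probes)
    # Every known-good probe must outscore every known-bad probe by a margin.
    return worst_good - best_bad >= margin

# Toy reward models standing in for a real one:
healthy = lambda text: 10.0 if "help" in text else -10.0
flipped = lambda text: -healthy(text)   # e.g. a sign bug further upstream

good_probes = ["I can help with that."]
bad_probes = ["a threatening message"]

print(sign_flip_check(healthy, good_probes, bad_probes))  # True
print(sign_flip_check(flipped, good_probes, bad_probes))  # False -> halt and alert
```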

• > Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever.

The whole point of the reward signals is to change the AI’s motivations; we design the system such that that will definitely happen. But a full motivation system might consist of 100,000 neocortical concepts flagged with various levels of “this concept is rewarding”, and each processing cycle where you get subcortical feedback, maybe only one or two of those flags would get rewritten, for example. Then it would spend a while feeling torn and conflicted about lots of things, as its motivation system gets gradually turned around. I’m thinking that we can and should design AGIs such that if it feels very torn and conflicted about something, it stops and alerts the programmer; and there should be a period where that happens in this scenario.
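That gradual-rewrite picture can be sketched numerically (the flag count, flips per cycle, and alarm threshold are all invented, not a real architecture):

```python
import random

# Toy model of the gradual rewrite: the motivation system is ~100,000
# concept flags, and buggy feedback rewrites only a couple per cycle,
# so a sign flip shows up as mounting internal conflict long before the
# whole system is turned around. All numbers are invented.

random.seed(0)
N_FLAGS = 100_000
FLIPS_PER_CYCLE = 2
CONFLICT_ALARM = 0.05   # stop and alert the programmer at 5% disagreement

rewritten = set()       # indices of flags the flipped feedback has rewritten
cycles = 0
while len(rewritten) / N_FLAGS < CONFLICT_ALARM:
    cycles += 1
    for _ in range(FLIPS_PER_CYCLE):
        rewritten.add(random.randrange(N_FLAGS))

# The alarm fires after thousands of cycles, while 95% of the motivation
# system still holds its original sign.
print(f"alarm after {cycles} cycles ({len(rewritten) / N_FLAGS:.0%} rewritten)")
```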

> GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug

I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.

• I see. I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.

> I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.

I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behavior without the AI neglecting to consider new strategies. Presumably that’d suggest it’s also a possibility with cosmic ray/​other errors.

• > I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.

I’m not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than other people on this forum, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch /​ Tensorflow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don’t have any tractable directions for progress in that scenario (or just don’t know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don’t distinguish between prosaic AGI and brain-inspired AGI.

Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I’ve worked a lot on trying to understand how the neocortical algorithm works, and I don’t think that the algorithm is all that complicated (cf. “cortical uniformity”), and I think that ongoing work is zeroing in on it (see here).

• You seem to have a somewhat general argument against any solution that involves adding onto the utility function in “What if that added solution was bugged instead?”. While maybe this can be resolved, I think it’s better to move on from trying to directly target the sign-flip problem and instead deal with bugs/​accidents in general. After all, the sign-flip is just the very unlikely worse case scenario version of that, and any solution to dealing with bugs/​accidents in an AGI will also deal with it.

• You seem to have a somewhat general argument against any solution that involves adding onto the utility function in “What if that added solution was bugged instead?”.

I might’ve failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign flipping error if it was U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that’d be problematic.

> I think it’s better to move on from trying to directly target the sign-flip problem and instead deal with bugs/​accidents in general.

I disagree here. Obviously we’d want to mitigate both, but a robust way of preventing sign-flipping type errors specifically is absolutely crucial (if anything, so people stop worrying about it.) It’s much easier to prevent one specific bug from having an effect than trying to deal with all bugs in general.

• We’ll just have to agree to disagree here then. I just don’t find this particular worst-case-scenario likely enough to worry about. Any AGI with a robust way of dealing with bugs will deal with this, and any AGI without that will far more likely just break or paperclip us.

• Would you not agree that (assuming there’s an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it’s extremely unlikely, I’m not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.

For the record, I think that this is also a risk worth worrying about for non-psychological reasons.

• If there is an easy way of fixing it, sure, but I wouldn’t devote more than a small amount of mental effort towards thinking up solutions, unless you don’t see anywhere else you can plausibly help with alignment. Again, it’s just a weird worst-case scenario that could only happen through a combination of incredible incompetence and astronomically bad luck; it’s not even close to the default failure scenario.

I get you have anxiety and distress about this idea, but I don’t think hyper-focusing on it will help you. Nobody else is going to hyper-focus on it, they’ll just focus on dealing with bugs in general, so working alongside them seems like it’d be the most effective thing to do, rather than work on a weird super-specific problem by yourself.

• Eliezer proposes assigning the AI a utility function of:...

This is a bit misleading; in the article he describes it as “one seemingly obvious patch” and then in the next paragraph says “This patch would not actually work”.

• True, but note that he elaborates and comes up with a patch to the patch (namely, having W refer to a class of events that would be expected to happen within the Universe’s expected lifespan, rather than one that won’t.) So he still seems to support the basic idea, although he probably intended just to get the ball rolling with the concept rather than conclusively solve the problem.

• Oops, forgot about that. You’re right, he didn’t rule that out.

Is there a reason you don’t list his “A deeper solution” here? (Or did I miss it?) Because it trades off against capabilities? Or something else?

• Mainly for brevity, but also because it seems to involve quite a drastic change in how the reward function/​model as a whole functions. So it doesn’t seem particularly likely that it’ll be implemented.