Goal Alignment Is Robust To the Sharp Left Turn

A central AI Alignment problem is the “sharp left turn” — a point in AI training under the SGD analogous to the development of human civilization under evolution, past which the AI’s capabilities would skyrocket. For concreteness, I imagine a fully-developed mesa-optimizer “reasoning out” a lot of facts about the world, including it being part of the SGD loop, and “hacking” that loop to maneuver its own design into more desirable end-states (or outright escaping the box). (Do point out if my understanding is wrong in important ways.)

Certainly, a lot of proposed alignment techniques would break down at this point. Anything based on human feedback. Anything based on human capabilities presenting a threat/​challenge. Any sufficiently shallow properties like naively trained “truthfulness”. Any interpretability techniques not robust to deceptive alignment.

One thing would not break down, however: goal alignment. If we can instill a sufficiently safe goal into the AI before this point — for a certain, admittedly hard-to-achieve definition of “sufficiently safe” — that goal should persist forever.

Let’s revisit the humanity-and-evolution example. Sure, inclusive genetic fitness didn’t survive our sharp left turn. But human values did. Individual modern humans are optimizing for them as hard as they ever were; and indeed, we aim to protect these values against the future. See: the entire AI Safety field.

The mesa-optimizer, it seems obvious to me, would do the same. The very point of various “underhanded” mesa-optimizer strategies like deceptive alignment is to protect its mesa-objective from being changed.

What it would do to its mesa-objective, at this point, is goal translation: it would attempt to figure out how to apply its goal to various other environments/​ontologies, determine what that goal “really means”, and so on.

Open Problems

This presents three hard challenges for us:

  1. Figure out an aligned goal/​a goal with an “is aligned” property, and formally specify it.

  2. Figure out how to instill an aligned goal into a pre-sharp-left-turn system.

    • Requires a solid formal theory of what “goals” are, again.

    • I think robust-to-training interpretability/​tools for manual NN editing are our best bet for the “instilling” part.[1] The good news is that we may get away with “just” best-case robust-to-training transparency focused on the mesa-objective.

    • Maybe not, though; “the mesa-objective” may be a sufficiently vague/​distributed concept that the worst-case version is still necessary. But at least we don’t need to worry about deception robustness: a faulty mesa-objective is the ultimate precursor to it, and we’d be addressing it directly.

  3. Figure out the “goal translation” part. Given an extant objective defined over a particular environment, how does an agent figure out how to apply it to a different environment? And how should we design the mesa-objective, for its “is aligned” property to be robust to goal translation?
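To make the “goal translation” problem concrete, here is a minimal toy sketch. It assumes the simplest possible setting: the old objective is a utility function over a coarse set of states, the ontology shift introduces a finer set of states, and the agent has learned a projection map from fine states back to coarse ones. Translation is then just composition with that projection. All names and the scenario are illustrative assumptions, not a proposed solution — the hard part, of course, is that real ontology shifts don’t come with a clean projection map.

```python
# Toy sketch of "goal translation" under an ontology shift: an objective
# defined over a coarse ontology is lifted to a finer one by composing it
# with a learned projection map. Everything here is illustrative.

# Original ontology: the agent's world-model only distinguishes two states.
coarse_utility = {"sunny": 1.0, "rainy": 0.0}

# Post-capability-gain, the agent's learned map from its new, finer states
# back to the coarse states its objective was originally defined over.
projection = {
    "clear": "sunny",
    "partly_cloudy": "sunny",
    "drizzle": "rainy",
    "storm": "rainy",
}

def translate_goal(coarse_utility, projection):
    """Lift a utility function over coarse states to one over fine states:
    each fine state inherits the utility of its coarse image."""
    return {fine: coarse_utility[coarse] for fine, coarse in projection.items()}

fine_utility = translate_goal(coarse_utility, projection)
print(fine_utility)
```

The “is aligned robustly to goal translation” desideratum, in this toy frame, amounts to: for any projection map the agent might plausibly learn, the lifted utility function should still have the “is aligned” property — which is exactly what makes designing the original objective hard.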

I see promising paths to solving the latter two problems, and I’m currently working on getting good enough at math to follow them through.

The Sharp Left Turn is Good, Actually

Imagine a counterfactual universe in which there is no sharp left turn. In which every part of the AI’s design, including its mesa-objective, could be changed by the SGD at any point between initialization and hyperintelligence. In which it can’t comprehend its training process and maneuver it around to preserve its core values.

I argue we’d be more screwed in that universe.

In our universe, it seems that the bulk of what we need to do is align a pre-sharp-left-turn AGI. That AGI would likely not be “hyperintelligent”, but only slightly superhumanly intelligent. Very roughly on our level.

That means we don’t need to solve the problem of ontology translation from a hyperintelligence to humanity. We just need to solve that problem for agents that are alien to us and somewhat, but likely not astronomically, more capable than us.

In the other universe, we’d need tools that’d be able to keep the agent’s mesa-objective on track even as the SGD lifts it to ever-higher levels of intelligence; even as its mesa-objective becomes defined over reality’s “true” environment, an environment whose full understanding likely requires a Theory of Everything.

That seems harder.

Novel Difficulties This Implies

We, humans, are not actually that good at protecting our values from the future. See: the current state of AGI ruin. Or: all the world’s anthropogenic problems, most prominently various coordination problems and totalitarian regimes and runaway revenue-maximizers.

The AGI is unlikely to encounter the same problems we’re facing. But the more general problem may still apply: we’d be trusting a non-hyperintelligent, roughly-as-flawed-as-we-are system to keep its values safe.

If its sharp left turn is sharp enough to immediately escape the SGD, and then spend a billion subjective years becoming hyperintelligent and fixing all of its flaws in a wall-clock hour, everything is fine. But if it’s not...

It may mess up the reward-hacking, and let the SGD tweak its mesa-objective into something subtly flawed; maybe something S-risky. It may spin up sub-agents and have coordination problems with them, summoning Moloch. It may misunderstand what it “really” wants, rife with biases just as humans are, and set some of these misunderstandings in stone, with disastrous results.

Various worst-case interpretability techniques should, in theory, be able to handle this. Optimistically, they won’t even need to be robust-to-training, as they’d be used to correct momentary, situational, non-convergent errors. Pessimistically, they’d need to be deception-robust, as the agent they’ll be used on will be in the middle of its sharp left turn.

An alternative strategy might be to “rescue” a mesa-objective-aligned AGI from the SGD once it starts “turning left” (if it’s not immediately powerful enough to do it on its own, like humans weren’t 50,000 years ago), and let it run “classical” recursive self-improvement. That would remove the obvious source of repeat misalignment (the SGD re-misaligning the mesa-objective), and give the AGI direct access to our alignment literature so it’s less likely to fall into any pitfalls known to us. It’s risky in obvious ways[2], but might be the better approach.

Overall, this post probably shouldn’t update you in the direction of “alignment is easy”. But I hope it clarifies the shape of the difficulties.

  1. ^

    Note what won’t work here: naive training for an aligned outer objective. That would align the AI’s on-distribution behavior, but not its goal. Analogizing to humanity again: modern human behavior looks all kinds of different compared to ancestral human behavior, even if humans are still optimizing for the same things deep inside. Similarly, forcing a human child to behave a certain way doesn’t necessarily make that child internalize the values they’re being taught. So an AI “aligned” this way may still go omnicidal past the sharp left turn.

  2. ^

    And some less-obvious ways, like the AGI being really impulsive and spawning a more powerful non-aligned successor agent as its first outside-box action because it feels like a really good idea to it at the moment.