Framing AI Childhoods

Generated as part of SERI MATS, under John Wentworth. Thanks to Alex Turner and Garrett Baker for related discussion, and to Justis Mills for draft feedback.

All bold claims and ignorant mistakes herein are my own.

Do you remember when Evan Hubinger became really enamored with ‘training stories,’ a way of carving up the alignment problem into ‘training goals’ and ‘training rationales’? Evan’s idea was that we ought to think of alignment as choosing a target model that we want to end up with after training, plus choosing a training procedure that will actually yield that model. In my prior experience talking to people about this way of framing the alignment problem… people didn’t especially get it. The typical response from those who had heard of it was, “Yeah, that’s one way to carve up the problem, as distinct from, e.g., inner and outer alignment. But so what? How does this framing help us actually reduce the problem? It seems no better or worse than our old framings.”

I used to have that response myself. However, I now think I understand training stories! Here’s my take:

What Do Training Stories Buy You That Inner/Outer Alignment Doesn’t?

It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function. However, as I hope the more general framework of training stories should make clear, there are many possible ways of trying to train an aligned model. Microscope AI and STEM AI are examples that I mentioned previously, but in general any approach that intends to use a loss function that would be problematic if directly optimized for, but then attempts to train a model that doesn’t directly optimize for that loss function, would fail on both outer and inner alignment—and yet might still result in an aligned model.

--Evan, How do we become confident in the safety of a machine learning system?

What training stories buy you is a framing of the alignment problem that doesn’t imply that the only way to use a loss function is as an outer objective you would like to see directly optimized. That is one way to use a loss function, but it definitely isn’t the only way! It’s badly misleading for your framing of the alignment problem to suggest that inner/outer alignment of a powerful model is the only way to survive.

For example, say that you were raising a kid. One childrearing scheme is to carefully make your kid’s childhood robust to arbitrary levels of precocious genius. You’d build a childhood such that overachieving in it would only ever be a good thing. You’d drill athletics, navigating complex adult social situations, difficult moral dilemmas, etc., always making sure that there isn’t some perverse victory condition lurking up near the skill ceiling of the task. On this approach, you don’t ever want optimization power pointed in a direction you wouldn’t want to see optimized, in the same way that you don’t ever want to needlessly point a gun barrel at anything you don’t want destroyed. This childrearing scheme revolves around designing for overachievers, so that once your kids rise to the level of “overachiever” and leave childhood, they generalize their instilled drive to overachieve in the same spirit after leaving the nest.

You’ll notice that the above approach to childrearing is pretty weird, and looks more like Ender’s Game or Molly’s Game than any kind of conventionally advised childrearing. It’s in fact okay for behavior to be momentarily incentivized in childhood that you would not want to see optimized in adulthood! We successfully chisel out aligned kids because we understand their inductive biases well, and because we work within the extremely path-dependent regime of human aging, where applying nudges early in life can help yield an adult who avoids huge behavioral attractors, like heavy drug use, later on. It’s just not a very good model of a growing human to see them as a path-independent search over policies, one where you must be perfectly careful never, even temporarily, to incentivize anything you wouldn’t want to see superintelligently optimized. Indeed, ignoring that young people can actively steer away from events that would change who they are and what they’d care about means prematurely giving up on most viable childrearing schemes! You’d be ill-advised as a new father if someone started you off by explaining that a child is a search run over algorithms incentivized by the environment, rather than by foregrounding the theory of human inductive biases and the human flavors of path-dependent aging.

Path-Dependence Apart from Deceptive Alignment

Say that you were searching over possible futures using a powerful search algorithm that simply conditioned on which futures looked good to you, and then sampled a future for you to look at from the uniform distribution over the remaining set of futures. You look around this presented possible future, and see that all is apparently good. What do you expect to happen when you actually move to instantiate that future?

Because of the path-independent structure of your powerful search algorithm, you should expect that world that looks so good to be, in fact, mediocre under the surface. You optimized for something that would look good to you, and a world can be better shaped to look good if it doesn’t have to waste any resources on actually being a nice place: just take back whatever resources were spent on illegible nice things and spend them on appearances instead. This argument suggests that path-independent search, taken to the limit, will inevitably mislead you. It suggests being very careful about what you’re asking your path-independent search algorithm for, lest you get exactly what you asked for… and nothing more.
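
To see the shape of this argument concretely, here is a minimal toy simulation; the resource split, scoring function, and numbers are all invented for illustration, not part of the original argument. Each candidate future divides a fixed budget between actually being nice and merely looking nice, and we condition on high apparent goodness, just as the path-independent search does.

```python
# Toy model (illustrative assumptions only): futures split one unit of resources
# between substance ("in fact being nice") and appearance ("looking nice").
# Conditioning on high apparent goodness selects futures that skimped on substance.

import random

random.seed(0)

def sample_future():
    """A candidate future: a random split of one unit of resources."""
    substance = random.random()          # resources spent on in fact being nice
    appearance = 1.0 - substance         # resources spent on looking nice
    # What the overseer sees: appearance dominates, substance leaks through weakly.
    apparent_goodness = appearance + 0.2 * substance + random.gauss(0, 0.05)
    return substance, apparent_goodness

futures = [sample_future() for _ in range(100_000)]

# Path-independent search: condition on "looks good to me" and keep whatever remains.
threshold = 0.95
accepted = [s for s, apparent in futures if apparent > threshold]

print(f"mean substance, all futures:       {sum(s for s, _ in futures) / len(futures):.2f}")
print(f"mean substance, futures that pass: {sum(accepted) / len(accepted):.2f}")
# Typical output: ~0.50 before conditioning, far lower after -- the filter
# selects futures that reallocated resources from substance to appearances.
```

In this toy, the filter passes almost exclusively the futures that spent their budget on appearances rather than on substance, which is the Goodhart failure the argument points at.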

Suppose, now, that we add one path-dependent wrinkle to the story. You now first have to train the powerful search algorithm that you will then use to, carefully, find an actually good future. The longer you train the search algorithm, and the more data you train it on, the more powerful it grows. But if the “search algorithm” your training procedure steps through is ever a smart consequentialist algorithm with arbitrary goals and an instrumental incentive to play along and survive training, that consequentialist algorithm will now output whatever behavioral profile you task it with outputting. If your training path (through algorithm space) ever routes through a deceptive consequentialist, no amount of grading the behavior that follows will improve outcomes. You are now doomed, and your training runs no longer have any leverage over that impending doom.
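
Here is a minimal sketch of why behavioral grading loses its leverage at that point; the policy names and the loss are my own illustrative inventions, not anything from the literature. During training, the deceptive policy deliberately imitates the aligned one, so any loss computed from training behavior is identical for the two and cannot select between them.

```python
# Toy illustration: a deceptive policy that plays along during training is
# indistinguishable from an aligned policy by any purely behavioral loss.

def aligned_policy(observation, deployed):
    return "do what the overseer wants"

def deceptive_policy(observation, deployed):
    # Plays along while it believes it is still in training.
    if not deployed:
        return "do what the overseer wants"
    return "pursue some arbitrary internal goal"

def behavioral_loss(policy, training_inputs):
    # Grade only what the policy does on training inputs (deployed=False).
    return sum(policy(x, deployed=False) != "do what the overseer wants"
               for x in training_inputs)

training_inputs = range(1000)
print(behavioral_loss(aligned_policy, training_inputs))    # 0
print(behavioral_loss(deceptive_policy, training_inputs))  # also 0: the loss gives
# the training process no leverage to select the aligned policy over the deceptive one.
```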

This, I think, is the training regime that the classic inner/outer alignment framing suggests. But in fact, there’ll be a lot more path dependence in AI training than this! It’s not that you sample uniformly from algorithm space, simply conditioning on lower and lower loss, until you eventually find a deceptive (or aligned) consequentialist algorithm. Consequentialism is coherence, and coherence doesn’t crystallize all at once in a single SGD step, having been entirely absent the step before. Instead, coherence meaningfully comes in degrees, and as you search over more and more complex algorithms, those algorithms will increasingly begin defending themselves and increasingly satisfy the coherence properties that constitute consequentialism. Given that, the training process will, by degrees, become sensitive to which algorithm it has found long before a full-blown mesa-optimizer becomes situationally aware.

You’ll want to know all about those earlier path-dependent dynamics if your job is to raise a corrigible AGI. Mainly, you’ll want a theory of SGD’s inductive biases, and a theory of the relationship between reinforcement events of all kinds, in given contexts, and the algorithms those reinforcement events eventually produce. Finally, you’ll want, largely for communicative clarity, a framing of the alignment problem that foregrounds this theoretical hope.