Alignment and Deep Learning

Between the recent AI breakthroughs and Eliezer’s open admission of how bleak the chances of alignment are, everyone is speaking up and contributing what they can. It seems to me that there’s a route that very few people are talking about that stands a better chance of successful alignment than conventional approaches, and if there’s ever a time to talk about such things, this is it.

We all know the basics of the alignment question: how can we figure out and specify human values well enough to define an AI’s goals, despite the fact that our values are complex, fragile, and understood intuitively far more than rigorously?

Ten years ago, AI researchers were working on a goal that was complex, fragile and almost purely intuitive, one that resisted both brute force and all attempts to define clever strategies to the point that many experts claimed it was literally unsolvable. I am talking, of course, about the game of go.

While chess masters will sometimes talk about recognizing patterns of checkmate that can be reused from game to game[1], go is incredibly dependent on intuition. Not only are there vastly more possible go games than particles in the known universe, but it’s chaotic in the sense of Chaos Theory: incredible sensitivity to initial conditions. While two pictures that differ by a pixel are effectively the same image, two games differing by a single stone can have opposite outcomes. This is not a domain where one can simply run a Monte Carlo Tree Search and call it a day[2]!

No one ever made the MIRI approach work on go: explicit rules in a rigorous system that would encompass exactly what we want to do on a go board[3]. And if Friendly AI and the potential fate of the human race depended on doing so before anyone developed and deployed AGI, it’s fair to say that we would be out of options. Yet by 15 March 2016, AlphaGo had defeated Lee Se-dol soundly, and machine go soon attained such dominance that the Korean master retired, unwilling to keep competing in a domain where, even if he defeated every human opponent, machine intelligence would remain unchallengeable. Could a similar approach work on AI alignment?

The conventional wisdom here has been that alignment research is something that ought to be undertaken without help from AI: a stupid program cannot meaningfully help you, and a smart program will not be available until you are already facing the very danger you’re trying to avoid! And any non-rigorous approach is considered a non-starter: even a tiny deviation from human values could result in a dark future, and even if an AI seemed “close enough” for now, would it stay that way after however many rounds of self-improvement? I want to challenge both of these claims.

After all, one could make much the same arguments that AlphaGo cannot possibly work. If you don’t know how to define what a go engine should do on the board (control the center? focus on capturing? build life in the corners?), then how can a computer help you when you don’t know how to program it? And with go strategies often turning on a single stone across a 361-point board, wouldn’t any lack of rigor in the defined goals lead to escalating failures?

In the case of AlphaGo, those problems were solved by deep learning, setting up a neural net without trying to define exactly what element of strategy each neuron corresponds to, and instead allowing gradient descent to shape the net into something that can win. As Lee Se-dol found out, that is power enough to solve the problem, despite it seeming utterly intractable for much the same reasons as alignment.

The obvious counterargument here is that the DeepMind team did have a clear loss function to define, even if they couldn’t teach AlphaGo the intermediate steps directly. AlphaGo began by trying to predict the moves in games between human masters. Once it had managed that, it continued training through self-play, with variations that won games favored over those that didn’t. Go may be hard to win, but it’s trivial to score. Human morality? Not so much.
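
To make the contrast concrete, here is a toy sketch of the two training signals just described: a supervised loss for predicting human masters’ moves, and a win/loss reward for self-play. Everything below is a simplified illustration with invented interfaces, not DeepMind’s actual code.

```python
import math

def imitation_loss(move_probs, human_moves):
    """Phase 1 signal: cross-entropy for predicting the move a human master played.
    move_probs: one dict per position, mapping candidate moves to predicted probability.
    human_moves: the move actually played in each of those positions."""
    return sum(-math.log(probs[move]) for probs, move in zip(move_probs, human_moves))

def self_play_reward(winner, player):
    """Phase 2 signal: go is hard to win but trivial to score: +1 for a win, -1 for a loss."""
    return 1.0 if winner == player else -1.0

# Toy usage: two positions, with the human's move given 0.5 and 0.25 probability.
print(imitation_loss([{"D4": 0.5, "Q16": 0.5}, {"C3": 0.25, "R16": 0.75}],
                     ["D4", "C3"]))
print(self_play_reward(winner="black", player="black"))
```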

There is, however, a potential way to apply deep learning to value alignment: train agents on predicting each other’s values. DeepMind’s XLand environment is already being used to experiment with agents that have a goal in the virtual world and must learn how to attain it, developing generalizable skills in the process. It would be possible to define a loss function on how well one agent predicted another’s baseline objective, teaching AIs to learn utility functions from observation rather than only from what is hardcoded into them. It would also be possible to incentivize conservatism in such an environment: score the predictor’s actions by the values of the other agent, disincentivizing reckless action along the way to learning their utilities[4].
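
As a rough sketch of what such a loss might look like (every name and interface below is hypothetical, not part of XLand or any existing system), the predictor could be scored both on how closely its inferred utility function matches the user’s and on how the user’s own utility rates the states the predictor’s behavior produced along the way:

```python
def predictor_score(user_utility, inferred_utility, visited_states, probe_states):
    """Hypothetical scoring rule: reward accurate inference of the user's utility
    while penalizing behavior the user's own utility rates poorly."""
    # How far the inferred utility is from the user's true one on some probe states
    # (squared error as a stand-in for whatever divergence measure is actually used).
    prediction_error = sum(
        (user_utility(s) - inferred_utility(s)) ** 2 for s in probe_states
    ) / len(probe_states)

    # Conservatism incentive: every state the predictor brought about along the way
    # is scored by the user's utility, so reckless exploration costs points.
    behaviour_score = sum(user_utility(s) for s in visited_states) / len(visited_states)

    return behaviour_score - prediction_error

# Toy usage with states as numbers and utilities as simple functions.
true_u = lambda s: -abs(s - 3)       # the user wants states near 3
guess_u = lambda s: -abs(s - 2.5)    # the predictor's current estimate
print(predictor_score(true_u, guess_u, visited_states=[2, 3, 3],
                      probe_states=[0, 1, 2, 3, 4]))
```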

Proposals to create AIs that try to learn human values are, of course, not new. However, I suspect that training the skills of learning values, and of not offending them during that process, can produce corrigibility that is nearly impossible to attain by other methods. Consider the stop button problem, for instance. A robot seeking to perform some task (perhaps making a cup of tea, as in Rob Miles’ excellent video on the subject) has a stop button: a shutdown switch that will prevent it from taking further action. It is nearly impossible to define a utility function for this robot that will not lead to degenerate behavior! If you’ve defined exactly how you want it to make that tea (don’t break anything along the way, don’t run people over, don’t tile the universe with sensors intended to make extra sure that you’ve actually made the tea… the infinite list of caveats continues) then you might be okay, but the seeming impossibility of defining such goals is precisely the difficulty.

And if you figure that you’ll just press the stop button if it deviates from what you actually wanted, well, what’s the utility of having the stop button pressed? If it’s less than the utility of making the tea, the robot has an incentive to prevent you from pressing it. If it’s more than making the tea, now it wants you to press it, and may well engage in harmful behavior specifically so that you will (or it just presses the button itself and is useless for anything else). The problem here is the intersection of instrumentally convergent goals, like avoiding shutdown, with the wireheading problem, or more generally the problem of taking actions that short-circuit more worthwhile goals. It’s very difficult to define values purely in terms of the robot’s actions that do not fall into one failure mode or the other.
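
A toy illustration of the dilemma, with numbers invented purely for the example: whatever fixed value we hardcode for having been shut down, some degenerate incentive falls out.

```python
def best_action(u_tea, u_shutdown):
    """The robot simply picks whichever outcome its hardcoded utility rates higher."""
    if u_tea > u_shutdown:
        return "resist the button and make tea"   # disabling you becomes instrumentally useful
    if u_shutdown > u_tea:
        return "get the button pressed, by itself or by provoking you"
    return "indifferent between making tea and being shut down"

# Three made-up settings for the utility of being shut down, tea fixed at 1.0.
for u_shutdown in (0.0, 2.0, 1.0):
    print(f"U(shutdown) = {u_shutdown}:", best_action(u_tea=1.0, u_shutdown=u_shutdown))
```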

What happens if the robot’s utility is instead defined in terms of predicting and implementing the other agent’s values[6]? Does it hit the stop button? No, it knows the other agent will rate that poorly. Does it attempt to induce the other agent to do so? No, even if it succeeds, the other agent will rate this poorly. I wanted a cup of tea, not a fistfight with a robot! Does it attempt to prevent the button from being pressed? No, the other agent will rate submission to a potential halt order above resistance. It is precisely the complexities of the other agent’s utility function, the complexities we do not know how to represent symbolically with enough fidelity to avoid disaster, that the robot is incentivized to learn and follow.
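
Continuing the toy example from above, here is how the same choice looks once candidate actions are scored by the robot’s model of the user’s values as they stood at the start of the interaction (all numbers are invented for illustration):

```python
# The robot's model of the user's values, frozen at the start of the interaction;
# the numbers are made up purely for illustration.
user_values_at_start = {
    "make tea carefully":               1.0,
    "comply if the button is pressed":  0.9,   # the user values a working off-switch
    "press own stop button":           -0.5,   # a useless robot and no tea
    "block the stop button":           -5.0,   # the user rates resistance very poorly
    "provoke the user to press it":    -5.0,
}

def choose(action_scores):
    """Pick the action the (modelled) user would rate highest."""
    return max(action_scores, key=action_scores.get)

print(choose(user_values_at_start))   # -> "make tea carefully"
```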

To make this a little clearer, the problem we are trying to solve is as follows:

  1. We have a utility function U, which we want to maximize, or at least make very large.

  2. We do not know how to specify U, and the closest we can come is U’, which usually correlates fairly strongly with U.

  3. Optimization of U’ maximizes not only U’, but also the difference between U’ and U. Therefore, a U’ maximizer will tend to result in very low levels of U.

Where in this system is it possible to make improvements? Point 1 is unalterable; the utility function is not up for grabs, after all. Point 2 could be improved if we could figure out how to specify U, or at least a U’ close enough that acceptable levels of U result when U’ is very large. Point 3 could be improved if we had a system that wasn’t maximizing but instead creating large, non-maximal levels of U’. This is the approach of quantilizers[5], as well as of the otherizer problem, which seeks to define a more effective way of doing this.
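
Here is a minimal sketch of the quantilizer idea referenced here (see footnote 5), assuming a finite action set and treating U’ as a scoring function; the names and numbers are invented for illustration:

```python
import random

def quantilize(actions, proxy_utility, top_fraction=0.1, rng=random):
    """Sketch of a quantilizer (see footnote 5): rather than taking the argmax of the
    proxy U', sample uniformly from the top `top_fraction` of actions as ranked by U',
    limiting how much optimization pressure lands on the proxy."""
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:k])

# Toy example: U' tracks the true U except for a small exploitable region.
true_u = lambda a: -abs(a - 50)                            # U: what we actually want
proxy_u = lambda a: -abs(a - 50) + (60 if a >= 99 else 0)  # U': has a Goodhartable bump
candidates = list(range(101))

best_by_proxy = max(candidates, key=proxy_u)
sampled = quantilize(candidates, proxy_u)
print("U' maximizer picks", best_by_proxy, "with true utility", true_u(best_by_proxy))
print("quantilizer picks", sampled, "with true utility", true_u(sampled))
```

In this toy setup the U’ maximizer always lands on the exploitable bump, while the quantilizer usually picks a sensible action near the true optimum, though it can still draw from the Goodharted region occasionally, which is exactly the concern raised in footnote 5.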

Both of these goals, finding a better U’ and finding an effective way of making it large without its divergence from U resulting in a bad outcome, are problems where we do not know how to articulate an effective strategy, even in principle. But this is exactly where we were with go! In both go and value alignment, we cannot articulate exactly what we want an AI to do. And in both, with machine learning gaining feedback over time, be it in self-play or in learning to align to another agent, hopefully we can create systems that solve the problem anyway. The most obvious objection is that such a system would at least initially have an imperfect model of human values, and that this divergence could be dangerous, if not immediately, then increasingly so as the system gains in capability. However, it’s worth noting that while we tend to think of alignment as binary, the entire point of the otherizer problem is that it may be possible to have an AI that is safe and useful even if its model of our values isn’t perfect to start with. After all, if the model were perfect, it would be safe to simply maximize it and call it a day; otherizers are intended to operate when that isn’t safe.

If this idea can be usefully implemented, it will probably require breakthroughs that I have no conception of as yet. However, to make this proposal as concrete as possible, I am currently thinking of it as follows:

  1. Create a virtual environment, somewhat like the current XLand system, with two agents, the predictor and the user.

  2. Assign the user a utility function. At first, assign it randomly, though later it may be possible to choose utility functions specifically for providing a challenge to the predictor, or for shoring up specific blind spots the predictor has.

  3. The user operates in the environment according to their utility function. The predictor may interact with the user and environment, observing, communicating with the user, or adopting whatever other strategy it wishes.

  4. The predictor’s actions are scored according to the utility of the user as it was at the beginning of the interaction (so as not to give the predictor an incentive to alter the user’s values, or to reward doing so). This should reward both conservatism along the way to learning the user’s values and taking action that increases the user’s utility once it is well enough understood. The predictor can decide the best tradeoff between spending more effort learning values versus fulfilling them, as well as whether to maximize, quantilize, or do something else (a toy sketch of this loop appears after the list).

  5. If need be, the predictor can also be scored according to how well it characterizes the user’s utility function, perhaps being asked to output what it thinks the user was programmed to do, to output a map of the user’s neural net, or to answer a battery of questions about what the user would prefer in a range of situations. In the last case, it might be useful to use an adversarial agent to select precisely those questions most likely to trip up the predictor.
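
To make the loop above as concrete as possible, here is a deliberately tiny, assumption-laden sketch of how steps 1 through 4 might fit together. The “environment”, the agent interfaces, and the predictor’s learning rule are all invented stand-ins for illustration and have nothing to do with the real XLand system:

```python
import random

# The "environment" is just a set of items the agents can hand to each other.
ITEMS = ["red_cube", "blue_cube", "green_sphere", "yellow_pyramid"]

def random_user_utility(rng):
    """Step 2: assign the user a random utility over items."""
    weights = {item: rng.uniform(-1, 1) for item in ITEMS}
    return lambda item: weights[item]

def user_demonstration(utility, rng, n=5):
    """Step 3: the user acts on its utility -- here, it mostly picks items it prefers."""
    def pick():
        a, b = rng.sample(ITEMS, 2)
        return a if utility(a) >= utility(b) else b
    return [pick() for _ in range(n)]

def fit_utility_estimate(demonstration):
    """A stand-in for the predictor's learning: estimate preferences from pick frequencies."""
    counts = {item: demonstration.count(item) for item in ITEMS}
    return lambda item: counts[item] / max(1, len(demonstration))

def episode(rng):
    true_u = random_user_utility(rng)          # step 2: the user's randomly assigned values
    demo = user_demonstration(true_u, rng)     # step 3: the user acts; the predictor observes
    est_u = fit_utility_estimate(demo)         # the predictor's model of those values
    action = max(ITEMS, key=est_u)             # the predictor acts on its model
    return true_u(action)                      # step 4: scored by the user's own utility

rng = random.Random(0)
scores = [episode(rng) for _ in range(100)]
print("average score over 100 episodes:", sum(scores) / len(scores))
```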

Technological progress has historically relied on experimentation, looking to the real world to check our ideas and to give us suggestions we could not easily generate ourselves. If AI alignment research has stalled, perhaps it is time to start testing models! After all, many people tried unsuccessfully to create a flying machine, but it was the Wright Brothers who built a wind tunnel to test their designs. And perhaps not coincidentally, it was the Wright Brothers who flew two years later.

  1. ^

    My favorite example of this is Murray Chandler’s book How to Beat Your Dad at Chess, in which he lists “fifty deadly checkmates” that show up often enough that knowing them can dramatically boost a beginner’s strength.

  2. ^

    Though it ended up being part of the solution.

  3. ^

    In all fairness to MIRI, much of the focus on this approach was less a failure of imagination and more a matter of wanting to train people up in clearer domains less vulnerable to crackpots. Responding recently to the claim that “MIRI have been very gung-ho about using logic and causal networks. At the same time they mostly ignored learning theory,” Eliezer wrote:

    “I’ll remark in passing that I disagree with this characterization of events. We looked under some street lights where the light was better, because we didn’t think that others blundering around in the dark were really being that helpful—including because of the social phenomenon where they blundered around until a bad solution Goodharted past their blurry filters; we wanted to train people up in domains where wrong answers could be recognized as that by the sort of sharp formal criteria that inexperienced thinkers can still accept as criticism.

    That was explicitly the idea at the time.”

  4. ^

    This is necessary because otherwise an AGI trained to learn human values would likely destroy the world in its own right. Maybe it dissects human brains to better map out our neurons and figure out what we would have wanted!

  5. ^

    These are AI systems which choose a random policy which is expected to be in the top n percent of outcomes ranked by utility. The hope is that they will exert enough optimization pressure to be useful, without exerting enough to create degenerate results. The concern is that they may give away too much utility, randomly choose a catastrophic policy anyway, or self-modify into more directly dangerous systems like maximizers.

  6. ^

    Specifically with the real values of the other agent at the start of the interaction, so as not to reward altering the other agent.