The Alignment Problem: Machine Learning and Human Values


The Alignment Problem: Machine Learning and Human Values, by Brian Christian, was just released. This is an extended summary + opinion; a version without the quotes from the book will go out in the next Alignment Newsletter.


This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of:

- The gorilla misclassification incident

- The faulty reward in CoastRunners

- The gender bias in language models

- The failure of facial recognition models on minorities

- The COMPAS controversy (leading up to impossibility results in fairness)

- The neural net that thought asthma reduced the risk of pneumonia

It then moves on to agency and reinforcement learning, covering from a more historical and academic perspective how we have arrived at such ideas as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren’t always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming and mesa optimization that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, rewards will do ~nothing, and the main requirement to get a competent agent is to provide good shaping rewards or a good curriculum. Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example):

- BF Skinner’s experiments in training pigeons

- The invention of the perceptron

- The success of TD-Gammon, and later AlphaGo Zero

The final part, titled “Normativity”, delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities (how to get AI systems that optimize for their objectives), this last one tackles head-on the problem that we want AI systems that optimize for our (often-unknown) objectives, covering such topics as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty.


I really enjoyed this book, primarily because of the tracing of the intellectual history of various ideas. While I knew of most of these ideas, and often also who initially came up with the ideas, it’s much more engaging to read the detailed stories of _how_ that person came to develop the idea; Brian’s book delivers this again and again, functioning like a well-organized literature survey that is also fun to read because of its great storytelling. I struggled a fair amount in writing this summary, because I kept wanting to somehow communicate the writing style; in the end I decided not to do it and to instead give a few examples of passages from the book in this post.


Note: It is generally not allowed to have quotations this long from this book; I have specifically gotten permission to do so.

Here’s an example of agents with evolved inner reward functions, which lead to the inner alignment problems we’ve previously worried about:

They created a two-dimensional virtual world in which simulated organisms (or “agents”) could move around a landscape, eat, be preyed upon, and reproduce. Each organism’s “genetic code” contained the agent’s reward function: how much it liked food, how much it disliked being near predators, and so forth. During its lifetime, it would use reinforcement learning to learn how to take actions to maximize these rewards. When an organism reproduced, its reward function would be passed on to its descendants, along with some random mutations. Ackley and Littman seeded an initial world population with a bunch of randomly generated agents.

“And then,” says Littman, “we just ran it, for seven million time steps, which was a lot at the time. The computers were slower then.” What happens? As Littman summarizes: “Weird things happen.”

At a high level, most of the successful individual agents’ reward functions ended up being fairly comprehensible. Food was typically viewed as good. Predators were typically viewed as bad. But a closer look revealed some bizarre quirks. Some agents, for instance, learned only to approach food if it was north of them, for instance, but not if it was south of them.

“It didn’t love food in all directions,” says Littman. “There were these weird holes in [the reward function]. And if we fixed those holes, then the agents became so good at eating that they ate themselves to death.”

The virtual landscape Ackley and Littman had built contained areas with trees, where the agents could hide to avoid predators. The agents learned to just generally enjoy hanging out around trees. The agents that gravitated toward trees ended up surviving—because when the predators showed up, they had a ready place to hide.

However, there was a problem. Their hardwired reward system, honed by their evolution, told them that hanging out around trees was good. Gradually their learning process would learn that going toward trees would be “good” according to this reward system, and venturing far from trees would be “bad.” As they learned over their lifetimes to optimize their behavior for this, and got better and better at latching onto tree areas and never leaving, they reached a point of what Ackley dubbed “tree senility.” They never left the trees, ran out of food, and starved to death.

However, because this “tree senility” always managed to set in after the agents had reached their reproductive age, it was never selected against by evolution, and huge societies of tree-loving agents flourished.

For Littman, there was a deeper message than the strangeness and arbitrariness of evolution. “It’s an interesting case study of: Sure, it has a reward function—but it’s not the reward function in isolation that’s meaningful. It’s the interaction between the reward function and the behavior that it engenders.”

In particular, the tree-senile agents were born with a reward function that was optimal for them, provided they weren’t overly proficient at acting to maximize that reward. Once they grew more capable and more adept, they maxed out their reward function to their peril—and, ultimately, their doom.

Maybe everyone but me already knows this, but here’s one of the best examples I’ve seen about the benefits of transparency:

Ambrosino was building a rule-based model using the pneumonia data. One night, as he was training the model, he noticed it had learned a rule that seemed very strange. The rule was “If the patient has a history of asthma, then they are low-risk and you should treat them as an outpatient.”

Ambrosino didn’t know what to make of it. He showed it to Caruana. As Caruana recounts, “He’s like, ‘Rich, what do you think this means? It doesn’t make any sense.’ You don’t have to be a doctor to question whether asthma is good for you if you’ve got pneumonia.” The pair attended the next group meeting, where a number of doctors were present; maybe the MDs had an insight that had eluded the computer scientists. “They said, ‘You know, it’s probably a real pattern in the data.’ They said, ‘We consider asthma such a serious risk factor for pneumonia patients that we not only put them right in the hospital . . . we probably put them right in the ICU and critical care.’ ”

The correlation that the rule-based system had learned, in other words, was real. Asthmatics really were, on average, less likely to die from pneumonia than the general population. But this was precisely because of the elevated level of care they received. “So the very care that the asthmatics are receiving that is making them low-risk is what the model would deny from those patients,” Caruana explains. “I think you can see the problem here.” A model that was recommending outpatient status for asthmatics wasn’t just wrong; it was life-threateningly dangerous.

What Caruana immediately understood, looking at the bizarre logic that the rule-based system had found, was that his neural network must have captured the same logic, too—it just wasn’t as obvious.


Now, twenty years later, he had powerful interpretable models. It was like having a stronger microscope, and suddenly seeing the mites in your pillow, the bacteria on your skin.

“I looked at it, and I was just like, ‘Oh my— I can’t believe it.’ It thinks chest pain is good for you. It thinks heart disease is good for you. It thinks being over 100 is good for you....It thinks all these things are good for you that are just obviously not good for you.”

None of them made any more medical sense than asthma; the correlations were just as real, but again it was precisely the fact that these patients were prioritized for more intensive care that made them as likely to survive as they were.

“Thank God,” he says, “we didn’t ship the neural net.”
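To make the logic of the asthma example concrete, here is a minimal sketch (mine, not from the book). All the numbers and names (`OUTPATIENT_DEATH_RISK`, `ICU_RISK_MULTIPLIER`, `death_risk`) are hypothetical and chosen only so that the effect is visible: asthmatics are assumed to be at higher underlying risk, and intensive care is assumed to cut risk substantially.

```python
# Hypothetical numbers for illustration only; none of them come from the book.
OUTPATIENT_DEATH_RISK = {"asthma": 0.30, "no_asthma": 0.10}
ICU_RISK_MULTIPLIER = 0.2  # assumed benefit of intensive care


def death_risk(group: str, intensive_care: bool) -> float:
    """Risk of death for a pneumonia patient under the assumed toy model."""
    risk = OUTPATIENT_DEATH_RISK[group]
    return risk * ICU_RISK_MULTIPLIER if intensive_care else risk


# Historical policy described by the doctors: asthmatics go straight to intensive care.
observed_risk = {
    "asthma": death_risk("asthma", intensive_care=True),
    "no_asthma": death_risk("no_asthma", intensive_care=False),
}

# A model fit to observed outcomes alone would see asthma as protective...
asthma_looks_protective = observed_risk["asthma"] < observed_risk["no_asthma"]

# ...but acting on that rule removes the care that produced the low risk.
risk_if_sent_home = death_risk("asthma", intensive_care=False)

print(asthma_looks_protective)                          # True
print(risk_if_sent_home > observed_risk["no_asthma"])   # True
```

Under these assumptions the correlation in the data is real, but the rule learned from it would be harmful if used to decide on treatment, which is the point Caruana makes above.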

Finally, on the importance of reward shaping:

In his secret top-floor laboratory, though, Skinner had a different challenge before him: to figure out not which schedules of reinforcement ingrained simple behaviors most deeply, but rather how to engender fairly complex behavior merely by administering rewards. The difficulty became obvious when he and his colleagues one day tried to teach a pigeon how to bowl. They set up a miniature bowling alley, complete with wooden ball and toy pins, and intended to give the pigeon its first food reward as soon as it made a swipe at the ball. Unfortunately, nothing happened. The pigeon did no such thing. The experimenters waited and waited. . . and eventually ran out of patience.

Then they took a different tack. As Skinner recounts:

> We decided to reinforce any response which had the slightest resemblance to a swipe—perhaps, at first, merely the behavior of looking at the ball—and then to select responses which more closely approximated the final form. The result amazed us. In a few minutes, the ball was caroming off the walls of the box as if the pigeon had been a champion squash player.

The result was so startling and striking that two of Skinner’s researchers—the wife-and-husband team of Marian and Keller Breland—decided to give up their careers in academic psychology to start an animal-training company. “We wanted to try to make our living,” said Marian, “using Skinner’s principles of the control of behavior.” (Their friend Paul Meehl, whom we met briefly in Chapter 3, bet them $10 they would fail. He lost that bet, and they proudly framed his check.) Their company—Animal Behavior Enterprises—would become the largest company of its kind in the world, training all manner of animals to perform on television and film, in commercials, and at theme parks like SeaWorld. More than a living: they made an empire.
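As a rough illustration of the shaping idea in the pigeon story (my own toy sketch, not anything from the book), the code below compares a sparse reward with successive approximation. Everything in it is an assumption made for the example: behavior is scored on a hypothetical 0 to 10 "resemblance to a swipe" scale, the learner only tries small deterministic variations around its current habit, and a reinforced behavior simply becomes the new habit.

```python
from itertools import cycle

TARGET = 10               # hypothetical score for a full swipe at the ball
EXPLORATION = (-1, 0, 1)  # small, deterministic variations around the current habit


def train(shaped: bool, max_trials: int = 100):
    """Return (trial on which TARGET was first reached, or None; final habit)."""
    habit = 0
    # Sparse: only a full swipe is rewarded. Shaped: anything slightly better than the habit.
    criterion = 1 if shaped else TARGET
    offsets = cycle(EXPLORATION)
    for trial in range(1, max_trials + 1):
        behavior = habit + next(offsets)
        if behavior >= criterion:
            habit = behavior              # the reinforced behavior becomes the new habit
            if habit >= TARGET:
                return trial, habit
            if shaped:
                criterion = habit + 1     # raise the bar: successive approximation
    return None, habit


print(train(shaped=False))  # (None, 0): the reward is never reached, so nothing is learned
print(train(shaped=True))   # (30, 10): the target is reached by small rewarded steps
```

With the sparse reward, the learner never stumbles onto the rewarded behavior, so the reward does nothing; with the rising criterion, each small improvement is reinforced and the target is reached. Real agents explore stochastically and learn far less cleanly than this, so treat it only as a picture of why the shaping signal matters.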