[Book Review] “The Alignment Problem” by Brian Christian
I came to this book with an ax to grind. I combed through page after page for factual errors, minor misrepresentations, or even just contestable opinions. I spotted (what seemed like) omission after omission, only to be frustrated a few pages later when Brian Christian addressed them. In Chapter 5: Shaping, I thought I had found a major mistake. Brian Christian addressed Skinnerian operant conditioning without addressing the real way we manage human groups: leading by example.
That’s because he dedicated all of Chapter 7: Imitation to the subject. Thus, through gritted teeth, I reluctantly acknowledge that The Alignment Problem by Brian Christian is a fantastic book in all respects.
Despite my best efforts, Brian Christian even taught me lots of cool things about state-of-the-art machine learning. The Alignment Problem addresses advanced technical problems while being readable to non-technical people. This book would be a useful read both for activists who want to better understand public policy AND for aspiring engineers who want to get up to speed with machine learning. The only possible fault I can imagine with this book is that, since it depends so heavily on cutting-edge research, it might be rendered obsolete in a decade or two. Much of it mirrors the actual technical work I’m doing in machine learning.
The book starts with practical real-world problems that are happening right now. Most of the book is dedicated to explaining machine learning problems and their solutions. At the end, it extrapolates to the choices machine learning creates for our future.
Racist Machines
In 2015 Google famously released an image classifier that labelled Black people as “gorillas” because there were so few Black people in its training dataset. The good solution is to add more Black people to the training dataset. The fast solution is to keep using a biased algorithm and just cover up the most egregious errors. I don’t know which approach Google went with but “three years later, in 2018, Wired reported that the label ‘gorilla’ was still manually deactivated in Google Photos.”
I’m curious what animal I would get classified as if people who look like me were removed from Google Photos training dataset. (I hope it’s a meerkat.) Alas, computer algorithms are already being used to make much more consequential decisions. Many of these decisions involve problems that can’t be solved just by collecting more training data.
The Mathematics of Social Justice
[This section refers to situations in the United States unless otherwise noted.]
Black people and White people self-report similar marijuana usage. However, Black people are arrested for marijuana usage much more frequently than White people. Suppose you are designing an algorithm to determine how much to punish a prisoner for smoking marijuana. If you ignore the prisoner’s race then you will inflict penalties several times harsher on Black people than on White people. Race blindness produces racist outcomes.
Suppose we factor in race to create fair outcomes on average. A White citizen arrested for smoking pot now gets punished 7× harsher than a Black citizen arrested for the same offense. This is unfair to White people, who are now treated more harshly on the basis of skin color. We are balancing two mutually exclusive values: either we punish Black people unfairly or we punish White people unfairly. We cannot be simultaneously fair to both groups.
The solution to the above problem is “stop arresting Black people disproportionately for the same crime” but that only solves the problem for crimes where different races have the same base rates. What happens when Black people actually do commit more crimes?
Suppose we’re designing a system to decide which convicts to offer parole. Correctly predicting who will go on to commit more crimes prevents crime. We want our algorithm to be as accurate as possible. Race has predictive value. If we ignore race, violent crimes will be committed against innocent people. We want the algorithm to be as fair as possible. We also want it to treat people of different races the same way.
It is mathematically impossible to satisfy both criteria simultaneously. If we maximize accuracy then Black people will be offered parole less frequently than White people with identical criminal records. If we maximize racial fairness then we lose predictive accuracy; White people have to commit fewer crimes than their Black counterparts to earn equivalent treatment.
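The impossibility is not rhetorical; it falls out of a simple identity (due, if I recall correctly, to Chouldechova's work on COMPAS): a classifier's false positive rate is pinned down by the group's base rate, the classifier's precision, and its miss rate. If two groups have different base rates, a classifier with equal precision and equal miss rate for both groups must have unequal false positive rates. A minimal sketch, with made-up numbers:

```python
def false_positive_rate(p, ppv, fnr):
    """Chouldechova's identity: the false positive rate is determined
    by the base rate p, the precision (PPV), and the miss rate (FNR)."""
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hypothetical numbers: identical precision (0.7) and miss rate (0.2)
# for both groups, but different underlying reoffense rates.
fpr_group_a = false_positive_rate(p=0.5, ppv=0.7, fnr=0.2)
fpr_group_b = false_positive_rate(p=0.3, ppv=0.7, fnr=0.2)

print(round(fpr_group_a, 3))  # 0.343
print(round(fpr_group_b, 3))  # 0.147
```

Holding precision and miss rate equal forces the false positive rates apart; equalizing the false positive rates instead forces precision or miss rate apart. That is the dilemma in one equation.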
In both cases we need to know peoples’ races.
Race correlates with recidivism. If you want to minimize recidivism then it’s useful to know potential parolees’ races.
Race correlates with lots of other measurable attributes. If we want to be race-blind it’s not enough to just erase the “race” column from our training data. We must control for everything that correlates with race. Knowing peoples’ races is a prerequisite to designing a race-blind system.
“The most robust fact in the research area,” [Moritz] Hardt says, “is that fairness through blindness doesn’t work. That’s the most established and most robust fact in the entire research area.”
I like that computer algorithms are imposing racist judgments on vulnerable populations. Algorithms may be hard to debug but at least they are possible to debug. For thousands of years we have relied on the whims of human judges. Upgrading from opaquely racist humans to transparently racist algorithms is a giant step forward for society!
Embedding Human Values
Everything we’ve covered so far refers to simple algorithms. What if we just told the computer “do the right thing”? The phrase “do the right thing” is written in English. English is sexist. You can observe the sexism of English by dumping English text into word2vec and then doing arithmetic on it.
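The arithmetic in question is the famous analogy trick: take the vector for "doctor", subtract "man", add "woman", and see which word's vector lies nearest. A minimal sketch with hand-made toy vectors (the numbers are fabricated for illustration; real word2vec embeddings have hundreds of dimensions learned from a corpus, but they exhibit exactly this kind of gendered geometry):

```python
import math

# Toy 3-dimensional "embeddings" (fabricated: axis 0 loosely encodes
# gender, axes 1-2 encode profession). Real embeddings are learned.
vectors = {
    "man":      [1.0, 0.0, 0.1],
    "woman":    [-1.0, 0.0, 0.1],
    "doctor":   [0.6, 1.0, 0.0],
    "nurse":    [-0.7, 1.0, 0.0],
    "engineer": [0.8, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by nearest cosine neighbour."""
    query = [vb - va + vc for va, vb, vc in
             zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

print(analogy("man", "doctor", "woman"))  # nurse
```

The embedding didn't invent the stereotype; it inherited it from the text it was trained on, which is the point of the chapter.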
This is another manifestation of the accuracy-vs-fairness problem we observed earlier. Nurses are overwhelmingly female. Either a language reflects this or it doesn't. If a language reflects the fact that nurses are mostly female, then that language is sexist. If a language ignores real-world phenomena, then your map deviates from the territory.
Transparency
After explaining the societal tradeoffs of algorithms, Brian Christian writes a chapter about interpretability in machine learning. The core idea is that it’s useful to have tools that show us what’s going on inside machine learning algorithms. Chapter 3: Transparency is the best machine learning textbook I’ve read in a long time, which is weird because it includes neither equations nor computer code.
Here is the most important paragraph.
When Yahoo’s vision team open-sourced a model used to detect whether an uploaded image was pornographic, UC Davis PhD student Gabriel Goh used this generative method to tune static into shapes the network regarded as maximally “not safe for work.” The result was like pornography by Salvador Dalí. If you optimize for some combination of the obscenity filter and normal ImageNet category labels―for instance, volcanoes―you get, in this case, obscene geography: what look like giant granite phalluses, ejaculating clouds of volcanic ash.
Brian Christian is such a tease. He doesn’t reproduce any of the images that go with this description! The original paper has even been taken down from GitLab. A backup is available here [NSFW] on Archive.org.
You’re welcome.
Reinforcement Learning
Having already turned social justice into mathematics and explained how to debug a neural network, Brian Christian explains operant conditioning as it applies to machine learning. This chapter is weak in the sense that it does not comprehensively explain the entire art of animal training (for that, read Don’t Shoot the Dog by Karen Pryor). The Alignment Problem isn’t the best introductory psychology textbook I’ve ever read. But it’s not the worst either. Which is impressive for a book about machines. It’s not trying to be an introductory psychology textbook.
Animals are good at reinforcement learning. Machines aren’t, because feedback is “terse”, “not especially constructive”, and “delayed”―all of which throws a wrench in stochastic gradient descent. The delay especially makes stochastic gradient descent difficult because it feeds a combinatorial explosion in hypothesis space. We’re not sure what to use instead because we don’t know how the brain works.
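The "delayed" part is the credit-assignment problem: when the only reward arrives at the end of an episode, every earlier action must share credit for it. The standard workaround is to propagate a discounted return backward through the episode. A minimal sketch (the reward sequence and discount factor are made up for illustration):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1}, working backward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Ten-step episode with a single reward on the final step.
episode = [0.0] * 9 + [1.0]
returns = discounted_returns(episode)

# Every action gets nearly the same credit, even the first one,
# which is why long delays blur the learning signal.
print(round(returns[0], 3), round(returns[-1], 3))  # 0.914 1.0
```

With thousands of steps between action and reward, the gradient signal for any single action is vanishingly diluted, which is the combinatorial explosion Christian is gesturing at.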
Brian Christian seems to subscribe to the idea that the biological brain is based around reinforcement learning against deviations from expectation. I agree. The fact that animals are reinforcement learners is observable just from black-box behaviorist experiments. “Deviations from expectation” is a synonym for predictive processing. All of this is just setup for the thesis of The Alignment Problem: what happens when you implement sample-efficient reinforcement learning in a powerful machine intelligence?
Reinforcement learning in its classical form takes for granted the structure of the rewards in the world and asks the question of how to arrive at the behavior―the “policy”―that maximally reaps them. But in many ways this obscures the more interesting―and more dire―matter that faces us at the brink of AI. We find ourselves rather more interested in the exact opposite of this question: Given the behavior we want from our machines, how do we structure the environment’s rewards to bring that behavior about? How do we get what we want when it is we who sit in the back of the audience, in the critic’s chair―we who administer the food pellets, or their digital equivalent?
This is the alignment problem, in the context of a reinforcement learner.
Brian Christian does a good job of stating the alignment problem as it is understood by the mainstream AI Safety community (insofar as “AI Safety” can be considered “mainstream”).
If you want an employee to behave creatively and ethically, punishments and rewards alone don’t work. A human, a cat, or even a machine will try to hack your system. Operant conditioning is just one tool in the toolbox. If we want to build a superintelligence, we need more robust tools.
Curiosity
Dogs and dolphins trained by human beings are motivated by more than just treats. They get bored. Plus, dogs like making people happy.
Reinforcement learning doesn’t work if the rewards are too infrequent. One way to get around this is to manually shape behavior (which is fraught with risks). Another way is to program curiosity. This chapter rounds out the extremism of the Skinnerian chapter. Brian Christian conceives of intrinsic motivation as “novelty or surprise or some other related scheme”.
This bit is really cool. You can get extremely powerful videogame-playing AIs when you motivate them solely by novelty and ignore score entirely, because dying in a videogame reverts you to the start screen, which is boring.
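Count-based exploration is one simple way to implement this: the agent's intrinsic reward for a state shrinks with every visit, so a familiar state (like the start screen after dying) pays almost nothing while a fresh state pays in full. A minimal sketch (the 1/√n bonus is one common choice in the literature, not necessarily the exact scheme in the work Christian describes):

```python
import math
from collections import Counter

class NoveltyBonus:
    """Intrinsic reward that decays as a state becomes familiar."""
    def __init__(self):
        self.visits = Counter()

    def reward(self, state):
        self.visits[state] += 1
        return 1.0 / math.sqrt(self.visits[state])

bonus = NoveltyBonus()
print(bonus.reward("start_screen"))  # 1.0 -- novel the first time
print(bonus.reward("start_screen"))  # ~0.707 -- already getting boring
print(bonus.reward("level_2"))       # 1.0 -- new state, full reward
```

An agent maximizing this signal is pushed toward states it hasn't seen, which in practice means staying alive and progressing through the game even though the score is never consulted.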
If this sounds like how human children play videogames… well, yeah.
Imitation
The other way to get a machine to do what you want is to do it yourself and tell the machine to copy you…like a human child.
Corrigibility
Corrigibility is the property of a machine that lets a human being pull the plug on it. Corrigibility is easy enough to build into machines with constrained views of reality. It is hard to maintain for intelligent or superintelligent agents attempting to optimize the real world, because human interference is an obstacle for a machine to route around. This is the heart of the alignment problem so popular on Less Wrong.
The ultimate solution is “a machine which embodies human preferences”. This is hard on both technical and philosophical grounds. Human values are complex, under-specified, built on top of ontological prejudices, produce mathematically irreconcilable internal contradictions within individual people, and differ between people. They evolve over time. Plus there’s the issue of moral uncertainty. Add on the technical puzzles of building a machine to embody those ideas, and the challenge is barely on the side of possible.
But if we can do it we will have brought God to a godless universe.
Credits
This post was funded by Less Wrong. Thank you!
If anyone was wondering, no journalists bothered reporting this, but that system classified white people as ‘dogs’ and ‘seals’.
The linked article is interesting, and also suggests that it’s not as simple as
because the issue isn’t simply “our system sometimes misclassifies people as animals”, it’s “our system sometimes misclassifies people as animals, and one not-so-rare case of this happens to line up with an incredibly offensive old racist slur”—and that last bit is a subtle fact about human affairs that there’s no possible way the system could have learned from looking at labelled samples of images. The dataset had a good mix of races in it; humans do look rather like other great apes; in the absence of the long horrible history of racism this misclassification might have been benign. To do better the system would need, in some sense, to know about racism.
Maybe the best one can do really is something like artificially forbidding classifications like “ape” and “gorilla” and “monkey” unless the activations for classifications like “human” are very very low, at least until we have an image classifier that’s genuinely intelligent and understands human history.
(There are probably a lot of other misclassifications that are anomalously offensive, though few will have the weight of centuries of slavery and racist abuse behind them. Fixing them would also require the system to “know” about details of human history, and again the best one can do might be to push down certain activations when it’s at all possible that the image might be of people.)
For anyone not clicking to read the article: its author “was technical leader for Google’s social efforts (including photos)” at the time, and doesn’t cite any public sources for the information. So we should at least consider how that’s going to colour their interpretation/representation of the information.
They don’t mention how often black people were classified as gorillas, and how that compared to white people being classified as dogs or seals. It could be that for every thousand cases of the former, there was one of the latter, or it may be one for one. My sibling comment says “The dataset had a good mix of races in it” (which I take to mean there was some reasonable proportional representation of races) - the article doesn’t claim that. It says “the training data included a wide range of people of all races and colors”, contrasting that with HP webcams where “the training data for “faces” had been composed exclusively of white people”—so it clears the bar of not being exclusively white, but we don’t know by how much. In fact, the article goes on to say (due to photography practices) “our standards for what constitute “good images” still overwhelmingly favor white faces rather than black ones.”
I’m writing this rather nitpicky comment because this is the top comment with rather strong wording (“no journalists bothered reporting this, but that system classified white people as ‘dogs’ and ‘seals’”), that on another day I might have just taken on faith (especially if I’d seen it was from gwern, which I didn’t at first) - I would have assumed the link contained a study, or at least images of results pages, and contained solid additional information about these results from a third party.
So, how many third parties reported about the classification and how trustworthy were they? How many studies were conducted on the classification of black people as gorillas? What should we make of an ecosystem which tells us on a literally daily to weekly basis (google the term) about the gorillas, but never, ever tells you about the seals (I only learned about that one because I was reading the Google expert’s post for other reasons)? What should we infer about the epistemics and justifications of the various experts and reporting here?
I’m writing this rather nitpicky comment because this is the top comment replying with rather strong wording about sourcing and studies and double standards for reporting...
This is an unnecessarily snarky addition to the comment that’s disappointing to see (and doesn’t even make sense, since mine is neither a top comment, nor does it mention studies). In case you interpreted the “especially if I’d seen it was from gwern” in a negative way, I meant it as a factual statement that I saw you as a person with high respect and trust and hence assigned high prior confidence to things from you.
The original post was by a disinterested third party sharing a screenshot. However small the level of evidence provided by that is, an offhand statement by someone literally involved in the project and with no attached evidence at all is obviously much weaker.
Let’s simulate two worlds:
In world (A), Google’s PR team were so incompetent that they did not mention this “white people misidentified as seals” at the time, even to say “our system sometimes misidentifies people as animals, for eg. <photo of white people with seal tag>, and we are improving the system”—which would have softened the PR blow significantly. Users see white people tagged as animals, but they never ever share it; or they do, but no one bothers to report it, not even a tabloid with space to fill and low standards, not even contrarian media that carries “All Lives Matter” articles and would love to use any “attack on whiteness”; and the screenshot doesn’t go viral either (despite “bots are funnily dumb” being a favourite meme category).
In world (B), the “white people tagged as seals” happens either only in obviously-distorted or blurry photos, or only in an internal test system that never even got out of Google, or in some other not-usable-for-PR way. Journalists do not report on it because they don’t see it. A Googler writes a non-official Medium article that’s not focused on this, two years after the fact, and mentions it in a couple of sentences offhandedly. Perhaps one or two journalists happen to read it for other reasons, just like you, but it’s mentioned as a past bug that’s likely fixed, and there’s no supporting evidence, nothing to show their editor as a potential article, so they move on.
With the evidence available to us, something similar to world (B) seems much more likely than world (A).
How should I reply to such a flagrant double standard, where a Twitter screenshot calling out Google is incontrovertible ‘disinterested’ evidence never to be questioned, and any objection is instead required to be multiple independent third-party studies?
They were not disinterested in the least! They were specifically ‘calling out’ and shaming Google for it, and it worked brilliantly in earning them huge fake internet points. (Someone who left Google and is mentioning it in an aside years later which no one noticed, that’s much closer to disinterested.)
No, it doesn’t. A is vastly more plausible. Happens every time. You don’t believe the seal thing? Fine, look at the recent Twitter cropping thing! You see anyone pointing out that the social media campaigns about the cropping algorithm’s bias were wildly incorrect and exaggerated in every way, and missed the actual pro-woman biases that Twitter’s followup study showed? (You wanted a study...) Or Tay! AFAICT, Tay could not learn, and so the media narrative is impossible. Did you hear that from anyone yet? Or how about the ‘arrested because facial recognition software accused them of shoplifting’? It’s literally in the original media article that they were arrested because a human told the cops to do so; did you hear that from anyone yet? World B is exactly what happens frequently. Did you not pay attention to how things like Timnit Gebru’s ultimatum were completely laundered out of media accounts? Or how about Mitchell, where media outlets quoting the Google statement edited out the part of the statement mentioning, y’know, what she did to get fired (dumping docs to leak to friendly reporters)? You’ve seen the factoids about how much CO2 training a DL model costs, did you see any of the followups like “oops we overestimated the cost by 10,000%” or “actually the cost is 0 because the datacenters use renewable energy”? How about that time that a prominent activist and Nvidia VP shared a literal enemies list for her followers to coordinate attacks on, which you could earn membership on for liking the wrong tweet? Do you not pay any attention to how activists and the media work? Have you not noticed the techlash or the ideological affiliations of reporters? We live in world A, not world B.
I believe there’s a conflict-vs-mistake thing going on here. I saw this as a specific case where the probabilities don’t line up the way your comment assumes they do. You seem to be from the beginning assuming I’m opposing an entire worldview and seeing it as some attack.
Behold, the Straw Man! Today for his trick, he turns “However small X’s level of evidence is, Y has even weaker evidence” into “X is incontrovertible evidence never to be questioned”.
I considered the original tweet, the fact that Google did not refute it, and the claim that Google blocked ape-related tags from Photos for years afterwards (with its own attached probability), and updated my inner measure of probability of this being true. And I find that the statement from the ex-Googler provides even weaker evidence to update based on. Nothing is “never to be questioned” here.
Again a thing I didn’t say, and you keep repeating.
A “claim from a disinterested party” means someone previously disinterested, someone where our prior assumptions can be close to assuming they are disinterested. A police officer is considered disinterested in a case if their family and friends are not involved in the case, and so can be assigned the case. It makes no sense to say “they’ve been assigned to the case now, so they’re not a disinterested party”!
This is the original tweet pointing out the issue. To me it just looks like a user casually pointing out a problem with a tool they use. Not everyone is obsessed all the time with culture wars and internet points.
Oh by the way, I found evidence that this author tweeted this seals claim back at the time as part of the Twitter thread about this. Still without any links or images, but that helped update my probabilities a little bit (as I had believed that part of the problem might be that it came two years later, as mentioned before). I wish that’s what this conversation had consisted of, actual evidence to try to arrive at the truth, instead of straw men and outright false claims.
The last paragraph has little to do with the claims here, unless you’re dumping an entire opposing worldview on me, and arguing against that imaginary person. For the record, I am much closer to your worldview regarding these issues and have noticed most of the things you mentioned. It’s just that in this instance even with that background there isn’t good enough evidence to believe the media suppressed some narrative.
My original comment has served its purpose to provide additional context for those who want it, and I don’t think further discussion with you here will be productive. Thanks for all the fish.
Lol, I had the exact same experience while reviewing the book. My most memorable one was when he introduced the critiques against COMPAS, and I thought he really should mention the case in favor of COMPAS, and I wrote a bunch of notes about why. He then did exactly that some number of pages later.
It would add some possibly-useful context to this review if you explained why you came to it with an axe to grind. (Just as race is both possibly-useful information and a possible source of prejudice to correct for, so also with your prior prejudices about this book.)
Much of the dialogue about AI Safety I encounter in off-the-record conversations seems to me like it’s not grounded in reality. I repeatedly hear (what I feel to be) a set of shaky arguments that both shut down conversation and are difficult to validate empirically.
The shaky argument is as follows:
Machine learning is rapidly growing more powerful. If trends continue it will soon eclipse human performance.
Machine learning equals artificial intelligence equals world optimizer.
World optimizers can easily turn the universe into paperclips by accident.
Therefore we need to halt machine learning advancement until the abstract philosophical + mathematical puzzle of AI alignment is solved.
I am not saying this line of reasoning is what AI researchers believe or that it’s mainstream (among the rationality/alignment communities)―or even that it’s wrong. The argument annoys me for the same reason a popular-yet-incoherent political platform annoys me; I have encountered badly-argued versions of the idea too many times.
I agree with #1, though I quibble that “absolute power” should be distinguished from “sample efficiency”, and that how we’ll get to superintelligence matters. (I am bearish on applying the scaling hypothesis to existing architectures.) I agree with #3 in theory. Theory is often very different from practice. I disagree with #2 because it relies on the tautological equivalence of two definitions. I can imagine superintelligent machines that aren’t world optimizers. Without #2 the argument falls apart. It might be easy to build a superintelligence but hard to build a world optimizer.
I approached The Alignment Problem with the (incorrect) prior that it would be more vague abstract arguments untethered from technical reality. Instead, the book was dominated by ideas that have passed practical empirical tests.
Thanks! (I would not have guessed correctly.)
This paragraph has two typos:
Fixed. Thanks.
Presumably the second usage of “Black” is supposed to be “White.”
Fixed. Thanks.
This is assuming the surveys are reliable. As far as I know, it’s been found that black people are more likely to lie about committing crime than white people? Scott Alexander discusses this problem briefly in this blog post.