# Critiquing “What failure looks like”

I find myself somewhat confused as to why I should find Part I of “What failure looks like” (hereafter “WFLL1”, like the pastry) likely enough to be worth worrying about. I have 3 basic objections, although I don’t claim that any are decisive. First, let me summarize WFLL1 as I understand it:

In general, it’s easier to optimize easy-to-measure goals than hard-to-measure ones, but this disparity is much larger for ML models than for humans and human-made institutions. As special-purpose AI becomes more powerful, this will lead to a form of differential progress in which easy-to-measure goals get optimized well past the point where they correlate with what we actually want.
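
As a toy illustration of that summary (not anything taken from the original essay), here is a minimal sketch in which a greedy optimizer is pointed first at an easy-to-measure proxy and then at the thing we actually want. Both `proxy` and `true_value` are made-up functions chosen only so that they agree for small `x` and come apart for large `x`; the step size and iteration budget are likewise arbitrary assumptions.

```python
def true_value(x):
    # Hypothetical "what we actually want": improves with x at first,
    # peaks at x = 10, then gets worse.
    return x - x * x / 20.0


def proxy(x):
    # Hypothetical easy-to-measure stand-in: always rewards more x.
    return x


def hill_climb(score, x=0.0, step=1.0, iterations=30):
    """Greedy local search: keep stepping while the score improves."""
    for _ in range(iterations):
        if score(x + step) > score(x):
            x += step
        else:
            break
    return x


x_proxy = hill_climb(proxy)       # never stops improving, so it uses the whole budget
x_true = hill_climb(true_value)   # stops at the peak of true_value

print("optimizing the proxy:", x_proxy, "-> true value", true_value(x_proxy))
print("optimizing the target:", x_true, "-> true value", true_value(x_true))
```

The more optimization steps the proxy optimizer gets, the worse the true value becomes, which is the shape of the worry, although a toy like this says nothing about how likely it is in practice.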

(See also: this critique, although I agree with the existing rebuttals to it).

# Objection 1: Historical precedent

In the late 1940s, George Dantzig invented the simplex algorithm, a practically efficient method for solving linear optimization problems. At the same time, the first modern computers were coming around, which he had access to as a mathematician in the US military. For Dantzig and his contemporaries, a wide class of previously intractable problems suddenly became solvable, and they did use the new methods to great effect, playing a major part in developing the field of operations research.

With the new tools in hand, Dantzig also decided to use simplex to optimize his diet. After carefully poring over prior work, and putting in considerable effort to obtain accurate data and correctly specify the coefficients, Dantzig was now ready, telling his wife:

whatever the [IBM] 701 says that’s what I want you to feed me each day starting with supper tonight.

The result included 500 gallons of vinegar.

After delisting vinegar as a food, the next round came back with 200 bouillon cubes/day. There were several more iterations, none of which worked, and after everything Dantzig simply went with a “common-sense” diet.
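
To see how this kind of result can fall out of a perfectly correct optimizer, here is a minimal sketch of a one-constraint diet problem. It is not a reconstruction of Dantzig's actual model or data: the foods, prices, calorie figures and the single "enough calories" constraint are all invented for illustration. With one constraint and positive costs, a linear program's optimum puts the whole diet on whichever food has the lowest cost per calorie, so no simplex implementation is needed here.

```python
# Hypothetical data: cost and calories per unit. All numbers are invented.
FOODS = {
    "bread": {"cost": 0.50, "calories": 250.0},
    "beans": {"cost": 0.80, "calories": 300.0},
    "bouillon cube": {"cost": 0.012, "calories": 8.0},
    "vinegar (gallon)": {"cost": 0.04, "calories": 40.0},
}


def cheapest_diet(foods, calorie_target):
    """Minimize total cost subject to total calories >= calorie_target.

    With a single constraint, non-negative quantities and positive costs,
    the optimum buys only the food with the lowest cost per calorie.
    """
    best = min(
        foods,
        key=lambda name: foods[name]["cost"] / foods[name]["calories"],
    )
    units = calorie_target / foods[best]["calories"]
    return best, units


first = cheapest_diet(FOODS, 2000.0)

# "Delist" the offending food and re-run, as in the anecdote.
without_vinegar = {k: v for k, v in FOODS.items() if k != "vinegar (gallon)"}
second = cheapest_diet(without_vinegar, 2000.0)

print(first)
print(second)
```

Each run is optimal for the problem as stated; the absurdity comes entirely from what the specification leaves out, and removing one bad answer just promotes the next one.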

The point I am making is, whenever we create new methods for solving problems, we end up with a bunch of solutions looking for problems. Typically, we try to apply those solutions as widely as possible, and then quickly notice when some of those solutions don’t solve the problems we actually want to solve.

Suppose that around 1950, we were musing about the potential consequences of the coming IT revolution. We might’ve noticed that we were entering the era of the algorithm, where a potentially very wide class of problems could be solved—if they could be reduced to arithmetic and run on the new machines, with their scarcely fathomable ability to memorize a lot and calculate in mere moments. And we could ask “But what about love, honor or justice? Will we forget about those unquantifiable things in the era of the algorithm?” [excuse me if this phrasing sounds snarky] And yet, in the decades since, we seem to have basically just used computers to solve the problems we actually want to solve, and we don’t seem to have stopped valuing the things that aren’t under their scope.

If we round off WFLL1 to “when you have a hammer, everything looks like a nail”, then this only seems mildly and benignly true of most technologies. The trend seems to be that if technology A makes us better at some class of tasks X, we poke around to see just how big X is until we’ve delineated the border well, and then we stop, with the exploratory phase rarely causing large-scale harm.

I don’t think the OP is intending WFLL1 to say something this broad, but then I feel it should be clarified why “this time is different”, such as why modern D(R)L should be fundamentally different from linear optimization, the IT revolution, or even non-deep ML.

(I think the discontinuity-based arguments largely do make the “this time is different” case, roughly because general intelligence seems clearly game-changing. WFLL2 seems somewhere in between these, and I’m unsure where my beliefs fall on that.)

# Objection 2: Absence of evidence

I don’t see any particular evidence of WFLL1 unfolding as we conclude the 2010s. As I understand it, WFLL1 should gradually “show up” well before AGI, but given that we already have a lot of ML deployed, this at least raises the question of when it should be expected to become noticeable, in terms of the necessary capabilities of the AI systems.

# Objection 3: Why privilege this axis (of differential progress)

It seems likely that if ML continues to advance substantially over the coming decades (at something like the 2012-2019 rate), then it will cause substantial differential progress. But along what axes? WFLL1 singles out the axis “easy-to-measure vs. hard-to-measure”, and it’s not clear to me why we should worry about this one in particular.

For instance, there’s also the axis “have massive datasets vs. don’t have massive datasets”. And we could point to various examples of this form, e.g. it’s easy to measure a country’s GDP year over year, but we can get at most a few hundred data points on this, hence it’s completely unsuitable for DL. So, for instance, we could see differential progress on microeconomics vs. macroeconomics.

More generally, we could ask what DL seems weak at, or what it requires in order to work:

• Performance at the task must be easy to measure

• A massive, labelled, digitized training set must exist (or can be easily made with e.g. self-play)

• DL seems relatively weak at learning causality

• (Other things listed by e.g. Gary Marcus)

And from there, we could reasonably extrapolate to what DL will be good/bad at, relative to the baseline of human thinking/heuristics.

WFLL1 seems to basically say: “here’s this axis of differential progress (arising from a limitation of DL), and here are some examples of ways things can go wrong as a result”. But for any other limitation we list, I’d suspect we can also list examples such as “if DL is really capable in general but really bad at causal modeling, here’s a thing that can go wrong.”

At least to me, the ease-of-measurement bullet point does not seem to pop out as a very natural category: if interpreted broadly, it does not capture everything that seems plausibly important, and if interpreted narrowly, it does not seem narrow enough to focus our attention on any one interesting failure mode.

• Objection 1: Historical precedent

I’m pretty sure WFLL1 only applies in the case where AI is “responsible for” some very large fraction of the economy (I imagine >90%), for which we don’t really have much of a historical precedent.

And we could ask “But what about love, honor or justice? Will we forget about those unquantifiable things in the era of the algorithm?”

When I imagine WFLL1 that doesn’t turn into WFLL2, I usually imagine a world in which all existing humans lead great lives, but don’t have much control over the future. On a moment-to-moment basis, that world is better than the current world, but we don’t get to influence the future and make use of the cosmic endowment, and so from a total view we have lost >99% of the potential value of the future. Such a world can still include love, honor and justice among the humans who are still around.

On the other hand, the last time I mentioned this among ~6 people, all at least interested in AI safety, not a single other person shared this impression. They still found WFLL1 convincing, but as an example of a world that was moment-to-moment worse than the current world while still not being WFLL2.

Objection 2: Absence of evidence

AI has a very minor economic impact right now, but even so, I’d argue that the concerns over fairness and bias in AI are evidence of WFLL1, since we can’t measure the “fairness” of a classifier.

Objection 3: Why privilege this axis

Mostly that for all the other axes you name, I expect deep learning to eventually overcome the limitation. To be fair, I also think that deep learning models will be able to do what we mean rather than what we measure, but that seems like the one most likely to fail. (I do find the dataset axis somewhat convincing, but even there I expect self-supervised learning to make that axis less important.)

• When I imagine WFLL1 that doesn’t turn into WFLL2, I usually imagine a world in which all existing humans lead great lives, but don’t have much control over the future. On a moment-to-moment basis, that world is better than the current world, but we don’t get to influence the future and make use of the cosmic endowment, and so from a total view we have lost >99% of the potential value of the future.

The availability of AI still probably increases humans’ absolute wealth. This is a problem for humans because we care about our fraction of influence over the future, not just our absolute level of wealth over the short term.
• First off, if we have specific evidence (an answer to Objection 2) then the historical analogy in Objection 1 looks a lot weaker, as any real evidence of WFLL1 arising now would suggest that the historical cases of other algorithms that gave pathological results just aren’t representative. I think they aren’t representative.

(I think the discontinuity-based arguments largely do make the “this time is different” case, roughly because general intelligence seems clearly game-changing. WFLL2 seems somewhere in between these, and I’m unsure where my beliefs fall on that.)

The key difference ‘this time’ (before we get anywhere near WFLL2 or AGI), as I see it, is that those early algorithms gave recommendations to people, which they could implement or ignore, so the ‘exploratory phase’ where we poked around to find out what they were capable of was pretty much risk-free. WFLL1, by contrast, implies that the systems have some degree of autonomy and actually have a chance to do unexpected things without humans realizing straight away. Dantzig’s linear optimization leading to a catastrophe would have required more carelessness and stupidity than (current or very near-future) deep RL, because deep RL’s mistakes are subtler and because it has to be loosed on some environment to achieve results and give us useful information on its behaviour. As for evidence, Stuart Russell thinks that we are already seeing WFLL1 in social media ad algorithms:

“Consider the so-called ‘filter bubble’ of social media. The reinforcement learning algorithm is trying to maximize click throughs. From the view of the human, the purpose of the machine is to maximize clickthroughs. But from the view of the machine, it is changing the state of the world to maximize clicks. It is changing you to make you more predictable. A raving fascist or communist is more predictable and will lap up raving content. The machines can change our mind about our objective function so we are easier to satisfy. Advertisers have done this for decades.” [I argued with him about this feedback loop, and Yann Le Cun says this changed at Facebook a while ago]

“The reinforcement learning algorithm in social media has destroyed the EU, NATO and democracy. And that’s just 50 lines of code.”

I wonder if this hypothesis was in Paul’s mind when he wrote the essay. If Russell is right about any of this that suggests that one of the first times we gave deep RL any ability to influence the world it succumbed to a failure scenario almost immediately. That’s not a good track record.

• A raving fascist or communist is more predictable and will lap up raving content. The machines can change our mind about our objective function so we are easier to satisfy.

That’s a good way to put it!

This might be stretching the analogy, but I feel like there’s a similar thing going on with technological evolution of “gadgets” (digital watch, iPod, cell phone). It feels like people’s expectations of what a gadget should be able to do for them to make them content continue to grow at a rate so fast that something as simple and obviously beneficial as “battery life” never really receives an improvement. I get that not everyone is bothered by having to charge things all the time (and losing the charger all the time), but how come it’s borderline impossible to buy things that don’t need to be charged so often? It feels like there’s some optimization pressure at work here, and it’s not making life more convenient. :)

• For people who share the intuition voiced in the OP, I’m curious if your intuitions change after thinking about the topic of recommender systems and filter bubbles in social media. Especially as portrayed in the documentary “The Social Dilemma” (summarized in this Sam Harris podcast). Does that constitute a historical precedent?

• Suppose it was easy to create automated companies and skim a bit off the top. AI algorithms are just better at business than any startup founder. Soon some people create these algorithms, give them a few quid in seed capital, and leave them to trade and accumulate money. The algorithms rapidly increase their wealth, and soon own much of the world economy. Humans are removed once the AIs have the power to do so at a profit. This ends in several superintelligences tiling the universe with economium together.

For this to happen, we need

1) A doubling time for fooming AI of months to years, to allow many AIs to be in the running.

2) It’s fairly easy to set an AI to maximize money.

3) The people that care about complex human values can’t effectively make an AI to do that.

4) Any attempt to stamp out all fledgling AIs before they get powerful fails. This is helped by anonymous cloud computing.

I don’t really buy 1), though it is fairly plausible. I’m not convinced of 2) either, although it might not be hard to build a mesa optimiser that cares about something sufficiently correlated with money that humans are beyond caring before any serious deviation from money optimization happens.

If 2) were false, and people who tried to make AIs all got paperclip maximisers, the long-run result is just a world filled with paperclips, not banknotes. (Although this would make coordinating to destroy the AIs a little easier?) The paperclip maximisers would still try to gain economic influence until they could snap nanotech fingers.