Rohin Shah 19 Sep 2020 2:15 UTC
LW: 97 AF: 37
AF
on: Draft report on AI timelines
Planned summary for the Alignment Newsletter:
Once again, we have a piece of work so large and detailed that I need a whole newsletter to summarize it! This time, it is a quantitative model for forecasting when transformative AI will happen.
The overall framework
The key assumption behind this model is that if we train a neural net or other ML model that uses about as much computation as a human brain, that will likely result in transformative AI (TAI) (defined as AI that has an impact comparable to that of the industrial revolution). In other words, we _anchor_ our estimate of the ML model’s inference computation to that of the human brain. This assumption allows us to estimate how much compute will be required to train such a model _using 2020 algorithms_. By incorporating a trend extrapolation of how algorithmic progress will reduce the required amount of compute, we can get a prediction of how much compute would be required for the final training run of a transformative model in any given year.
We can also get a prediction of how much compute will be _available_ by predicting the cost of compute in a given year (which we have a decent amount of past evidence about), and predicting the maximum amount of money an actor would be willing to spend on a single training run. The probability that we can train a transformative model in year Y is then just the probability that the compute _requirement_ for year Y is less than the compute _available_ in year Y.
The vast majority of the report is focused on estimating the amount of compute required to train a transformative model using 2020 algorithms (where most of our uncertainty would come from); the remaining factors are estimated relatively quickly without too much detail. I’ll start with those so that you can have them as background knowledge before we delve into the real meat of the report. These are usually modeled as logistic curves in log space: that is, they are modeled as improving at some constant rate, but will level off and saturate at some maximum value after which they won’t improve.
Algorithmic progress
First off, we have the impact of _algorithmic progress_. <@AI and Efficiency@> estimates that algorithms improve enough to cut compute times in half every 16 months. However, this was measured on ImageNet, where researchers are directly optimizing for reduced computation costs. It seems less likely that researchers are doing as good a job at reducing computation costs for “training a transformative model”, and so the author increases the **halving time to 2-3 years**, with a maximum of **somewhere between 1-5 orders of magnitude** (with the assumption that the higher the “technical difficulty” of the problem, the more algorithmic progress is possible).
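As a rough illustration of the halving-time model above, here is a minimal Python sketch. It assumes a halving time of 2.5 years (the midpoint of the 2-3 year range) and a cap of 3 orders of magnitude (one value inside the 1-5 range), and it replaces the smooth saturating curve with a hard floor for simplicity. The function name and defaults are illustrative, not taken from the report.

```python
def algorithmic_progress_factor(years_since_2020: float,
                                halving_time_years: float = 2.5,
                                max_orders_of_magnitude: float = 3.0) -> float:
    """Multiplier on the 2020 compute requirement (1.0 means no progress yet)."""
    # Requirement halves once per halving time...
    uncapped = 0.5 ** (years_since_2020 / halving_time_years)
    # ...but never drops by more than the assumed maximum (hard floor, a simplification).
    floor = 10.0 ** (-max_orders_of_magnitude)
    return max(uncapped, floor)
```

Multiplying the 2020 compute requirement by this factor gives the requirement in a later year under these assumptions.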
Cost of compute
Second, we need to estimate a trend for compute costs. There has been some prior work on this (summarized in [AN #97](https://mailchi.mp/a2b5efbcd3a7/an-97-are-there-historical-examples-of-large-robust-discontinuities)). The report has some similar analyses, and ends up estimating **a doubling time of 2.5 years**, and a (very unstable) maximum of improvement by **a factor of 2 million by 2100**.
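To get a feel for how these two numbers interact, the sketch below computes how long uninterrupted doubling would take to reach the 2-million-fold maximum. It again uses a hard cap in place of the smooth saturating curve and assumes the trend starts in 2020; the names are hypothetical.

```python
import math

DOUBLING_TIME_YEARS = 2.5   # from the summary
MAX_IMPROVEMENT = 2e6       # from the summary: cap on FLOP-per-dollar gains

def flop_per_dollar_gain(years_since_2020: float) -> float:
    """Improvement in FLOP per dollar relative to 2020, with a hard cap (simplification)."""
    return min(2.0 ** (years_since_2020 / DOUBLING_TIME_YEARS), MAX_IMPROVEMENT)

# Years of steady doubling needed before the cap binds.
years_to_hit_cap = DOUBLING_TIME_YEARS * math.log2(MAX_IMPROVEMENT)
```

Under these simplifications the cap is reached a little over fifty years after 2020, well before 2100.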
Willingness to spend
Third, we would like to know the maximum amount (in 2020 dollars) any actor might spend on a single training run. Note that we are estimating the money spent on a _final training run_, which doesn’t include the cost of initial experiments or the cost of researcher time. Currently, the author estimates that all-in project costs are 10-100x larger than the final training run cost, but this will likely go down to something like 2-10x, as the incentive for reducing this ratio becomes much larger.
The author estimates that the most expensive run _in a published paper_ was the final <@AlphaStar@>(@AlphaStar: Mastering the Real-Time Strategy Game StarCraft II@) training run, at ~1e23 FLOP and $1M cost. However, there have probably been unpublished results that are slightly more expensive, maybe $2-8M. In line with <@AI and Compute@>, this will probably increase dramatically to about **$1B in 2025**.
Given that AI companies each have around $100B cash on hand, and could potentially borrow several hundred billion dollars more (given their current market caps and likely growth in the worlds where AI still looks promising), it seems likely that low hundreds of billions of dollars could be spent on a single run by 2040, corresponding to a doubling time (from $1B in 2025) of about 2 years.
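The implied doubling time can be checked with a short calculation. The $200B endpoint below is an assumed stand-in for "low hundreds of billions"; other values in that range give similar answers.

```python
import math

def implied_doubling_time(start_usd: float, end_usd: float, years: float) -> float:
    """Doubling time (in years) that takes start_usd to end_usd over the given span."""
    return years / math.log2(end_usd / start_usd)

# $1B in 2025 growing to an assumed $200B in 2040.
doubling_time = implied_doubling_time(1e9, 2e11, 2040 - 2025)
```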
To estimate the maximum here, we can compare to megaprojects like the Manhattan Project or the Apollo program, which suggests that a government could spend around 0.75% of GDP for ~4 years. Since transformative AI will likely be more valuable economically and strategically than these previous programs, we can shade that upwards to 1% of GDP for 5 years. Assuming all-in costs are 5x that of the final training run, this suggests the maximum willingness to spend should be 1% of GDP of the largest country, which we assume grows at ~3% every year.
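The arithmetic behind that last step can be spelled out as follows. This is a sketch of the summary's reasoning; the function and parameter names are hypothetical, and using 2020 as the base year for GDP growth is an assumption.

```python
def max_run_spend_fraction_of_gdp(gdp_fraction_per_year: float = 0.01,
                                  project_years: float = 5,
                                  all_in_to_final_run_ratio: float = 5.0) -> float:
    """Fraction of one year's GDP available for the final training run."""
    # Total project budget: 1% of GDP per year for 5 years = 5% of one year's GDP.
    all_in_fraction = gdp_fraction_per_year * project_years
    # Only 1/5 of the all-in cost goes to the final training run.
    return all_in_fraction / all_in_to_final_run_ratio

def gdp_in_year(gdp_2020: float, year: int, growth: float = 0.03) -> float:
    """GDP of the largest country, compounding at ~3% per year from an assumed 2020 base."""
    return gdp_2020 * (1.0 + growth) ** (year - 2020)
```

The maximum spend in a given year is then `max_run_spend_fraction_of_gdp() * gdp_in_year(gdp_2020, year)` for whatever 2020 GDP figure one supplies.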
Strategy for estimating training compute for a transformative model
In addition to the three factors of algorithmic progress, cost of compute, and willingness to spend, we need an estimate of how much computation would be needed to train a transformative model using 2020 algorithms (which I’ll discuss next). Then, at year Y, the compute required is given by computation needed with 2020 algorithms * improvement factor from algorithmic progress, which (in this report) is a probability distribution. At year Y, the compute available is given by FLOP per dollar (aka compute cost) * money that can be spent, which (in this report) is a point estimate. We can then simply read off the probability that the compute required is less than the compute available.
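A minimal sketch of that final step, computing the probability that the required compute does not exceed the available compute. It assumes, purely for illustration, that the requirement is lognormally distributed; the summary does not specify the distribution's shape, and any median or spread passed in is a placeholder rather than a number from the report.

```python
import math

def prob_requirement_met(median_required_flop: float,
                         sigma_log10: float,
                         available_flop: float) -> float:
    """P(required <= available), assuming log10(required) ~ Normal(log10(median), sigma_log10)."""
    z = (math.log10(available_flop) - math.log10(median_required_flop)) / sigma_log10
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For example, if the available compute equals the median requirement the probability is 0.5, and it rises toward 1 as the available compute moves further above the median.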
Okay, so the last thing we need is a distribution over the amount of computation that would be needed to train a transformative model using 2020 algorithms, which is the main focus of this report. There is a lot of detail here that I’m going to gloss over, especially in talking about the _distribution_ as a whole (whereas I will focus primarily on the median case for simplicity). As I mentioned early on, the key hypothesis is that we will need to train a neural net or other ML model that uses about as much compute as a human brain. So the strategy will be to first translate from “compute of human brain” to “inference compute of neural net”, and then to translate from “inference compute of neural net” to “training compute of neural net”.
How much inference compute would a transformative model use?
We can talk about the rate at which synapses fire in the human brain. How can we convert this to FLOP? The author proposes the following hypothetical: suppose we redo evolutionary history, but in every animal we replace each neuron with N [floating-point units](https://en.wikipedia.org/wiki/Floating-point_unit) that each perform 1 FLOP per second. For what value of N do we still get roughly human-level intelligence over a similar evolutionary timescale? The author then does some calculations about simulating synapses with FLOPs, drawing heavily on the <@recent report on brain computation@>(@How Much Computational Power It Takes to Match the Human Brain@), to estimate that N would be around 1-10,000, which after some more calculations suggests that the human brain is doing the equivalent of 1e13 − 1e16 FLOP per second, with **a median of 1e15 FLOP per second**, and a long tail to the right.
Does this mean we can say that a transformative model will use 1e15 FLOP per second during inference? Such a model would have a clear flaw: even though we are assuming that algorithmic progress reduces compute costs over time, if we did the same analysis in e.g. 1980, we’d get the _same_ estimate for the compute cost of a transformative model, which would imply that there was no algorithmic progress between 1980 and 2020! The problem is that we’d always estimate the brain as using 1e15 FLOP per second (or around there), but for our ML models there is a difference between FLOP per second _using 2020 algorithms_ and FLOP per second _using 1980 algorithms_. So how do we convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”?
One approach is to look at how other machines we have designed compare to the corresponding machines that evolution has designed. An [analysis](https://docs.google.com/document/d/1HUtUBpRbNnnWBxiO2bz3LumEsQcaZioAPZDNcsWPnos/edit) by Paul Christiano concluded that human-designed artifacts tend to be 2-3 orders of magnitude worse than those designed by evolution, when considering energy usage. Presumably a similar analysis done in the past would have resulted in higher numbers and thus wouldn’t fall prey to the problem above. Another approach is to compare existing ML models to animals with a similar amount of computation, and see which one is subjectively “more impressive”. For example, the AlphaStar model uses about as much computation as a bee brain, and large language models use somewhat more; the author finds it reasonable to say that AlphaStar is “about as sophisticated” as a bee, or that <@GPT-3@>(@Language Models are Few-Shot Learners@) is “more sophisticated” than a bee.
We can also look at some abstract considerations. Natural selection had _a lot_ of time to optimize brains, and natural artifacts are usually quite impressive. On the other hand, human designers have the benefit of intelligent design and can copy the patterns that natural selection has come up with. Overall, these considerations roughly balance each other out. Another important consideration is that we’re only predicting what would be needed for a model that was good at most tasks that a human would currently be good at (think a virtual personal assistant), whereas evolution optimized for a whole bunch of other skills that were needed in the ancestral environment. The author subjectively guesses that this should reduce our estimate of compute costs by about an order of magnitude.
Overall, putting all these considerations together, the author intuitively guesses that to convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”, we should add an order of magnitude to the median, and add another two orders of magnitude to the standard deviation to account for our large uncertainty. This results in a median of **1e16 FLOP per second** for the inference-time compute of a transformative model.
Training compute for a transformative model
We might expect a transformative model to run a forward pass **0.1 − 10 times per second** (which on the high end would match human reaction time of 100ms), and for each parameter of the neural net to contribute **1-100 FLOP per forward pass**, which implies that if the inference-time compute is 1e16 FLOP per second then the model should have **1e13 − 1e17 parameters**, with a median of **3e14 parameters**.
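The endpoints of the parameter range follow from dividing inference compute by (forward passes per second × FLOP per parameter per forward pass). The sketch below reproduces only those endpoints; the 3e14 median presumably comes from the author's full distributions and is not derived here. The function name is hypothetical.

```python
def parameter_count(inference_flop_per_second: float,
                    forward_passes_per_second: float,
                    flop_per_param_per_pass: float) -> float:
    """Parameters implied by an inference budget, pass rate, and per-parameter cost."""
    return inference_flop_per_second / (forward_passes_per_second * flop_per_param_per_pass)

# Fewest parameters: fastest pass rate and most FLOP per parameter.
fewest_params = parameter_count(1e16, 10, 100)
# Most parameters: slowest pass rate and least FLOP per parameter.
most_params = parameter_count(1e16, 0.1, 1)
```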
We now need to estimate how much compute it takes to train a transformative model with 3e14 parameters. We assume this is dominated by the number of times you have to run the model during training, or equivalently, the number of data points you train on times the number of times you train on each data point. (In particular, this assumes that the cost of acquiring data is negligible in comparison. The report argues for this assumption; for the sake of brevity I won’t summarize it here.)
For this, we need a relationship between parameters and data points, which we’ll assume will follow a power law KP^α, where P is the number of parameters and K and α are constants. A large number of ML theory results imply that the number of data points needed to reach a specified level of accuracy grows linearly with the number of parameters (i.e. α=1), which we can take as a weak prior. We can then update this with empirical evidence from papers. <@Scaling Laws for Neural Language Models@> suggests that for language models, data requirements scale as α=0.37 or as α=0.74, depending on what measure you look at. Meanwhile, [Deep Learning Scaling is Predictable, Empirically](https://arxiv.org/abs/1712.00409) suggests that α=1.39 for a wide variety of supervised learning problems (including language modeling). However, the former paper studies a more relevant setting: it includes regularization, and asks about the number of data points needed to reach a target accuracy, whereas the latter paper ignores regularization and asks about the minimum number of data points that the model _cannot_ overfit to. So overall the author puts more weight on the former paper and estimates a median of α=0.8, though with substantial uncertainty.
We also need to estimate how many epochs will be needed, i.e. how many times we train on any given data point. The author decides not to explicitly model this factor since it will likely be close to 1, and instead lumps in the uncertainty over the number of epochs with the uncertainty over the constant factor in the scaling law above. We can then look at language model runs to estimate a scaling law for them, for which the median scaling law predicts that we would need 1e13 data points for our 3e14 parameter model.
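To make the power law concrete, the sketch below back-solves the constant K from the two medians quoted in this summary (1e13 data points at 3e14 parameters, with α=0.8). That value of K is an illustrative assumption derived from the summary's numbers, not a constant stated in the report.

```python
ALPHA = 0.8                      # median exponent from the summary
K = 1e13 / (3e14 ** ALPHA)       # back-solved so that 3e14 parameters -> 1e13 data points

def data_points_needed(params: float, k: float = K, alpha: float = ALPHA) -> float:
    """Data points D = K * P**alpha for a model with P parameters."""
    return k * params ** alpha
```

With α below 1, data requirements grow more slowly than parameter count: doubling the parameters multiplies the data needed by 2**0.8 rather than by 2.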
However, this has all been for supervised learning. It seems plausible that a transformative task would have to be trained using RL, where the model acts over a sequence of timesteps, and then receives (non-differentiable) feedback at the end of those timesteps. How would scaling laws apply in this setting? One simple assumption is to say that each rollout over the _effective horizon_ counts as one piece of “meaningful feedback” and so should count as a single data point. Here, the effective horizon is the minimum of the actual horizon and 1/(1-γ), where γ is the discount factor. We assume that the scaling law stays the same; if we instead try to estimate it from recent RL runs, it can change the results by about one order of magnitude.
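The effective horizon defined above is straightforward to compute; a minimal sketch (the function name is hypothetical):

```python
def effective_horizon(actual_horizon: float, discount: float) -> float:
    """Minimum of the actual horizon and 1 / (1 - discount), for 0 <= discount < 1."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount factor must be in [0, 1)")
    return min(actual_horizon, 1.0 / (1.0 - discount))
```

For instance, with a discount factor of 0.99 the discount term is about 100 timesteps, so a 1000-step episode has an effective horizon of roughly 100 while a 50-step episode keeps its horizon of 50.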
So we now know we need to train a 3e14 parameter model with 1e13 data points for a transformative task. This gets us nearly all the way to the compute required with 2020 algorithms: we have a ~3e14 parameter model that takes ~1e16 FLOP per forward pass, that is trained on ~1e13 data points with each data point taking H timesteps, for a total of H * 1e29 FLOP. The author’s distributions are instead centered at H * 1e30 FLOP; I suspect this is simply because the author was computing with distributions whereas I’ve been directly manipulating medians in this summary.
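The median-only arithmetic in this paragraph can be written out directly. This sketch multiplies the summary's medians and therefore reproduces the H * 1e29 figure, not the H * 1e30 center of the author's distributions; the names are hypothetical.

```python
FLOP_PER_TIMESTEP = 1e16   # one forward pass of the ~3e14-parameter model
DATA_POINTS = 1e13         # from the median scaling law

def training_flop(horizon_timesteps: float) -> float:
    """Total training compute when each data point takes `horizon_timesteps` forward passes."""
    return FLOP_PER_TIMESTEP * DATA_POINTS * horizon_timesteps
```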
The last and most uncertain piece of information is the effective horizon of a transformative task. We could imagine something as low as 1 subjective second (for something like language modeling), or something as high as 1e9 subjective seconds (i.e. 32 subjective years), if we were to redo evolution, or train on a task like “do effective scientific R&D”. The author splits this up into short, medium and long horizon neural net paths (corresponding to horizons of 1e0-1e3, 1e3-1e6, and 1e6-1e9 respectively), and invites readers to place their own weights on each of the possible paths.
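As a quick check on the units, the sketch below converts the upper end of the range to years and sorts a horizon into the three buckets. Which bucket the exact boundary values (1e3 and 1e6) fall into is my assumption, since the summary gives only the ranges.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def horizon_bucket(horizon_seconds: float) -> str:
    """Classify an effective horizon (in subjective seconds) as short, medium, or long."""
    if not 1.0 <= horizon_seconds <= 1e9:
        raise ValueError("horizon outside the 1e0-1e9 second range considered here")
    if horizon_seconds <= 1e3:
        return "short"
    if horizon_seconds <= 1e6:
        return "medium"
    return "long"

longest_horizon_years = 1e9 / SECONDS_PER_YEAR
```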
There are many important considerations here: for example, if you think that the dominating cost will be generative modeling (GPT-3 style, but maybe also for images, video etc), then you would place more weight on short horizons. Conversely, if you think the hard challenge is to gain meta learning abilities, and that we probably need “data points” comparable to the time between generations in human evolution, then you would place more weight on longer horizons.
Adding three more potential anchors
We can now combine all these ingredients to get a forecast for when compute will be available to develop a transformative model! But not yet: we’ll first add a few more possible “anchors” for the amount of computation needed for a transformative model. (All of the modeling so far has “anchored” the _inference time computation of a transformative model_ to the _inference time computation of the human brain_.)
First, we can anchor _parameter count of a transformative model_ to the _parameter count of the human genome_, which has far fewer “parameters” than the human brain. Specifically, we assume that all the scaling laws remain the same, but that a transformative model will only require 7.5e8 parameters (the amount of information in the human genome) rather than our previous estimate of ~1e15 parameters. This drastically reduces the amount of computation required, though it is still slightly above that of the short-horizon neural net, because the author assumed that the horizon for this path was somewhere between 1 and 32 years.
Second, we can anchor _training compute for a transformative model_ to the _compute used by the human brain over a lifetime_. As you might imagine, this leads to a much smaller estimate: the brain uses ~1e24 FLOP over 32 years of life, which is only 10x the amount used for AlphaStar, and even after adjusting upwards to account for man-made artifacts being worse than those made by evolution, the resulting model predicts a significant probability that we would already have been able to build a transformative model.
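The lifetime figure follows from numbers given earlier in this summary: the median brain estimate of 1e15 FLOP per second, run for about 1e9 seconds (roughly 32 years). A short sketch of that arithmetic, with hypothetical names:

```python
BRAIN_FLOP_PER_SECOND = 1e15   # median brain estimate from the summary
LIFETIME_SECONDS = 1e9         # roughly 32 years
ALPHASTAR_FLOP = 1e23          # final AlphaStar training run, per the summary

lifetime_flop = BRAIN_FLOP_PER_SECOND * LIFETIME_SECONDS
ratio_to_alphastar = lifetime_flop / ALPHASTAR_FLOP
```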
Finally, we can anchor _training compute for a transformative model_ to the _compute used by all animal brains over the course of evolution_. The basic assumption here is that our optimization algorithms and architectures are not much better than simply “redoing” natural selection from a very primitive starting point. This leads to an estimate of ~1e41 FLOP to train a transformative model, which is more than the long horizon neural net path (though not hugely more).
Putting it all together
So we now have six different paths: the three neural net anchors (short, medium and long horizon), the genome anchor, the lifetime anchor, and the evolution anchor. We can now assign weights to each of these paths, where each weight can be interpreted as the probability that that path is the _cheapest_ way to get a transformative model, as well as a final weight that describes the chance that none of the paths work out.
The long horizon neural net path can be thought of as a conservative “default” view: it could work out simply by training directly on examples of a long horizon task where each data point takes around a subjective year to generate. However, there are several reasons to think that researchers will be able to do better than this. As a result, the author assigns 20% to the short horizon neural net, 30% to the medium horizon neural net, and 15% to the long horizon neural net.
The lifetime anchor would suggest that we either already could get TAI, or are very close, which seems very unlikely given the lack of major economic applications of neural nets so far, and so gets assigned only 5%. The genome path gets 10%, the evolution anchor gets 10%, and the remaining 10% is assigned to none of the paths working out.
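Collecting the author's weights in one place makes it easy to confirm they form a probability distribution. The dictionary keys below are labels I have chosen for illustration.

```python
# Author's weights on each path being the cheapest route to a transformative model.
path_weights = {
    "short_horizon_nn": 0.20,
    "medium_horizon_nn": 0.30,
    "long_horizon_nn": 0.15,
    "lifetime_anchor": 0.05,
    "genome_anchor": 0.10,
    "evolution_anchor": 0.10,
    "none_of_the_above": 0.10,
}

total_weight = sum(path_weights.values())
neural_net_weight = sum(w for name, w in path_weights.items() if name.endswith("_nn"))
```

The weights sum to 1, with 65% of the mass on the three neural net paths.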
This predicts a **median of 2052** for the year in which some actor would be willing and able to train a single transformative model, with the full graphs shown below:
<Graphs removed since they are in flux and easy to share in a low-bandwidth way>
How does this relate to TAI?
Note that what we’ve modeled so far is the probability that by year Y we will have enough compute for the final training run of a transformative model. This is not the same thing as the probability of developing TAI. There are several reasons that TAI could be developed _later_ than the given prediction:
1. Compute isn’t the only input required: we also need data, environments, human feedback, etc. While the author expects that these will not be the bottleneck, this is far from a certainty.
2. When thinking about any particular path and making it more concrete, a host of problems tend to show up that will need to be solved and may add extra time. Some examples include robustness, reliability, possible breakdown of the scaling laws, the need to generate lots of different kinds of data, etc.
3. AI research could stall, whether because of regulation, a global catastrophe, an AI winter, or something else.
However, there are also compelling reasons to expect TAI to arrive _earlier_:
1. We may develop TAI through some other cheaper route, such as a <@services model@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@).
2. Our forecasts apply to a “balanced” model that has a similar profile of abilities as a human. In practice, it will likely be easier and cheaper to build an “unbalanced” model that is superhuman in some domains and subhuman in others, that is nonetheless transformative.
3. The curves for several factors assume some maximum after which progress is not possible; in reality it is more likely that progress slows to some lower but non-zero growth rate.
In the near future, it seems likely that it would be harder to find cheaper routes (since there is less time to do the research), so we should probably treat the probabilities for near-term years as overestimates; for similar reasons, the probabilities for later years should be treated as underestimates.
For the median of 2052, the author guesses that these considerations roughly cancel out, and so rounds the median for development of TAI to **2050**. A sensitivity analysis concludes that 2040 is the “most aggressive plausible median”, while the “most conservative plausible median” is 2080.
Planned opinion:
I really liked this report: it’s extremely thorough and anticipates and responds to a large number of potential reactions. I’ve made my own timelines estimate using the provided spreadsheet, and have adopted the resulting graph (with a few modifications) as my TAI timeline (which ends up with a median of ~2055). This is saying quite a lot: it’s pretty rare that a quantitative model is compelling enough that I’m inclined to only slightly edit its output, as opposed to simply using the quantitative model to inform my intuitions.
Here are the main ways in which my model is different from the one in the report:
1. Ignoring the genome anchor
I ignore the genome anchor because I don’t buy the model: even if researchers did create a very parameter-efficient model class (which seems unlikely), I would not expect the same scaling laws to apply to that model class. The report mentions that you could also interpret the genome anchor as simply providing a constraint on how many data points are needed to train long-horizon behaviors (since that’s what evolution was optimizing), but I prefer to take this as (fairly weak) evidence that informs what weights to place on short vs. medium vs. long horizons for neural nets.

2. Placing more weight on short and medium horizons relative to long horizons
I place 30% on short horizons, 40% on medium horizons, and 10% on long horizons. The report already names several reasons why we might expect the long horizon assumption to be too conservative. I agree with all of those, and have one more of my own:
If meta-learning turns out to require a huge amount of compute, we can instead directly train on some transformative task with a lower horizon. Even some of the hardest tasks like scientific R&D shouldn’t have a huge horizon: even if we assume that it takes human scientists a year to produce the equivalent of a single data point, at 40 hours a week that comes out to a horizon of 2000 subjective hours, or 7e6 seconds. This is near the beginning of the long horizon realm of 1e6-1e9 seconds and seems like a very conservative overestimate to me.
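The horizon conversion works out as follows, assuming about 50 working weeks in the year (my assumption, chosen to match the 2000-hour figure):

```python
HOURS_PER_WEEK = 40
WORKING_WEEKS_PER_YEAR = 50    # assumption that yields the 2000-hour figure

horizon_hours = HOURS_PER_WEEK * WORKING_WEEKS_PER_YEAR
horizon_seconds = horizon_hours * 3600
```

That gives 7.2e6 seconds, which sits just inside the long horizon range of 1e6-1e9 seconds.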
(Note that in practice I’d guess we will train something like a meta-learner, because I suspect the skill of meta-learning will not require such large average effective horizons.)
3. Reduced willingness to spend
My willingness to spend forecasts are somewhat lower: the predictions and reasoning in this report feel closer to upper bounds on how much people might spend rather than predictions of how much they will spend. Assuming we reduce the ratio of all-in project costs to final training run costs to 10x, spending $1B on a training run by 2025 would imply all-in project costs of $10B, which is ~40% of Google’s yearly R&D budget of $26B, or 10% of the budget for a 4-year project. Possibly this wouldn’t be classified as R&D, but it would also be _2% of all expenditures over 4 years_. This feels remarkably high to me for something that’s supposed to happen within 5 years; while I wouldn’t rule it out, it wouldn’t be my median prediction.
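The budget shares quoted above can be reproduced from the figures in the paragraph; a small sketch with hypothetical variable names:

```python
TRAINING_RUN_USD = 1e9          # forecast final training run cost in 2025
ALL_IN_RATIO = 10               # assumed all-in cost relative to the final run
GOOGLE_YEARLY_RND_USD = 26e9    # yearly R&D budget cited in the paragraph

all_in_cost = TRAINING_RUN_USD * ALL_IN_RATIO
share_of_one_year = all_in_cost / GOOGLE_YEARLY_RND_USD
share_of_four_years = all_in_cost / (4 * GOOGLE_YEARLY_RND_USD)
```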
4. Accounting for challenges
While the report does talk about challenges in e.g. getting the right data and environments by the right time, I think there are a bunch of other challenges as well: for example, you need to ensure that your model is aligned, robust, and reliable (at least if you want to deploy it and get economic value from it). I do expect that these challenges will be easier than they are today, partly because more research will have been done and partly because the models themselves will be more capable.
Another example of a challenge would be PR concerns: it seems very plausible to me that there will be a backlash against transformative AI systems, that results in those systems being deployed later than we’d expect them to be according to this model.
To be more concrete, if we ignore points 1-3 and assume this is my only disagreement, then for the median of 2052, rather than assuming that reasons for optimism and pessimism approximately cancel out to yield 2050 as the median for TAI, I’d be inclined to shade upwards to 2055 or 2060 as my median for TAI.

Alignment Newsletter One Year Retrospective

Rohin Shah 10 Apr 2019 6:58 UTC

94 points

31 comments, 21 min read, LW link

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah 6 Jan 2023 15:48 UTC

86 points

21 comments, 8 min read, LW link

Shah and Yudkowsky on alignment failures

Rohin Shah and Eliezer Yudkowsky

28 Feb 2022 19:18 UTC

85 points

39 comments, 91 min read, LW link, 1 review

Preface to the sequence on value learning

Rohin Shah 30 Oct 2018 22:04 UTC

70 points

6 comments, 3 min read, LW link

Alignment Newsletter #13: 07/02/18

Rohin Shah 2 Jul 2018 16:10 UTC

70 points

12 comments, 8 min read, LW link

(mailchi.mp)

Rohin Shah 13 May 2022 7:44 UTC
68 points
in reply to: spell_chekist’s comment on: Deepmind’s Gato: Generalist Agent
Not really? On timelines, I haven’t looked through the results so maybe they’re more surprising than they look on a brief skim, but “you can do multitask learning with a single network” feels totally unsurprising given past results. Like, if nothing else the network could allocate 10% of itself to each domain; 100M parameters are more than enough to show good performance in these domains (robotics often uses far fewer parameters iirc). But also I would have expected some transfer between tasks so that you’d do better than that would naively predict. I’ve seen this before: iirc there was a result (from Pieter Abbeel’s lab? EDIT: this one EDIT 2: see caveats on this paper, though it doesn’t affect my point) a couple of years ago that showed that pretraining a model on language would lead to improved sample efficiency in some nominally-totally-unrelated RL task, or something like that. ~~Unfortunately I can’t find it on a quick Google now (and it’s possible it never made it into a paper and I heard it via word of mouth).~~
Having not read the detailed results yet, I would be quite surprised if it performed better on language-only tasks than a pretrained language model of the same size; I’d be a little surprised if it performed better on robotics / RL tasks than a specialized model of the same size given the same amount of robotics data.
In general, from a “timelines to risky systems” perspective, I’m not that interested in these sorts of “generic agents” that can do all the things with one neural net; it seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks and so that’s what I expect to see.
On pessimism, I’m not sure why I should update in any direction on this result, even if I thought this was surprisingly fast progress which I don’t. I guess shorter timelines would increase pessimism just by us having less time to prepare, but I don’t see any other good reason for increased pessimism.

FAQ: Advice for AI Alignment Researchers

Rohin Shah 26 Apr 2021 18:59 UTC

67 points

2 comments, 1 min read, LW link

(rohinshah.com)

AI safety without goal-directed behavior

Rohin Shah 7 Jan 2019 7:48 UTC

66 points

15 comments, 4 min read, LW link

Rohin Shah 16 Nov 2018 21:20 UTC
LW: 65 AF: 24
AF
on: Clarifying “AI Alignment”
Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
- It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
- Humans seem to solve the motivation subproblem, whereas humans don’t seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can’t really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
- It is easier to apply to systems without much capability, though as the post notes, it probably still does need to have some level of capability. While a digit recognition system is useful, it doesn’t seem meaningful to talk about whether it is “trying” to help us.
- Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart’s Law suggests that you can get very bad outcomes. With motivation-competence, I’ve already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what “slightly worse” means for motivation, but I’m optimistic.
- We’ve been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we’ve found a lot of negative results and not very many positive ones.
- The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).
The cons are:
- It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
- There hasn’t been much work done in this paradigm, so it is not obvious that there is progress to make.
- I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future. ETA: Written up in the chapter on Goals vs Utility Functions in the Value Learning sequence, particularly in Coherence arguments do not imply goal-directed behavior.
Rohin Shah 27 Nov 2022 11:54 UTC
LW: 59 AF: 38
AF
on: Don’t align agents to evaluations of plans
We’re building intelligent AI systems that help us do stuff. Regardless of how the AI’s internal cognition works, it seems clear that the plans / actions it enacts have to be extremely strongly selected. With alignment, we’re trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for “good outcomes” rather than “something else”.
In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is “human(s) are a source of information about what is good, and this information influences what the AI’s plans are selected for”. (There are some cases based on moral realism.)
This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, and those values then influence which plans value-child ends up enacting.
All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.
I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is “good”, and executes the highest-scoring plan. In such cases, you can more precisely restate “plans are strongly selected for some slightly-different thing” to “the agent executes plans that cause upwards-errors in the prediction of what is good”.
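To make the “upwards-errors” restatement concrete, here is a minimal toy simulation. It is only a sketch under assumptions that are not in the post: each plan has a true value drawn from a standard Gaussian, the evaluator's prediction is the true value plus independent Gaussian noise, and the agent executes the plan with the highest prediction. The function names are hypothetical.

```python
import random

def select_best_plan(n_plans, noise_sd, rng):
    """Return (true_value, predicted_value) for the highest-predicted plan."""
    best_true, best_pred = None, float("-inf")
    for _ in range(n_plans):
        true_value = rng.gauss(0.0, 1.0)                  # assumed distribution of plan quality
        predicted = true_value + rng.gauss(0.0, noise_sd)  # assumed evaluator error model
        if predicted > best_pred:
            best_true, best_pred = true_value, predicted
    return best_true, best_pred

def mean_upward_error(n_plans, noise_sd=1.0, trials=500, seed=0):
    """Average (prediction - true value) of the executed plan over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_value, predicted = select_best_plan(n_plans, noise_sd, rng)
        total += predicted - true_value
    return total / trials

single = mean_upward_error(n_plans=1)    # no selection at all
few = mean_upward_error(n_plans=3)       # weak selection
many = mean_upward_error(n_plans=300)    # strong selection
print(single, few, many)
```

In this toy model the evaluator is unbiased on any individual plan, yet the executed plan is overrated on average, and the overrating grows with the number of plans searched. That is the sense in which stronger selection concentrates on the evaluator's upward errors; it says nothing about how large the effect is for real systems.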
It’s an important argument! If you want to have an accurate picture of how likely such plans are to work, you really need to consider this point!
The part where I disagree is where the post goes on to say “and so we shouldn’t do this”. My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?
I’d assume that the idea is that you produce AI systems that are more like “value-child”. Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?
So far, this is a bit unfair to the post(s). It does have some additional arguments, which I’m going to rewrite in totally different language which I might be getting horribly wrong:
An AI system with a “direct (object-level) goal” is better than one with “indirect goals”. Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. “make diamonds”) encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. “Alice’s approval”). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.
I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I’d still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I’d be thinking about “how do I notice if my agent grew a shard that was subtly different from what I wanted” and I’d think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it’s pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.
Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don’t think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):
1. Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we’d be back to our original problem.
2. Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be “what is this safe subset, and why do you expect plans to be selected from it?” That’s not to say it’s impossible, just that I don’t see the argument for it.
3. Plans are selected based on values. In other words, we’ve instilled values into the AI system, and the plans are selected for those values. I’d critique this the same way as above, i.e. it’s really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.
4. Plans aren’t selected strongly. You could say that the 2-3 plans aren’t strongly selected for anything, so they aren’t likely to run into these issues. I think this is assuming that your AI system isn’t very capable; this sounds like the route of “don’t build powerful AI” (which is a plausible route).
In summary:
1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
Rohin Shah 21 Jun 2020 20:03 UTC
LW: 59 AF: 28
AF
on: The ground of optimization
This is excellent, it feels way better as a definition of optimization than past attempts :) Thanks in particular for the academic style, specifically relating it to previous work, it made it much more accessible for me.
Let’s try to build up some core AI alignment arguments with this definition.
Task: A task is simply an “environment” along with a target configuration set. Whenever I talk about a “task” below, assume that I mean an “interesting” task, i.e. something like “build a chair”, as opposed to “have the air molecules be in one of these particular configurations”.
Solving a task: An object O solves a task T if adding O to T’s environment transforms it into an optimizing system for T’s target configuration set.
Performance on the task: If O solves task T, its performance is quantified by how quickly it reaches the target configuration set, and how robust it is to perturbations.
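As a rough illustration of the last three definitions, here is a toy sketch. Everything in it is an assumption made for illustration: the environment is a single number, the target configuration set is the interval within `target_radius` of zero, and the "object" is a hypothetical controller that halves the distance to zero each step.

```python
def steps_to_target(controller, start, target_radius=0.01,
                    perturb_at=None, perturb_by=0.0, max_steps=1000):
    """Number of steps until the state enters the target set, or None if it never does."""
    x = start
    for step in range(max_steps):
        if abs(x) <= target_radius:
            return step
        if step == perturb_at:
            x += perturb_by        # an external perturbation to the environment
        x = controller(x)          # the object O acts on the environment
    return None

def no_object(x):
    """The environment alone: the state does not move."""
    return x

def halving_controller(x):
    """Hypothetical object O: moves the state halfway to the target each step."""
    return 0.5 * x

baseline = steps_to_target(no_object, start=1.0)
unperturbed = steps_to_target(halving_controller, start=1.0)
perturbed = steps_to_target(halving_controller, start=1.0,
                            perturb_at=3, perturb_by=5.0)
print(baseline, unperturbed, perturbed)
```

Without the controller the state never reaches the target set, so the environment alone is not an optimizing system for it. With the controller it reaches the target, and still does so after being knocked away, just more slowly. The step counts stand in for "how quickly", and surviving the perturbation stands in for "how robust".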
Generality of intelligence: The generality of O’s intelligence is a function of the number and diversity of tasks T that it can solve, as well as its performance on those tasks.
Optimizing AI: A computer program for which there exists an interesting task such that the computer program solves that task.
This isn’t exactly right, as it includes e.g. accounting programs or video games, which when paired with a human form an optimizing system for correct financials and winning the game, respectively. You might be able to fix this by saying that the optimizing system has to be robust to perturbations in any human behavior in the environment.
AGI: An optimizing AI whose generality of intelligence is at least as great as that of humans.
Argument for AI risk: As optimizing AIs become more and more general, we will apply them to more economically useful tasks T. However, they also become more and more robust to perturbations, possibly including perturbations such as “we try to turn off the AI”. As a result, we might eventually have AIs that form strong optimizing systems for some task T that isn’t the one we actually wanted, which tends to be bad due to fragility of value.
Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.
Argument for mesa optimization: Due to the complexity and noise in the real world, most economically useful tasks require setting up a robust optimizing system, rather than directly creating the target configuration state. (See also the importance of feedback for more on this intuition.) It seems likely that humans will find it easier to create algorithms that then find AGIs that can create these robust optimizing systems, rather than creating an algorithm that is directly an AGI.
(The previous argument also applies: this is basically just a generalization of the previous point to arbitrary AI systems, instead of only deep learning.)
I want to note that under this approach the notions of “search” and “mesa objective” are less natural, which I see as a pro of this approach (see also here): the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).
Outer alignment: ??? Seems hard to formalize in this framework. This makes me feel like outer alignment is less important as a concept. (I also don’t particularly like formalizations outside of this framework.)
Inner alignment: Ensuring that (conditional on mesa optimization occurring) the inner AGI is aligned with the operator / user, that is, combined with the user it forms an optimizing system for “doing what the user wants”. (Note that this is explicitly not intent alignment, as it is hard to formalize intent alignment in this framework.)
Intent alignment: ??? As mentioned above, it’s hard to formalize in this framework, as intent alignment really does require some notion of “motivation”, “goals”, or “trying”, which this framework explicitly leaves out. I see this as a con of this framework.
Expected utility maximization: One particular architecture that could qualify as an AGI (if the utility function is treated as part of the environment, and not part of the AGI). I see the fact that EU maximization is no longer highlighted as a pro of this approach.
Wireheading: Special case of the argument for AI risk with a weird task of “maximize the number in this register”. Unnatural in this framing of the AI risk problem. I see this as a pro of this framing of the problem, though I expect people disagree with me on this point.