Rohin Shah
Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Not really? On timelines, I haven’t looked through the results so maybe they’re more surprising than they look on a brief skim, but “you can do multitask learning with a single network” feels totally unsurprising given past results. Like, if nothing else the network could allocate 10% of itself to each domain; 100M parameters are more than enough to show good performance in these domains (robotics often uses far fewer parameters iirc). But also I would have expected some transfer between tasks so that you’d do better than that would naively predict. I’ve seen this before—iirc there was a result (from Pieter Abbeel’s lab? EDIT: this one EDIT 2: see caveats on this paper, though it doesn’t affect my point) a couple of years ago that showed that pretraining a model on language would lead to improved sample efficiency in some nominally-totally-unrelated RL task, or something like that.
Unfortunately I can’t find it on a quick Google now (and it’s possible it never made it into a paper and I heard it via word of mouth).
Having not read the detailed results yet, I would be quite surprised if it performed better on language-only tasks than a pretrained language model of the same size; I’d be a little surprised if it performed better on robotics / RL tasks than a specialized model of the same size given the same amount of robotics data.
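To make the capacity argument above concrete: a single network can always fall back on something like the explicit construction below, a shared trunk with per-task heads. This is a sketch of the intuition only, not Gato's actual architecture (which is a single tokenized transformer), and all the sizes are made up:

```python
import torch
import torch.nn as nn

# "Allocate a slice of the network to each domain", made explicit: a shared
# trunk plus one small head per task. A single undifferentiated network can
# do at least as well as this by learning an equivalent partition implicitly;
# any transfer between tasks is improvement beyond this baseline.
class MultiTaskNet(nn.Module):
    def __init__(self, n_tasks=10, d_in=64, d_hidden=512, d_out=16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(d_hidden, d_out) for _ in range(n_tasks)]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))

net = MultiTaskNet()
out = net(torch.randn(8, 64), task_id=3)  # a batch of inputs for task 3
```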
In general, from a “timelines to risky systems” perspective, I’m not that interested in these sorts of “generic agents” that can do all the things with one neural net; it seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks, and so that’s what I expect to see.
On pessimism, I’m not sure why I should update in any direction on this result, even if I thought this was surprisingly fast progress, which I don’t. I guess shorter timelines would increase pessimism just by us having less time to prepare, but I don’t see any other good reason for increased pessimism.
Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
It isolates the major, urgent difficulty in a single subproblem. If we make an AI system that tries to do what we want, it could certainly make mistakes, but it seems much less likely to cause e.g. human extinction. (Though it is certainly possible, for example by building an unaligned successor AI system, as mentioned in the post.) In contrast, with the definition-optimization decomposition, we need to solve both specification problems with the definition and robustness problems with the optimization.
Humans seem to solve the motivation subproblem, whereas humans don’t seem to solve either the definition or the optimization subproblems. I can definitely imagine a human legitimately trying to help me, whereas I can’t really imagine a human knowing how to derive optimal behavior for my goals, nor can I imagine a human that can actually perform the optimal behavior to achieve some arbitrary goal.
It is easier to apply to systems without much capability, though as the post notes, it probably still does need to have some level of capability. While a digit recognition system is useful, it doesn’t seem meaningful to talk about whether it is “trying” to help us.
Relatedly, the safety guarantees seem to degrade more slowly and smoothly. With definition-optimization, if you get the definition even slightly wrong, Goodhart’s Law suggests that you can get very bad outcomes. With motivation-competence, I’ve already argued that incompetence probably leads to small problems, not big ones, and slightly worse motivation might not make a huge difference because of something analogous to the basin of attraction around corrigibility. This depends a lot on what “slightly worse” means for motivation, but I’m optimistic.
We’ve been working with the definition-optimization decomposition for quite some time now by modeling AI systems as expected utility maximizers, and we’ve found a lot of negative results and not very many positive ones.
The motivation-competence decomposition accommodates interaction between the AI system and humans, which definition-optimization does not allow (or at least, it makes it awkward to include such interaction).
The cons are:
It is imprecise and informal, whereas we can use the formalism of expected utility maximizers for the definition-optimization decomposition.
There hasn’t been much work done in this paradigm, so it is not obvious that there is progress to make.
I suspect many researchers would argue that any sufficiently intelligent system will be well-modeled as an expected utility maximizer and will have goals and preferences it is optimizing for, and as a result we need to deal with the problems of expected utility maximizers anyway. Personally, I do not find this argument compelling, and hope to write about why in the near future. ETA: Written up in the chapter on Goals vs Utility Functions in the Value Learning sequence, particularly in Coherence arguments do not imply goal-directed behavior.
We’re building intelligent AI systems that help us do stuff. Regardless of how the AI’s internal cognition works, it seems clear that the plans / actions it enacts have to be extremely strongly selected. With alignment, we’re trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for “good outcomes” rather than “something else”.
In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is “human(s) are a source of information about what is good, and this information influences what the AI’s plans are selected for”. (There are some cases based on moral realism.)
This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, and those values then influence which plans value-child ends up enacting.
All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.
I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is “good”, and executes the highest-scoring plan. In such cases, you can more precisely restate “plans are strongly selected for some slightly-different thing” to “the agent executes plans that cause upwards-errors in the prediction of what is good”.
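This selection effect is easy to see in a toy simulation (my own sketch, not from the post; all numbers are made up):

```python
import numpy as np

# True plan values are all mediocre; the grader sees them plus independent
# noise; argmax then systematically picks plans whose evaluation error is
# large and positive. This is the "upwards-errors" failure mode.
rng = np.random.default_rng(0)
n_plans = 100_000
true_value = rng.normal(0.0, 1.0, n_plans)                # how good each plan really is
evaluation = true_value + rng.normal(0.0, 1.0, n_plans)   # imperfect prediction of "good"

best = np.argmax(evaluation)
print("predicted:", evaluation[best], "actual:", true_value[best])
# The chosen plan's predicted value far exceeds its actual value, and
# searching over more plans widens the gap rather than shrinking it.
```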
It’s an important argument! If you want to have an accurate picture of how likely such plans are to work, you really need to consider this point!
The part where I disagree is where the post goes on to say “and so we shouldn’t do this”. My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?
I’d assume that the idea is that you produce AI systems that are more like “value-child”. Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?
So far, this is a bit unfair to the post(s). It does have some additional arguments, which I’m going to rewrite in totally different language which I might be getting horribly wrong:
An AI system with a “direct (object-level) goal” is better than one with “indirect goals”. Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. “make diamonds”) encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. “Alice’s approval”). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.
I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I’d still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I’d be thinking about “how do I notice if my agent grew a shard that was subtly different from what I wanted” and I’d think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it’s pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.
Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don’t think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):
Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we’d be back to our original problem.
Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be “what is this safe subset, and why do you expect plans to be selected from it?” That’s not to say it’s impossible, just that I don’t see the argument for it.
Plans are selected based on values. In other words we’ve instilled values into the AI system, the plans are selected for those values. I’d critique this the same way as above, i.e. it’s really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.
Plans aren’t selected strongly. You could say that the 2-3 plans aren’t strongly selected for anything, so they aren’t likely to run into these issues. I think this is assuming that your AI system isn’t very capable; this sounds like the route of “don’t build powerful AI” (which is a plausible route).
In summary:
1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
This is excellent, it feels way better as a definition of optimization than past attempts :) Thanks in particular for the academic style, specifically relating it to previous work, it made it much more accessible for me.
Let’s try to build up some core AI alignment arguments with this definition.
Task: A task is simply an “environment” along with a target configuration set. Whenever I talk about a “task” below, assume that I mean an “interesting” task, i.e. something like “build a chair”, as opposed to “have the air molecules be in one of these particular configurations”.
Solving a task: An object O solves a task T if adding O to T’s environment transforms it into an optimizing system for T’s target configuration set.
Performance on the task: If O solves task T, its performance is quantified by how quickly it reaches the target configuration set, and how robust it is to perturbations.
Generality of intelligence: The generality of O’s intelligence is a function of the number and diversity of tasks T that it can solve, as well as its performance on those tasks.
Optimizing AI: A computer program for which there exists an interesting task such that the computer program solves that task.
This isn’t exactly right, as it includes e.g. accounting programs or video games, which when paired with a human form an optimizing system for correct financials and winning the game, respectively. You might be able to fix this by saying that the optimizing system has to be robust to perturbations in any human behavior in the environment.
AGI: An optimizing AI whose generality of intelligence is at least as great as that of humans.
Argument for AI risk: As optimizing AIs become more and more general, we will apply them to more economically useful tasks T. However, they also become more and more robust to perturbations, possibly including perturbations such as “we try to turn off the AI”. As a result, we might eventually have AIs that form strong optimizing systems for some task T that isn’t the one we actually wanted, which tends to be bad due to fragility of value.
Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.
Argument for mesa optimization: Due to the complexity and noise in the real world, most economically useful tasks require setting up a robust optimizing system, rather than directly creating the target configuration state. (See also the importance of feedback for more on this intuition.) It seems likely that humans will find it easier to create algorithms that then find AGIs that can create these robust optimizing systems, rather than creating an algorithm that is directly an AGI.
(The previous argument also applies: this is basically just a generalization of the previous point to arbitrary AI systems, instead of only deep learning.)
I want to note that under this approach the notions of “search” and “mesa objective” are less natural, which I see as a pro of this approach (see also here): the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).
Outer alignment: ??? Seems hard to formalize in this framework. This makes me feel like outer alignment is less important as a concept. (I also don’t particularly like formalizations outside of this framework.)
Inner alignment: Ensuring that (conditional on mesa optimization occurring) the inner AGI is aligned with the operator / user, that is, combined with the user it forms an optimizing system for “doing what the user wants”. (Note that this is explicitly not intent alignment, as it is hard to formalize intent alignment in this framework.)
Intent alignment: ??? As mentioned above, it’s hard to formalize in this framework, as intent alignment really does require some notion of “motivation”, “goals”, or “trying”, which this framework explicitly leaves out. I see this as a con of this framework.
Expected utility maximization: One particular architecture that could qualify as an AGI (if the utility function is treated as part of the environment, and not part of the AGI). I see the fact that EU maximization is no longer highlighted as a pro of this approach.
Wireheading: Special case of the argument for AI risk with a weird task of “maximize the number in this register”. Unnatural in this framing of the AI risk problem. I see this as a pro of this framing of the problem, though I expect people disagree with me on this point.
Is it accurate to summarize the headline result as follows?
Train a Transformer to predict next tokens on a distribution generated from an HMM.
One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in, and perform Bayesian updating on each new token. That is, it maintains P(hidden state | tokens seen so far).
Key result: A linear probe on the residual stream is able to reconstruct P(hidden state | tokens seen so far).
(I don’t know what Computational Mechanics or MSPs are so this could be totally off.)
EDIT: Looks like yes. From this post:
Part of what this all illustrates is that the fractal shape is kinda… baked into any Bayesian-ish system tracking the hidden state of the Markov model. So in some sense, it’s not very surprising to find it linearly embedded in activations of a residual stream; all that really means is that the probabilities for each hidden state are linearly represented in the residual stream.
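For concreteness, here is a minimal sketch of the Bayesian filtering in question, with toy transition and emission matrices of my own invention rather than the processes the paper actually uses:

```python
import numpy as np

T = np.array([[0.8, 0.1, 0.1],     # P(next state | current state)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.7, 0.2, 0.1],     # P(token | state)
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])

def belief_update(belief, token):
    """One filtering step: condition on the emitted token, then propagate
    the belief through the transition matrix."""
    posterior = belief * E[:, token]   # unnormalized P(state | history)
    posterior /= posterior.sum()
    return posterior @ T               # belief over the next hidden state

belief = np.ones(3) / 3               # uniform prior over the three states
for token in [0, 2, 1, 1]:
    belief = belief_update(belief, token)
print(belief, "-> next-token distribution:", belief @ E)
```

The claimed result is then that this belief vector, as it evolves over a sequence, can be read off the residual stream with a linear probe.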
<unfair rant with the goal of shaking people out of a mindset>
To all of you telling me or expecting me to update to shorter timelines given <new AI result>: have you ever encountered Bayesianism?
Surely if you did, you’d immediately reason that you couldn’t know how I would update, without first knowing what I expected to see in advance. Which you very clearly don’t know. How on earth could you know which way I should update upon observing this new evidence? In fact, why do you even care about which direction I update? That too shouldn’t give you much evidence if you don’t know what I expected in the first place.
Maybe I should feel insulted? That you think so poorly of my reasoning ability that I should be updating towards shorter timelines every time some new advance in AI comes out, as though I hadn’t already priced that into my timeline estimates, and so would predictably update towards shorter timelines in violation of conservation of expected evidence? But that only follows if I expect you to be a good reasoner modeling me as a bad reasoner, which probably isn’t what’s going on.
</unfair rant>
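(For what it’s worth, the conservation-of-expected-evidence point is easy to check numerically; a toy sketch with made-up probabilities:)

```python
# Before seeing the result, my expected posterior equals my prior, so I
# shouldn't predictably update in a known direction. All numbers invented.
prior = 0.3                  # P(hypothesis), e.g. "short timelines"
p_obs_h = 0.9                # P(impressive result | hypothesis)
p_obs_not_h = 0.6            # P(impressive result | not hypothesis)

p_obs = prior * p_obs_h + (1 - prior) * p_obs_not_h
post_if_obs = prior * p_obs_h / p_obs
post_if_not = prior * (1 - p_obs_h) / (1 - p_obs)

expected_posterior = p_obs * post_if_obs + (1 - p_obs) * post_if_not
print(expected_posterior)    # 0.3, exactly the prior
```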
My actual guess is that people notice a discrepancy between their very-short timelines and my somewhat-short timelines, and then they want to figure out what causes this discrepancy, and an easily-available question is “why doesn’t X imply short timelines” and then for some reason that I still don’t understand they instead substitute the much worse question of “why didn’t you update towards short timelines on X” without noticing its major flaws.
Fwiw, I was extremely surprised by OpenAI Five working with just vanilla PPO (with reward shaping and domain randomization), rather than requiring any advances in hierarchical RL. I made one massive update then (in the sense that I immediately started searching for a new model that explained that result; it did take over a year to get to a model I actually liked). I also basically adopted the bio anchors timelines when that report was released (primarily because it agreed with my model, elaborated on it, and then actually calculated out its consequences, which I had never done because it’s actually quite a lot of work). Apart from those two instances I don’t think I’ve had major timeline updates.
Not that you said otherwise, but just to be clear: it is not the case that most capabilities researchers at DeepMind or OpenAI have similar beliefs as people at EleutherAI (that alignment is very important to work on). I would not expect it to go well if you said “it seems like you guys are speeding up the deaths of everyone on the planet” at DeepMind.
Obviously there are other possible strategies; I don’t mean to say that nothing like this could ever work.
(Responding to entire comment thread) Rob, I don’t think you’re modeling what MIRI looks like from the outside very well.
There’s a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
There was a non-public agenda that involved Haskell programmers. That’s about all I know about it. For all I know they were doing something similar to the modal logic work I’ve seen in the past.
Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with potentially the exception of Chris Olah. Chris’s work is not central to the cluster I would call “experimentalist”.
There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn’t the main point of that paper and was an extremely predictable result.
There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/or doing.
There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of “well, let’s at least die in a slightly more dignified way”.
I don’t particularly agree with Adam’s comments, but it does not surprise me that someone could come to honestly believe the claims within them.
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I’d have actively agreed with the claims.
Things that don’t meet that bar:
General: Lots of these points make claims about what Eliezer is thinking, how his reasoning works, and what evidence it is based on. I don’t necessarily have the same views, primarily because I’ve engaged much less with Eliezer and so don’t have confident Eliezer-models. (They all seem plausible to me, except where I’ve specifically noted disagreements below.)
Agreement 14: Not sure exactly what this is saying. If it’s “the AI will probably always be able to seize control of the physical process implementing the reward calculation and have it output the maximum value” I agree.
Agreement 16: I agree with the general point but I would want to know more about the AI system and how it was trained before evaluating whether it would learn world models + action consequences instead of “just being nice”, and even with the details I expect I’d feel pretty uncertain which was more likely.
Agreement 17: It seems totally fine to focus your attention on a specific subset of “easy-alignment” worlds and ensuring that those worlds survive, which could be described as “assuming there’s a hope”. That being said, there’s something in this vicinity I agree with: in trying to solve alignment, people sometimes make totally implausible assumptions about the world; this is a worse strategy for reducing x-risk than working on the worlds you actually expect and giving them another ingredient that, in combination with a “positive model violation”, could save those worlds.
Disagreement 10: I don’t have a confident take on the primate analogy; I haven’t spent enough time looking into it for that.
Disagreement 15: I read Eliezer as saying something different in point 11 of the list of lethalities than Paul attributes to him here; something more like “if you trained on weak tasks either (1) your AI system will be too weak to build nanotech or (2) it learned the general core of intelligence and will kill you once you get it to try building nanotech”. I’m not confident in my reading though.
Disagreement 18: I find myself pretty uncertain about what to expect in the “breed corrigible humans” thought experiment.
Disagreement 22: I was mostly in agreement with this, but “obsoleting human contributions to alignment” is a pretty high bar if you take it literally, and I don’t feel confident that happens before superintelligent understanding of the world (though it does seem plausible).
(EDIT: I’m already seeing downvotes of the post, it was originally at 58 AF karma. This wasn’t my intention: I think this is a failure of the community as a whole, not of the author.)
Okay, this has gotten enough karma and has been curated and has influenced another post, so I suppose I should engage, especially since I’m not planning to put this in the Alignment Newsletter.
(A lot copied over from this comment of mine)
This is extremely basic RL theory.
The linked paper studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn’t know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do. Basic POMDP theory will tell you that when you have partial observability your policy needs to depend on history, i.e. it needs to learn.
However, because bandit problems have been studied in the AI literature, and “learning algorithms” have been proposed to solve bandit problems, this very normal fact of a policy depending on observation history is now trotted out as “learning algorithms spontaneously emerge”. I don’t understand why this was surprising to the original researchers, it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction. Perhaps it’s because it’s primarily a neuroscience paper, and they weren’t very familiar with AI.
More broadly, I don’t understand what people are talking about when they speak of the “likelihood” of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance that a policy trained by RL will “learn” without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL—presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must “learn” as it is being executed.
Gwern notes here that we’ve seen this elsewhere. This is because it’s exactly what you’d expect, just that in the other cases we call conditioning on observations “adaptation” rather than “learning”.
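To spell out why “the policy depends on history” already produces this behavior, here is a toy sketch (mine, not from the paper) of a bandit policy that “learns” within an episode purely by conditioning on its observations:

```python
import numpy as np

# Two-armed bandit where the good arm is re-randomized each episode. A
# fixed policy that conditions on history (estimate each arm, then exploit
# the better estimate) beats any history-independent policy, with no
# gradient updates at test time. All numbers are made up.
rng = np.random.default_rng(0)

def run_episode(steps=50):
    p = rng.permutation([0.2, 0.8])   # which arm pays off is unknown
    counts, wins, total = np.zeros(2), np.zeros(2), 0
    for t in range(steps):
        if t < 2:
            arm = t                                     # explore each arm once
        else:
            arm = np.argmax((wins + 1) / (counts + 2))  # exploit posterior mean
        reward = rng.random() < p[arm]
        counts[arm] += 1; wins[arm] += reward; total += reward
    return total

print(np.mean([run_episode() for _ in range(1000)]))
# Well above the ~25 a history-independent random policy would average.
```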
----
Meta: I’m disappointed that I had to be the one to point this out. (Though to be fair, Gwern clearly understands this point.) There’s clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn’t been said. When I saw this post first come up, my immediate reaction was “oh I’m sure this is a typical LW example of a case where the optimal policy is interpreted as learning, I’m not even going to bother clicking on the link”. Do we really have so few people who understand machine learning, that of the many, many views this post must have had, not one person could figure this out? It’s really no surprise that ML researchers ignore us if this is the level of ML understanding we as a community have.
EDIT: I should give credit to Nevan for pointing out that this paper is not much evidence in favor of the hypothesis that the neural network weights encode some search algorithm (before I wrote this comment).
On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).
I’m quite a bit more pessimistic about having lots of people doing these approaches than you seem to be. In the abstract my concerns are somewhat similar to Habryka’s, but I think I can make them a lot more concrete given this post. The TL;DR is: (1) for half the things, I think they’re net negative if done poorly, and I think that’s probably the case on the current margin, and (2) for the other half of things, I think they’re great, and the way you accomplish them is by joining safety / governance teams at AI labs, which are already doing them and are in a much better position to do them than anyone else.
(When talking about industry labs here I’m thinking more about Anthropic and DeepMind—I know less about OpenAI, though I’d bet it applies to them too.)
Direct outreach to AGI researchers
Currently, I’d estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I’d think wasn’t clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded. EDIT: I looked back and explicitly counted—I ran it with at least 19 people, and 2 succeeded: one gave an argument for “AI risk is non-trivially likely”, another gave an argument for “this is a speculative worry but worth investigating” which I wasn’t previously counting but does meet my criterion above.) Those 50 people tend to be busy and in any case your post doesn’t seem to be directed at them. (Also, if we require people to write down an argument in advance that they defend, rather than changing it somewhat based on pushback from me, my estimate drops to, idk, 20 people.)
Now, even arguments that are clearly flawed to me could convince AGI researchers that AI risk is important. I tend to think that the sign of this effect is pretty unclear. On the one hand I don’t expect these researchers to do anything useful, partly because in my experience “person says AI safety is good” doesn’t translate into “person does things”, and partly because incorrect arguments lead to incorrect beliefs which lead to useless solutions. On the other hand maybe we’re just hoping for a general ethos of “AI risk is real” that causes political pressure to slow down AI.
But it really doesn’t seem great that my case for wide-scale outreach being good is “maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we’ll slow down, and the extra years of time will help”. So overall my guess is that this is net negative.
(On my beliefs, which I acknowledge not everyone shares, expecting something better than “mass delusion of incorrect beliefs that implies that AGI is risky” if you do wide-scale outreach now is assuming your way out of reality.)
(Fwiw I do expect that there will be a major shift towards AI risk being taken more seriously, as AGI becomes more visceral to people, as outreach efforts continue, and as it becomes more of a culturally expected belief. I often view my job as trying to inject some good beliefs about AI risk among the oncoming deluge of beliefs about AI risk.)
Develop new resources that make AI x-risk arguments & problems more concrete
Seems good if done by one of the 20 people who can make a good argument without pushback from me. If you instead want this to be done on a wide scale I think you have basically the same considerations as above.
Demonstrate concerning capabilities & alignment failures
Seems probably net negative when done at a wide scale, as we’ll see demonstrations of “alignment failures” that aren’t actually related to the way I expect alignment failures to go, and then the most viral one (which won’t be the most accurate one) will be the one that dominates discourse.
Break and red team alignment proposals (especially those that will likely be used by major AI labs)
For the examples of work that you cite, my actual prediction is that they have had ~no effect on the broader ML community, but if they did have an effect, I’d predict that the dominant one is “wow these alignment folks have so much disagreement and say pretty random stuff, they’re not worth paying attention to”. So overall my take is that this is net-negative from the “buying time” perspective (though I think it is worth doing for other reasons).
Organize coordination events
I’m not seeing why any of the suggestions here are better than the existing strategy of “create alignment labs at industry orgs which do this sort of coordination”.
(But I do like the general goal! If you’re interested in doing this, consider trying to get hired at an industry alignment lab. It’s way easier to do this when you don’t have to navigate all of the confidentiality protocols because you’re a part of the company.)
I guess one benefit is that you can have some coordination between top alignment people who aren’t at industry labs? I’m much more keen on having those people just doing good alignment work, and coordinating with the industry alignment labs. This seems way more efficient.
Support safety and governance teams at major AI labs
Strongly in favor of the goal, but how do you do this other than by joining the teams?
Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here.
The linked article is about capabilities roles in labs, not safety / governance teams in labs. I’d guess that most people including many of those 11 anonymous experts would be pretty positive on having people join safety / governance teams in labs.
Develop and promote reasonable safety standards for AI labs
Sounds great! Seems like you should do it by joining the relevant teams at the AI labs, or at least having a lot of communication with them. (I think it’s way way harder to do outside of the labs because you are way less informed about what the constraints are and what standards would be feasible to coordinate on.)
You could do abstract research on safety standards with the hope that this turns into something useful a few years down the line. I’m somewhat pessimistic on this but much less confident in my pessimism here.
My guess at part of your views:
There’s ~one natural structure for capabilities, such that (assuming we don’t have deep mastery of intelligence) nearly anything we build that is an AGI will have that structure.
Given this, there will be a point where an AI system switches from everything-muddled-in-a-soup to clean capabilities and muddled alignment (the “sharp left turn”).
I basically agree that the plans I consider don’t engage much with this sort of scenario. This is mostly because I don’t expect this scenario and so I’m trying to solve the alignment problem in the worlds I do expect.
(For the reader: I am not saying “we’re screwed if the sharp left turn happens so we should ignore it”, I am saying that the sharp left turn is unlikely.)
A consequence is that I care a lot about knowing whether the sharp left turn is actually likely. Unfortunately so far I have found it pretty hard to understand why exactly you and Eliezer find it so likely. I think current SOTA on this disagreement is this post and I’d be keen on more work along those lines.
Some commentary on the conversation with me:
Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won’t just generalize. This seems characteristically overconfident.
This isn’t exactly wrong—I do think you are overconfident—but I wouldn’t say something like “characteristically overconfident” unless you were advocating for some particular decision right now which depended on others deferring to your high credences in something. It just doesn’t seem useful to argue this point most of the time and it doesn’t feature much in my reasoning.
For instance, observe that natural selection didn’t try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem.
Good description of why I don’t find the evolution analogy compelling for “sharp left turn is very likely”.
And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don’t see why you’re worried.
I’d phrase it as “I don’t see why you think [sharp left turn leading to failures of generalization of alignment that we can’t notice and fix before we’re dead] is very likely to happen”. I’m worried too!
Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you’re putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?
Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.
I think if you define the hard problem to be the sharp left turn as described at the beginning of my comment then my response is “no, I don’t usually focus on that problem” (which I would defend as the correct action to take).
Also if I had to summarize the plan in a sentence it would be “empower your oversight process as much as possible to detect problems in the AI system you’re training (both in the outcomes it produces and the reasoning process it employs)”.
Nate: That doesn’t seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn’t.
Yup, agreed.
Though if you weaken claim 1, that there is ~one natural structure to capabilities, to instead say that there are many possible structures to capabilities but the default one is deadly EU maximization, then I no longer agree. It seems pretty plausible to me that stronger oversight changes the structure of your capabilities.
Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I’m lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don’t look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don’t want to delay this post any longer, so, some other time, maybe.
I think the more relevant cruxes are the claims at the top of this comment (particularly claim 1); I think if I’ve understood the “sharp left turn” correctly I agree with you that the approaches I have in mind don’t help much (unless the approaches succeed wildly, to the point of mastering intelligence, e.g. my approaches include mechanistic interpretability which as you agree could in theory get to that point even if they aren’t likely to in practice).
My claims are really just for CS, idk how much they apply to the social sciences, but the post gives me no reason to think they aren’t true for the social sciences as well.
Just stop citing bad research, I shouldn’t need to tell you this, jesus christ what the fuck is wrong with you people.
This doesn’t work unless it’s common knowledge that the research is bad, since reviewers are looking for reasons to reject and “you didn’t cite this related work” is a classic one (and your paper might be reviewed by the author of the bad work). When I was early in my PhD, I had a paper rejected where it sounded like a major contributing factor was not citing a paper that I specifically thought was not related but the reviewer thought was.
Read the papers you cite. Or at least make your grad students do it for you. It doesn’t need to be exhaustive: the abstract, a quick look at the descriptive stats, a good look at the table with the main regression results, and then a skim of the conclusions. Maybe a glance at the methodology if they’re doing something unusual. It won’t take more than a couple of minutes. And you owe it not only to SCIENCE!, but also to yourself: the ability to discriminate between what is real and what is not is rather useful if you want to produce good research.
I think the point of this recommendation is to get people to stop citing bad research. I doubt it will make a difference since as argued above the cause isn’t “we can’t tell which research is bad” but “despite knowing what’s bad we have to cite it anyway”.
When doing peer review, reject claims that are likely to be false. The base replication rate for studies with p>.001 is below 50%. When reviewing a paper whose central claim has a p-value above that, you should recommend against publication unless the paper is exceptional (good methodology, high prior likelihood, etc.) If we’re going to have publication bias, at least let that be a bias for true positives. Remember to subtract another 10 percentage points for interaction effects. You don’t need to be complicit in the publication of false claims.
I have issues with this, but they aren’t related to me knowing more about academia than the author, so I’ll skip it. (And it’s more like, I’m uncertain about how good an idea this would be.)
Stop assuming good faith. I’m not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.
The evidence in the post suggesting that people aren’t acting in good faith is roughly “if you know statistics then it’s obvious that the papers you’re writing won’t replicate”. My guess is that many social scientists don’t know statistics and/or don’t apply it intuitively, so I don’t see a reason to reject the (a priori more plausible to me) hypothesis that most people are acting in okay-to-good faith.
I don’t really understand the author’s model here, but my guess is that they are assuming that academics primarily think about “here’s the dataset and here are the analysis results and here are the conclusions”. I can’t speak to social science, but when I’m trying to figure out some complicated thing (e.g “why does my algorithm work in setting X but not setting Y”) I spend most of my time staring at data, generating hypotheses, making predictions with them, etc. which is very very conducive to the garden of forking paths that the author dismisses out of hand.
EDIT: Added some discussion of the other recommendations below, though I know much less about them, and here I’m just relying more on my own intuition rather than my knowledge about academia:
Earmark 60% of funding for registered reports (ie accepted for publication based on the preregistered design only, not results). For some types of work this isn’t feasible, but for ¾ of the papers I skimmed it’s possible. In one fell swoop, p-hacking and publication bias would be virtually eliminated.
I’d be shocked if 3/4 of social science papers could have been preregistered. My guess is that what happens is that researchers collect data, do a bunch of analyses, figure out some hypotheses, and only then write the paper.
Possibly the suggestion here is that all this exploratory work should be done first, then a study should be preregistered, and then the results are reported. My weak guess is that this wouldn’t actually help replicability very much—my understanding is that researchers are often able to replicate their own results, even when others can’t. (Which makes sense! If I try to describe to a CHAI intern an algorithm they should try running, I often have the experience that they do something differently than I was expecting. Ideally in social science results would be robust to small variations, but in practice they aren’t, and I wouldn’t strongly expect preregistration to help, though plausibly it would.)
An NSF/NIH inquisition that makes sure the published studies match the pre-registration (there’s so much """QRP""" in this area you wouldn’t believe). The SEC has the power to ban people from the financial industry—let’s extend that model to academia.
My general qualms about preregistration apply here too, but if we assume that we’re going to have a preregistration model, then this seems good to me.
Earmark 10% of funding for replications. When the majority of publications are registered reports, replications will be far less valuable than they are today. However, intelligently targeted replications still need to happen.
This seems good to me (though idk if 10% is the right number, I could see both higher and lower).
Increase sample sizes and lower the significance threshold to .005. This one needs to be targeted: studies of small effects probably need to quadruple their sample sizes in order to get their power to reasonable levels. The median study would only need 2x or so. Lowering alpha is generally preferable to increasing power. “But Alvaro, doesn’t that mean that fewer grants would be funded?” Yes.
Personally, I don’t like the idea of significance thresholds and required sample sizes. I like having quantitative data because it informs my intuitions; I can’t just specify a hard decision rule based on how some quantitative data will play out.
Even if this were implemented, I wouldn’t predict much effect on reproducibility: I expect that what happens is the papers we get have even more contingent effects that only the original researchers can reproduce, which happens via them traversing the garden of forking paths even more. Here’s an example with p-values of .002 and .006.
Andrew Gelman makes a similar case.
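For reference, the sample-size arithmetic behind the quoted recommendation does check out under the usual normal-approximation power formula; a quick sketch (the effect sizes are illustrative):

```python
from scipy.stats import norm

# Per-group n for a two-sided two-sample test at significance alpha and
# 80% power, as a function of standardized effect size d.
def n_per_group(d, alpha, power=0.8):
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 / d ** 2

for d in [0.5, 0.25]:
    print(d, round(n_per_group(d, 0.05)), round(n_per_group(d, 0.005)))
# Halving d quadruples n; moving alpha from .05 to .005 multiplies n by ~1.7.
```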
Ignore citation counts. Given that citations are unrelated to (easily-predictable) replicability, let alone any subtler quality aspects, their use as an evaluative tool should stop immediately.
I am very on board with citation counts being terrible, but what should be used instead? If you evaluate based on predicted replicability, you incentivize research that says obvious things, e.g. “rain is correlated with wet sidewalks”.
I suspect that you probably could build a better and still cost-efficient evaluation tool, but it’s not obvious how.
Open data, enforced by the NSF/NIH. There are problems with privacy but I would be tempted to go as far as possible with this. Open data helps detect fraud. And let’s have everyone share their code, too—anything that makes replication/reproduction easier is a step in the right direction.
Seems good, though I’d want to first understand what purpose IRBs serve (you’d have to severely roll back IRBs for open data to become a norm).
Financial incentives for universities and journals to police fraud. It’s not easy to structure this well because on the one hand you want to incentivize them to minimize the frauds published, but on the other hand you want to maximize the frauds being caught. Beware Goodhart’s law!
I approve of the goal “minimize fraud”. This recommendation is too vague for me to comment on the strategy.
Why not do away with the journal system altogether? The NSF could run its own centralized, open website; grants would require publication there. Journals are objectively not doing their job as gatekeepers of quality or truth, so what even is a journal? A combination of taxonomy and reputation. The former is better solved by a simple tag system, and the latter is actually misleading. Peer review is unpaid work anyway, it could continue as is. Attach a replication prediction market (with the estimated probability displayed in gargantuan neon-red font right next to the paper title) and you’re golden. Without the crutch of “high ranked journals” maybe we could move to better ways of evaluating scientific output. No more editors refusing to publish replications. You can’t shift the incentives: academics want to publish in “high-impact” journals, and journals want to selectively publish “high-impact” research. So just make it impossible. Plus as a bonus side-effect this would finally sink Elsevier.
This seems to assume that the NSF would be more competent than journals for some reason. I don’t think the problem is with journals per se, I think the problem is with peer review, so if the NSF continues to use peer review as the author suggests, I don’t expect this to fix anything.
The author also suggests using a replication prediction market; as I mentioned above you don’t want to optimize just for replicability. Possibly you could have replication + some method of incentivizing novelty / importance. The author does note this issue elsewhere but just says “it’s a solvable problem”. I am not so optimistic. I feel like similar a priori reasoning could have led to the author saying “reproducibility will be a solvable problem”.
Note: I link to a bunch of stuff below in the context of the DeepMind safety team, this should be thought of as “things that particular people do” and may not represent the views of DeepMind or even just the DeepMind safety team.
I just don’t know much about what the [DeepMind] technical alignment work actually looks like right now
We do a lot of stuff, e.g. of the things you’ve listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:
Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
LLM alignment (lots of work discussed in the podcast with Geoffrey you mentioned)
Scalable oversight (same as above)
Mechanistic interpretability (unpublished so far)
Externalized Reasoning Oversight (my guess is that this will be published soon) (EDIT: this paper)
Communicating views on alignment (e.g. the post you linked, the writing that I do on this forum is in large part about communicating my views)
Deception + inner alignment (in particular examples of goal misgeneralization)
Understanding agency (see e.g. discovering agents, most of Ramana’s posts)
And in addition we’ve also done other stuff beyond the list above; I’m probably forgetting a few others.
I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn’t really one “unified agenda”.
Disincentives for me personally:
The LW/AF audience by and large operates under a set of assumptions about AI safety that I don’t really share. I can’t easily describe this set, but one bad way to describe it would be “the MIRI viewpoint” on AI safety. This particular disincentive is probably significantly stronger for other “ML-focused AI safety researchers”.
More effort needed to write comments than to talk to people IRL
By a lot. As a more extreme example, on the recent pessimism for impact measures post, TurnTrout and I switched to private online messaging at one point, and I’d estimate it was about ~5x faster to get to the level of shared understanding we reached than if we had continued with typical big comment responses on AF/LW.
To add on my thinking in particular: my view for at least a couple of years was that alignment would go mainstream at some point and discourse quality would then fall. I didn’t really see a good way for me to make the public discourse much better—I am not as gifted at persuasive writing as (say) Eliezer, nor are my views as memetically fit. As a result, my plan has been to have more detailed / nuanced conversations with individuals and/or small groups, and especially to advise people making important decisions (and/or make those decisions myself), and that was a major reason I chose to work at an industry lab. I think that plan has fared pretty well, but you’re not going to see much evidence of that publicly.
I was, however, surprised by the suddenness with which things changed; had I concretely expected that I would have wanted the community to have more “huge asks” ready in advance. (I was instead implicitly thinking that the strength of the community’s asks would ratchet upwards gradually as more and more people were convinced.)
LW Folk Game Theory is in fact not real game theory. The key difference is that LW Folk Game Theory tends to assume that positive utility corresponds to “I would choose this over nothing” while negative utility corresponds to “I would choose nothing over this”, and 0 utility is the indifference point.
Real Game Theory does not make such an assumption. In real game theory, you take actions that maximize your (expected) utility. Importantly, if you just add a constant to your utility function (for every possible action / outcome), then the maximizing action is not going to change—there’s no concept of “0 is the indifference point”. So, if there are two outcomes o1 and o2 that can be achieved, and no others, then the utility function (u(o1), u(o2)) is identical to (u(o1) + c, u(o2) + c) for any constant c. In LW Folk Game Theory, “doing nothing” is usually an action and is assigned 0 utility by convention, which prevents this from happening.
If “positive sum games” isn’t really a thing, I’d have expected to run into pushback about that at some point.
Consider a two player game where for any outcome o, u1(o) + u2(o) = c for some constant c > 0. Sure sounds like a positive-sum game, right? Well, by the argument above, I can replace u2 with u2 − c and the game remains exactly the same. And now we have u1(o) + (u2(o) − c) = 0 for every outcome o, i.e. we’re in a zero-sum game.
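To make this concrete, here is a minimal Python sketch (with a made-up 2x2 payoff matrix) of the point that shifting one player’s utilities by a constant changes nothing strategically:

```python
import numpy as np

# Illustrative 2x2 game, indexed [player 1's action, player 2's action].
# Every outcome satisfies u1 + u2 = 2, so it "looks" positive-sum.
u1 = np.array([[3.0, 0.0],
               [2.0, 1.0]])
u2 = 2.0 - u1

# Subtracting the constant 2 from player 2's utilities makes the game
# zero-sum (u1 + u2_shifted = 0 everywhere)...
u2_shifted = u2 - 2.0

# ...but player 2's best response to each of player 1's actions is
# unchanged, so the strategic situation is identical.
for a1 in range(2):
    assert np.argmax(u2[a1]) == np.argmax(u2_shifted[a1])
print("same best responses; the 'sum' was just a relabeling")
```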
As cousin_it said, really they shouldn’t be called zero-sum games, they should be called fixed-sum or constant-sum games. Two player constant-sum games are perfectly competitive, and as a result there are no threats: anything that hurts the other player helps you in exactly the same amount, and so you do it.
(As you note, if there are more than 2 players, you can get things like threats and collaboration, e.g. the weaker two players collaborate to overthrow the stronger one.)
Re: expecting pushback, I generally don’t expect LW terminology to agree particularly well with academia. The goals are different, and the terminology reflects this. LW wants to be able to compare everything to “nothing happened”, so there’s a convention that nothing happens = 0 utility. Real game theory doesn’t want to make that comparison, it prefers to have elegance and fewer assumptions.
LW “positive-sum games” means “both players are better off than if they did nothing”, or at least “one of the players is better off by an amount greater than the amount the other player is worse off”. Similarly for “negative-sum games”. This is fundamentally about comparing to “nothing happens”. Real game theory doesn’t care, it is all about action selection; and many games don’t have a “nothing happens” option. (See e.g. prisoner’s dilemma, where you must cooperate or defect, you can’t choose to leave the game.)
The thing I’m actually trying to contrast here is “the sort of strategic landscape, and orientation, where the thing to do is to fight over who wins social points, vs the sort of strategic landscape that encourages building something together.”
I usually call this competitive vs. collaborative, and games / strategies can be on a spectrum between competitive and collaborative. The maximally competitive games are two player zero sum games. The maximally collaborative games are common payoff games (where all players have the same utility function). Other games fall in between.
(where “fighting over who gets social points” can still involve cooperation, but it’s “allies at war”-style cooperation that is dividing up spoils, rather than creating spoils)
Here it seems like there is both a collaborative aspect (maximizing the amount of spoils that can be shared between the two) and a competitive aspect (getting the largest fraction of the available spoils).
I think I would particularly critique DeepMind and OpenAI’s interpretability works, as I don’t see how this reduces risks more than other works that they could be doing, and I’d appreciate a written plan of what they expect to achieve.
I can’t speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be enhanced. For example:
It is possible to automatically make and verify claims about what topics a model is internally “thinking about” when answering a question. This is integrated into debate, and allows debaters to critique each other’s internal reasoning, not just the arguments they externally make.
(It’s unclear how much this buys you on top of cross-examination.)
It is possible to automatically identify “cruxes” for the model’s outputs, making it easier for adversaries to design situations that flip the crux without flipping the overall correct decision.
Redwood’s adversarial training project is roughly in this category, where the interpretability technique is saliency, specifically magnitude of gradient of the classifier output w.r.t. the token embedding.
(Yes, typical mech interp directions are far more detailed than saliency. The hope is that they would produce affordances significantly more helpful and robust than saliency.)
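For concreteness, here is a minimal sketch of that saliency computation in PyTorch; it assumes a HuggingFace-style classifier that accepts `inputs_embeds`, and all names here are illustrative rather than a description of Redwood’s actual code:

```python
import torch

def token_saliency(model, inputs_embeds, attention_mask, target_class):
    # Track gradients on the token embeddings themselves.
    inputs_embeds = inputs_embeds.detach().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds,
                   attention_mask=attention_mask).logits
    # Backpropagate the target-class logit down to the embeddings.
    logits[:, target_class].sum().backward()
    # Saliency of each token = magnitude (L2 norm) of its embedding gradient.
    return inputs_embeds.grad.norm(dim=-1)  # shape: (batch, seq_len)
```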
A different theory of change for the same affordance is to use it to analyze warning shots, to understand the underlying cause of the warning shot (was it deceptive alignment? specification gaming? mistake from not knowing a relevant fact? etc).
I don’t usually try to backchain too hard from these theories of change to work done today; I think it’s going to be very difficult to predict in advance what kind of affordances we might build in the future with years’ more work (similarly to Richard’s comment, though I’m focused more on affordances than principled understanding of deep learning; I like principled understanding of deep learning but wouldn’t be doing basic research on interpretability if that was my goal).
My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build. As an example of how I reason about what projects to do, I’m now somewhat less excited about projects that do manual circuit analysis of an algorithmic task. They do still teach us new stylized facts about LLMs like “there are often multiple algorithms at different ‘strengths’ spread across the model” that can help with future mech interp, but overall it feels like these projects aren’t pushing the boundaries as much as seems possible, because we’re using the same, relatively-well-vetted techniques for all of these projects.
I’m also more keen on applying interpretability to downstream tasks (e.g. fixing issues in a model, generating adversarial examples), but not necessarily because I think it will be better than alternative methods today, but rather because I think the downstream task keeps you honest (if you don’t actually understand what’s going on, you’ll fail at the task) and because I think practice with downstream tasks will help us notice which problems are important to solve vs. which can be set aside. This is an area where other people disagree with me (and I’m somewhat sympathetic to their views, e.g. that the work that best targets a downstream task won’t tackle fundamental interp challenges like superposition as well as work that is directly trying to tackle those fundamental challenges).
(EDIT: I mostly agree with Ryan’s comment, and I’ll note that I am considering a much wider category of work than he is, which is part of why I usually say “interpretability” rather than “mechanistic interpretability”.)
Separately, you say:
I don’t see how this reduces risks more than other works that they could be doing
I’m not actually sure why you believe this. On the views you’ve expressed in this post (which, to be clear, I often disagree with), I feel like you should think that most of our work is just as bad as interpretability.
In particular we’re typically in the business of building aligned models. As far as I can tell, you think that interpretability can’t be used for this because (1) it is dual use, and (2) if you optimize against it, you are in part optimizing for the AI system to trick your interpretability tools. But these two points seem to apply to any alignment technique that is aiming to build aligned models. So I’m not sure what other work (within the “build aligned models” category) you think we could be doing that is better than interpretability.
(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of “indefinite, or at least very long, pause on AI progress”. If that’s your position I wish you would have instead written a post that was instead titled “against almost every theory of impact of alignment” or something like that.)
In the comments to the OP, Eliezer’s comments about small problems versus hard problems got condensed down to ‘almost everyone working on alignment is faking it.’ I think that is not only uncharitable, it’s importantly a wrong interpretation [...]
Note that there is a quote from Eliezer using the term “fake”:
And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.
It could certainly be the case that Eliezer means something else by the word “fake” than the commenters mean when they use the word “fake”; it could also be that Eliezer thinks that only a tiny fraction of the work is “fake” and most is instead “pointless” or “predictable”, but the commenters aren’t just creating the term out of nowhere.
Planned summary for the Alignment Newsletter:
Once again, we have a piece of work so large and detailed that I need a whole newsletter to summarize it! This time, it is a quantitative model for forecasting when transformative AI will happen.
The overall framework
The key assumption behind this model is that if we train a neural net or other ML model that uses about as much computation as a human brain, that will likely result in transformative AI (TAI) (defined as AI that has an impact comparable to that of the industrial revolution). In other words, we _anchor_ our estimate of the ML model’s inference computation to that of the human brain. This assumption allows us to estimate how much compute will be required to train such a model _using 2020 algorithms_. By incorporating a trend extrapolation of how algorithmic progress will reduce the required amount of compute, we can get a prediction of how much compute would be required for the final training run of a transformative model in any given year.
We can also get a prediction of how much compute will be _available_ by predicting the cost of compute in a given year (which we have a decent amount of past evidence about), and predicting the maximum amount of money an actor would be willing to spend on a single training run. The probability that we can train a transformative model in year Y is then just the probability that the compute _requirement_ for year Y is less than the compute _available_ in year Y.
The vast majority of the report is focused on estimating the amount of compute required to train a transformative model using 2020 algorithms (where most of our uncertainty would come from); the remaining factors are estimated relatively quickly without too much detail. I’ll start with those so that you can have them as background knowledge before we delve into the real meat of the report. These are usually modeled as logistic curves in log space: that is, they are modeled as improving at some constant rate, but will level off and saturate at some maximum value after which they won’t improve.
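As a rough illustration of that modeling convention (the report’s exact parametrization may differ; the function below is my own stand-in), such a curve could be written as:

```python
import numpy as np

def saturating_oom(t, rate, max_oom, midpoint):
    # Logistic curve in log space: improvement, measured in orders of
    # magnitude, grows at about `rate` OOM/year near the midpoint and
    # levels off at `max_oom`. All parameters here are illustrative.
    return max_oom / (1.0 + np.exp(-4.0 * rate * (t - midpoint) / max_oom))
```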
Algorithmic progress
First off, we have the impact of _algorithmic progress_. <@AI and Efficiency@> estimates that algorithms improve enough to cut compute times in half every 16 months. However, this was measured on ImageNet, where researchers are directly optimizing for reduced computation costs. It seems less likely that researchers are doing as good a job at reducing computation costs for “training a transformative model”, and so the author increases the **halving time to 2-3 years**, with a maximum of **somewhere between 1-5 orders of magnitude** (with the assumption that the higher the “technical difficulty” of the problem, the more algorithmic progress is possible).
Cost of compute
Second, we need to estimate a trend for compute costs. There has been some prior work on this (summarized in [AN #97](https://mailchi.mp/a2b5efbcd3a7/an-97-are-there-historical-examples-of-large-robust-discontinuities)). The report has some similar analyses, and ends up estimating **a doubling time of 2.5 years**, and a (very unstable) maximum of improvement by **a factor of 2 million by 2100**.
Willingness to spend
Third, we would like to know the maximum amount (in 2020 dollars) any actor might spend on a single training run. Note that we are estimating the money spent on a _final training run_, which doesn’t include the cost of initial experiments or the cost of researcher time. Currently, the author estimates that all-in project costs are 10-100x larger than the final training run cost, but this will likely go down to something like 2-10x, as the incentive for reducing this ratio becomes much larger.
The author estimates that the most expensive run _in a published paper_ was the final <@AlphaStar@>(@AlphaStar: Mastering the Real-Time Strategy Game StarCraft II@) training run, at ~1e23 FLOP and $1M cost. However, there have probably been unpublished results that are slightly more expensive, maybe $2-8M. In line with <@AI and Compute@>, this will probably increase dramatically to about **$1B in 2025**.
Given that AI companies each have around $100B cash on hand, and could potentially borrow an additional several hundred billion dollars (given their current market caps and likely growth in the worlds where AI still looks promising), it seems likely that low hundreds of billions of dollars could be spent on a single run by 2040, corresponding to a doubling time (from $1B in 2025) of about 2 years.
To estimate the maximum here, we can compare to megaprojects like the Manhattan Project or the Apollo program, which suggests that a government could spend around 0.75% of GDP for ~4 years. Since transformative AI will likely be more valuable economically and strategically than these previous programs, we can shade that upwards to 1% of GDP for 5 years. Assuming all-in costs are 5x that of the final training run, this suggests the maximum willingness to spend should be 1% of GDP of the largest country, which we assume grows at ~3% every year.
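A minimal sketch of this spending trajectory; the GDP baseline (~$21T for the US in 2020) is an assumption of mine, not a number from the report:

```python
def max_spend(year):
    # Willingness to spend on a single training run (2020 dollars):
    # $1B in 2025, doubling every ~2 years, capped at 1% of the largest
    # country's GDP (~$21T assumed for 2020), with GDP growing ~3%/year.
    trend = 1e9 * 2 ** ((year - 2025) / 2.0)
    cap = 0.01 * 21e12 * 1.03 ** (year - 2020)
    return min(trend, cap)

print(f"{max_spend(2040):.1e}")  # ~2e11, i.e. low hundreds of billions
```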
Strategy for estimating training compute for a transformative model
In addition to the three factors of algorithmic progress, cost of compute, and willingness to spend, we need an estimate of how much computation would be needed to train a transformative model using 2020 algorithms (which I’ll discuss next). Then, at year Y, the compute required is given by the computation needed with 2020 algorithms, reduced by the improvement factor from algorithmic progress; in this report, that requirement is a probability distribution. At year Y, the compute available is given by FLOP per dollar (aka compute cost) * money that can be spent, which (in this report) is a point estimate. We can then simply read off the probability that the compute required is less than the compute available.
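Here is a schematic Python version of that read-off; every distribution and constant below is a placeholder of mine, not the report’s:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder lognormal over training FLOP needed with 2020 algorithms
# (median 1e30 here; the report instead mixes several anchor distributions).
flop_2020 = 10 ** rng.normal(30.0, 3.0, size=100_000)

def p_transformative_by(year):
    # Requirement: the 2020-algorithms FLOP, reduced by algorithmic
    # progress (illustrative 2.5-year halving time, ignoring saturation).
    required = flop_2020 * 0.5 ** ((year - 2020) / 2.5)
    # Availability: FLOP/$ trend times dollars spent (point estimates;
    # all constants are placeholders, not the report's values).
    flop_per_dollar = 1e17 * 2 ** ((year - 2020) / 2.5)
    dollars = min(1e9 * 2 ** ((year - 2025) / 2.0), 2e11)
    available = flop_per_dollar * dollars
    # Probability that the requirement falls below what is affordable.
    return float(np.mean(required < available))
```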
Okay, so the last thing we need is a distribution over the amount of computation that would be needed to train a transformative model using 2020 algorithms, which is the main focus of this report. There is a lot of detail here that I’m going to elide, especially in talking about the _distribution_ as a whole (whereas I will focus primarily on the median case for simplicity). As I mentioned early on, the key hypothesis is that we will need to train a neural net or other ML model that uses about as much compute as a human brain. So the strategy will be to first translate from “compute of human brain” to “inference compute of neural net”, and then to translate from “inference compute of neural net” to “training compute of neural net”.
How much inference compute would a transformative model use?
We can talk about the rate at which synapses fire in the human brain. How can we convert this to FLOP? The author proposes the following hypothetical: suppose we redo evolutionary history, but in every animal we replace each neuron with N [floating-point units](https://en.wikipedia.org/wiki/Floating-point_unit) that each perform 1 FLOP per second. For what value of N do we still get roughly human-level intelligence over a similar evolutionary timescale? The author then does some calculations about simulating synapses with FLOPs, drawing heavily on the <@recent report on brain computation@>(@How Much Computational Power It Takes to Match the Human Brain@), to estimate that N would be around 1-10,000, which after some more calculations suggests that the human brain is doing the equivalent of 1e13 − 1e16 FLOP per second, with **a median of 1e15 FLOP per second**, and a long tail to the right.
Does this mean we can say that a transformative model will use 1e15 FLOP per second during inference? Such a model would have a clear flaw: even though we are assuming that algorithmic progress reduces compute costs over time, if we did the same analysis in e.g. 1980, we’d get the _same_ estimate for the compute cost of a transformative model, which would imply that there was no algorithmic progress between 1980 and 2020! The problem is that we’d always estimate the brain as using 1e15 FLOP per second (or around there), but for our ML models there is a difference between FLOP per second _using 2020 algorithms_ and FLOP per second _using 1980 algorithms_. So how do we convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”?
One approach is to look at how other machines we have designed compare to the corresponding machines that evolution has designed. An [analysis](https://docs.google.com/document/d/1HUtUBpRbNnnWBxiO2bz3LumEsQcaZioAPZDNcsWPnos/edit) by Paul Christiano concluded that human-designed artifacts tend to be 2-3 orders of magnitude worse than those designed by evolution, when considering energy usage. Presumably a similar analysis done in the past would have resulted in higher numbers and thus wouldn’t fall prey to the problem above. Another approach is to compare existing ML models to animals with a similar amount of computation, and see which one is subjectively “more impressive”. For example, the AlphaStar model uses about as much computation as a bee brain, and large language models use somewhat more; the author finds it reasonable to say that AlphaStar is “about as sophisticated” as a bee, or that <@GPT-3@>(@Language Models are Few-Shot Learners@) is “more sophisticated” than a bee.
We can also look at some abstract considerations. Natural selection had _a lot_ of time to optimize brains, and natural artifacts are usually quite impressive. On the other hand, human designers have the benefit of intelligent design and can copy the patterns that natural selection has come up with. Overall, these considerations roughly balance each other out. Another important consideration is that we’re only predicting what would be needed for a model that was good at most tasks that a human would currently be good at (think a virtual personal assistant), whereas evolution optimized for a whole bunch of other skills that were needed in the ancestral environment. The author subjectively guesses that this should reduce our estimate of compute costs by about an order of magnitude.
Overall, putting all these considerations together, the author intuitively guesses that to convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”, we should add an order of magnitude to the median, and add another two orders of magnitude to the standard deviation to account for our large uncertainty. This results in a median of **1e16 FLOP per second** for the inference-time compute of a transformative model.
Training compute for a transformative model
We might expect a transformative model to run a forward pass **0.1 − 10 times per second** (which on the high end would match human reaction time of 100ms), and for each parameter of the neural net to contribute **1-100 FLOP per forward pass**, which implies that if the inference-time compute is 1e16 FLOP per second then the model should have **1e13 − 1e17 parameters**, with a median of **3e14 parameters**.
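Multiplying the medians directly (a simplification; the report manipulates full distributions, which is why its median lands at 3e14 rather than ~1e15):

```python
inference_flop_per_s = 1e16  # median inference compute (from above)
forward_passes_per_s = 1.0   # geometric middle of the 0.1 - 10 range
flop_per_param = 10.0        # geometric middle of the 1 - 100 range

params = inference_flop_per_s / (forward_passes_per_s * flop_per_param)
print(f"{params:.0e} parameters")  # ~1e15; full distributions give ~3e14
```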
We now need to estimate how much compute it takes to train a transformative model with 3e14 parameters. We assume this is dominated by the number of times you have to run the model during training, or equivalently, the number of data points you train on times the number of times you train on each data point. (In particular, this assumes that the cost of acquiring data is negligible in comparison. The report argues for this assumption; for the sake of brevity I won’t summarize it here.)
For this, we need a relationship between parameters and data points, which we’ll assume follows a power law: the number of data points D needed scales as D = K·P^α, where P is the number of parameters and K and α are constants. A large number of ML theory results imply that the number of data points needed to reach a specified level of accuracy grows linearly with the number of parameters (i.e. α=1), which we can take as a weak prior. We can then update this with empirical evidence from papers. <@Scaling Laws for Neural Language Models@> suggests that for language models, data requirements scale as α=0.37 or as α=0.74, depending on what measure you look at. Meanwhile, [Deep Learning Scaling is Predictable, Empirically](https://arxiv.org/abs/1712.00409) suggests that α=1.39 for a wide variety of supervised learning problems (including language modeling). However, the former paper studies a more relevant setting: it includes regularization, and asks about the number of data points needed to reach a target accuracy, whereas the latter paper ignores regularization and asks about the minimum number of data points that the model _cannot_ overfit to. So overall the author puts more weight on the former paper and estimates a median of α=0.8, though with substantial uncertainty.
We also need to estimate how many epochs will be needed, i.e. how many times we train on any given data point. The author decides not to explicitly model this factor since it will likely be close to 1, and instead lumps in the uncertainty over the number of epochs with the uncertainty over the constant factor in the scaling law above. We can then look at language model runs to estimate a scaling law for them, for which the median scaling law predicts that we would need 1e13 data points for our 3e14 parameter model.
However, this has all been for supervised learning. It seems plausible that a transformative task would have to be trained using RL, where the model acts over a sequence of timesteps, and then receives (non-differentiable) feedback at the end of those timesteps. How would scaling laws apply in this setting? One simple assumption is to say that each rollout over the _effective horizon_ counts as one piece of “meaningful feedback” and so should count as a single data point. Here, the effective horizon is the minimum of the actual horizon and 1/(1-γ), where γ is the discount factor. We assume that the scaling law stays the same; if we instead try to estimate it from recent RL runs, it can change the results by about one order of magnitude.
So we now know we need to train a 3e14 parameter model with 1e13 data points for a transformative task. This gets us nearly all the way to the compute required with 2020 algorithms: we have a ~3e14 parameter model that takes ~1e16 FLOP per forward pass, that is trained on ~1e13 data points with each data point taking H timesteps, for a total of H * 1e29 FLOP. The author’s distributions are instead centered at H * 1e30 FLOP; I suspect this is simply because the author was computing with distributions whereas I’ve been directly manipulating medians in this summary.
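A small sketch chaining these medians together (again manipulating medians rather than distributions; the back-solved constant K is my own illustration, not a reported value):

```python
# Back-solve the scaling-law constant K from the medians above:
# D = K * P^alpha with alpha = 0.8, P = 3e14 params, D = 1e13 data points.
alpha = 0.8
P, D = 3e14, 1e13
K = D / P ** alpha  # ~26 (illustrative back-solve)

# Effective horizon of a data point: min(actual horizon, 1/(1 - gamma)).
def effective_horizon(horizon_s, gamma):
    return min(horizon_s, 1.0 / (1.0 - gamma))

# Total training FLOP: data points * timesteps per point * FLOP per timestep.
flop_per_forward_pass = 1e16
def training_flop(H):
    return D * H * flop_per_forward_pass  # = H * 1e29

print(f"{training_flop(1):.0e}")  # 1e29 for a 1-subjective-second horizon
```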
The last and most uncertain piece of information is the effective horizon of a transformative task. We could imagine something as low as 1 subjective second (for something like language modeling), or something as high as 1e9 subjective seconds (i.e. 32 subjective years), if we were to redo evolution, or train on a task like “do effective scientific R&D”. The author splits this up into short, medium and long horizon neural net paths (corresponding to horizons of 1e0-1e3, 1e3-1e6, and 1e6-1e9 respectively), and invites readers to place their own weights on each of the possible paths.
There are many important considerations here: for example, if you think that the dominating cost will be generative modeling (GPT-3 style, but maybe also for images, video etc), then you would place more weight on short horizons. Conversely, if you think the hard challenge is to gain meta learning abilities, and that we probably need “data points” comparable to the time between generations in human evolution, then you would place more weight on longer horizons.
Adding three more potential anchors
We can now combine all these ingredients to get a forecast for when compute will be available to develop a transformative model! But not yet: we’ll first add a few more possible “anchors” for the amount of computation needed for a transformative model. (All of the modeling so far has “anchored” the _inference time computation of a transformative model_ to the _inference time computation of the human brain_.)
First, we can anchor _parameter count of a transformative model_ to the _parameter count of the human genome_, which has far fewer “parameters” than the human brain. Specifically, we assume that all the scaling laws remain the same, but that a transformative model will only require 7.5e8 parameters (the amount of information in the human genome) rather than our previous estimate of ~1e15 parameters. This drastically reduces the amount of computation required, though it is still slightly above that of the short-horizon neural net, because the author assumed that the horizon for this path was somewhere between 1 and 32 years.
Second, we can anchor _training compute for a transformative model_ to the _compute used by the human brain over a lifetime_. As you might imagine, this leads to a much smaller estimate: the brain uses ~1e24 FLOP over 32 years of life, which is only 10x the amount used for AlphaStar, and even after adjusting upwards to account for man-made artifacts being worse than those made by evolution, the resulting model predicts a significant probability that we would already have been able to build a transformative model.
Finally, we can anchor _training compute for a transformative model_ to the _compute used by all animal brains over the course of evolution_. The basic assumption here is that our optimization algorithms and architectures are not much better than simply “redoing” natural selection from a very primitive starting point. This leads to an estimate of ~1e41 FLOP to train a transformative model, which is more than the long horizon neural net path (though not hugely more).
Putting it all together
So we now have six different paths: the three neural net anchors (short, medium and long horizon), the genome anchor, the lifetime anchor, and the evolution anchor. We can now assign weights to each of these paths, where each weight can be interpreted as the probability that that path is the _cheapest_ way to get a transformative model, as well as a final weight that describes the chance that none of the paths work out.
The long horizon neural net path can be thought of as a conservative “default” view: it could work out simply by training directly on examples of a long horizon task where each data point takes around a subjective year to generate. However, there are several reasons to think that researchers will be able to do better than this. As a result, the author assigns 20% to the short horizon neural net, 30% to the medium horizon neural net, and 15% to the long horizon neural net.
The lifetime anchor would suggest that we either already could get TAI, or are very close, which seems very unlikely given the lack of major economic applications of neural nets so far, and so gets assigned only 5%. The genome path gets 10%, the evolution anchor gets 10%, and the remaining 10% is assigned to none of the paths working out.
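Schematically, the combination step looks like the following; the per-anchor medians below are stand-ins I chose to roughly match the discussion above, not the report’s actual distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Probability that each path is the *cheapest* route to a transformative
# model; the last entry is the chance that no path works out.
weights = {"short": 0.20, "medium": 0.30, "long": 0.15,
           "lifetime": 0.05, "genome": 0.10, "evolution": 0.10, "none": 0.10}

# Stand-in medians for log10(training FLOP with 2020 algorithms) per anchor.
log10_flop_medians = {"short": 32, "medium": 35, "long": 38,
                      "lifetime": 27, "genome": 33, "evolution": 41}

def sample_requirement(n=100_000, spread=2.0):
    paths = rng.choice(list(weights), p=list(weights.values()), size=n)
    out = np.full(n, np.inf)  # "none" => never achievable
    for path, median in log10_flop_medians.items():
        mask = paths == path
        out[mask] = 10 ** rng.normal(median, spread, mask.sum())
    return out
```

Comparing these sampled requirements against the compute-available trend from the framework sketch earlier would recover a forecast of the kind reported below.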
This predicts a **median of 2052** for the year in which some actor would be willing and able to train a single transformative model, with the full graphs shown below:
<Graphs removed since they are in flux and not easy to share in a low-bandwidth way>
How does this relate to TAI?
Note that what we’ve modeled so far is the probability that by year Y we will have enough compute for the final training run of a transformative model. This is not the same thing as the probability of developing TAI. There are several reasons that TAI could be developed _later_ than the given prediction:
1. Compute isn’t the only input required: we also need data, environments, human feedback, etc. While the author expects that these will not be the bottleneck, this is far from a certainty.
2. When thinking about any particular path and making it more concrete, a host of problems tend to show up that will need to be solved and may add extra time. Some examples include robustness, reliability, possible breakdown of the scaling laws, the need to generate lots of different kinds of data, etc.
3. AI research could stall, whether because of regulation, a global catastrophe, an AI winter, or something else.
However, there are also compelling reasons to expect TAI to arrive _earlier_:
1. We may develop TAI through some other cheaper route, such as a <@services model@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@).
2. Our forecasts apply to a “balanced” model that has a similar profile of abilities as a human. In practice, it will likely be easier and cheaper to build an “unbalanced” model that is superhuman in some domains and subhuman in others, that is nonetheless transformative.
3. The curves for several factors assume some maximum after which progress is not possible; in reality it is more likely that progress slows to some lower but non-zero growth rate.
For years in the near future, there is less time to find cheaper routes (since there is less time to do the research), so the probabilities should probably be treated as overestimates; for similar reasons, the probabilities for later years should be treated as underestimates.
For the median of 2052, the author guesses that these considerations roughly cancel out, and so rounds the median for development of TAI to **2050**. A sensitivity analysis concludes that 2040 is the “most aggressive plausible median”, while the “most conservative plausible median” is 2080.
Planned opinion: