paulfchristiano(Paul Christiano)

Karma: 27,058

paulfchristiano 2 Dec 2022 18:01 UTC
235 points
190
on: Jailbreaking ChatGPT on Release Day
Eliezer writes:
OpenAI probably thought they were trying hard at precautions; but they didn’t have anybody on their team who was really creative about breaking stuff, let alone as creative as the combined internet; so it got jailbroken in a day after something smarter looked at it.
I think this suggests a really poor understanding of what’s going on. My fairly strong guess is that OpenAI folks know that it is possible to get ChatGPT to respond to inappropriate requests. For example:
- They write “While we’ve made efforts to make the model refuse inappropriate requests, it will sometimes respond to harmful instructions.” I’m not even sure what Eliezer thinks this means—that they hadn’t actually seen some examples of it responding to harmful instructions, but they inserted this language as a hedge? That they thought it randomly responded to harmful instructions with 1% chance, rather than thinking that there were ways of asking the question to which it would respond? That they found such examples but thought that Twitter wouldn’t?
- These attacks aren’t hard to find and there isn’t really any evidence suggesting that they didn’t know about them. I do suspect that Twitter has found more amusing attacks and probably even more consistent attacks, but that’s extremely different from “OpenAI thought there wasn’t a way to do this but there was.” (Below I describe why I think it’s correct to release a model with ineffective precautions, rather than either not releasing or taking no precautions.)
If I’m right that this is way off base, one unfortunate effect would be to make labs (probably correctly) take Eliezer’s views less seriously about alignment failures. That is, the implicit beliefs about what labs notice, what skills they have, how decisions are made, etc. all seem quite wrong, and so it’s natural to think that worries about alignment doom are similarly ungrounded from reality. (Labs will know better whether it’s inaccurate—maybe Eliezer is right that this took OpenAI by surprise in which case it may have the opposite effect.)
(Note that I think that alignment is a big deal and labs are on track to run a large risk of catastrophic misalignment! I think it’s bad if labs feel that concern only comes from people underestimating their knowledge and ability.)
I think it makes sense from OpenAI’s perspective to release this model even if protections against harms are ineffective (rather than not releasing or having no protections):
1. The actual harms from increased access to information are relatively low; this information is available and easily found with Google, so at best they are adding a small amount of convenience (and if you need to do a song and dance and you get back your answer as a poem, you are not even more convenient).
2. It seems likely that OpenAI’s primary concern is with PR risks or nudging users in bad directions. If users need to clearly go out of their way to coax the model to say bad stuff, then that mostly addresses their concerns (especially given point #1).
3. OpenAI making an unsuccessful effort to solve this problem makes it a significantly more appealing target for research, both for researchers at OpenAI and externally. It makes it way more appealing for someone to outcompete OpenAI on this axis and say “see OpenAI was trying but failed, so our progress is cool” vs the world where OpenAI said “whatever, we can’t solve the problem so let’s just not even try so it does’t look like we failed.” In general I think it’s good for people to advertise their alignment failures rather than trying to somehow cover them up. (I think saying the model confidently false stuff all the time is a way bigger problem than the “jailbreaking,” but both are interesting and highlight different alignment difficulties.)
I think that OpenAI also likely has an explicit internal narrative that’s like “people will break our model in creative ways and that’s a useful source of learning, so it’s great for us to get models in front of more eyes earlier.” I think that has some truth to that (though not for alignment in particular, since these failures are well-understood internally prior to release) but I suspect it’s overstated to help rationalize shipping faster.
To the extent this release was a bad idea, I think it’s mostly because of generating hype about AI, making the space more crowded, and accelerating progress towards doom. I don’t think the jailbreaking stuff changes the calculus meaningfully and so shouldn’t be evidence about what they did or did not understand.
I think there’s also a plausible case that the hallucination problems are damaging enough to justify delaying release until there is some fix, I also think it’s quite reasonable to just display the failures prominently and to increase the focus on fixing this kind of alignment problem (e.g. by allowing other labs to clearly compete with OpenAI on alignment). But this just makes it even more wrong to say “the key talent is not the ability to imagine up precautions but the ability to break them up,” the key limit is that OpenAI doesn’t have a working strategy.

paulfchristiano 31 May 2023 20:12 UTC
200 points
86
on: Cosmopolitan values don’t come free
I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I’m not objecting to the narrow claim “doesn’t come for free.” But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.
(Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.)
Humans care about the preferences of other agents they interact with (not much, just a little bit!), even when those agents are weak enough to be powerless. It’s not just that we have some preferences about the aesthetics of cows, which could be better optimized by having some highly optimized cow-shaped objects. It’s that we actually care (a little bit!) about the actual cows getting what they actually want, trying our best to understand their preferences and act on them and not to do something that they would regard as crazy and perverse if they understood it.
If we kill the cows, it’s because killing them meaningfully helped us achieve some other goals. We won’t kill them for arbitrarily insignificant reasons. In fact I think it’s safe to say that we’d collectively allocate much more than 1/millionth of our resources towards protecting the preferences of whatever weak agents happen to exist in the world (obviously the cows get only a small fraction of that).
Before really getting into it, some caveats about what I want to talk about:
- I don’t want to focus on whatever form of altruism you and Eliezer in particular have (which might or might not be more dependent on some potentially-idiosyncratic notion of “sentience.”) I want to talk about caring about whatever weak agents happen to actually exist, which I think is reasonably common amongst humans. Let’s call that “kindness” for the purpose of this comment. I don’t think it’s a great term but it’s the best short handle I have.
- I’ll talk informally about how quantitatively kind an agent is, by which I mean something like: how much of its resources it would allocate to helping weak agents get what they want? How highly does it weigh that part of its preferences against other parts? To the extent it can be modeled as an economy of subagents, what fraction of them are kind (or were kind pre-bargain)?
- I don’t want to talk about whether the aliens would be very kind. I specifically want to talk about tiny levels of kindness, sufficient to make a trivial effort to make life good for a weak species you encounter but not sufficient to make big sacrifices on its behalf.
- I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.
You and Eliezer seem to think there’s a 90% chance that AI will be <1/trillion (perhaps even a 90% chance that they have exactly 0 kindness?). But we have one example of a smart mind, and in fact: (i) it has tons of diverse shards of preference-on-reflection, varying across and within individuals (ii) it has >1/million kindness. So it’s superficially striking to be confident AI systems will have a million times less kindness.
I have no idea under what conditions evolved or selected life would be kind. The more preferences are messy with lots of moving pieces, the more probable it is that at least 1/trillion of those preferences are kind (since the less correlated the trillion different shards of preference are with one another and so the more chances you get). And the selection pressure against small levels of kindness is ~trivial, so this is mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics.
I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs. Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?
(And maybe I’m being unfair here by lumping you and Eliezer together—maybe in the previous post you were just talking about how the hypothetical AI that had 0 kindness would kill us, and in this post how kindness isn’t guaranteed. But you give really strong vibes in your writing, including this post. And in other places I think you do say things that don’t actually add up unless you think that AI is very likely to be <1/trillion kind. But at any rate, if this post is unfair to you, then you can just sympathize and consider it directed at Eliezer instead who lays out this position much more explicitly though not in a convenient place to engage with.)
Here are some arguments you could make that kindness is unlikely, and my objections:
- “We can’t solve alignment at all.” But evolution is making no deliberate effort to make humans kind, so this is a non-sequitur.
- “This is like a Texas sharpshooter hitting the side of a barn then drawing a target around the point they hit; every evolved creature might decide that their own idiosyncrasies are common but in reality none of them are.” But all the evolved creatures wonder if a powerful AI they built would kill them or if if it would it be kind. So we’re all asking the same question, we’re not changing the question based on our own idiosyncratic properties. This would have been a bias if we’d said: humans like art, so probably our AI will like art too. In that case the fact that we were interested in “art” was downstream of the fact that humans had this property. But for kindness I think we just have n=1 sample of observing a kind mind, without any analogous selection effect undermining the inference.
- “Kindness is just a consequences of misfiring [kindness for kin / attachment to babies / whatever other simple story].” AI will be selected in its own ways that could give rise to kindness (e.g. being selected to do things that humans like, or to appear kind). The a priori argument for why that selection would lead to kindness seems about as good as the a priori argument for humans. And on the other side, the incentives for humans to be not kind seem if anything stronger than the incentives for ML systems to not be kind. This mostly seems like ungrounded evolutionary psychology, though maybe there are some persuasive arguments or evidence I’ve just never seen.
- “Kindness is a result of the suboptimality inherent in compressing a brain down into a genome.” ML systems are suboptimal in their own random set of ways, and I’ve never seen any persuasive argument that one kind of suboptimality would lead to kindness and the other wouldn’t (I think the reverse direction is equally plausible). Note also that humans absolutely can distinguish powerful agents from weak agents, and they can distinguish kin from unrelated weak agents, and yet we care a little bit about all of them. So the super naive arguments for suboptimality (that might have appealed to information bottlenecks in a more straightforward way) just don’t work. We are really playing a kind of complicated guessing game about what is easy for SGD vs easy for a genome shaping human development.
- “Kindness seems like it should be rare a priori, we can’t update that much from n=1.” But the a priori argument is a poorly grounded guess about about the inductive biases of spaces of possible minds (and genomes), since the levels of kindness we are talking about are too small to be under meaningful direct selection pressure. So I don’t think the a priori arguments are even as strong as the n=1 observation. On top of that, the more that preferences are diverse and incoherent the more chances you have to get some kindness in the mix, so you’d have to be even more confident in your a priori reasoning.
- “Kindness is a totally random thing, just like maximizing squiggles, so it should represent a vanishingly small fraction of generic preferences, much less than 1/trillion.” Setting aside my a priori objections to this argument, we have an actual observation of an evolved mind having >1/million kindness. So evidently it’s just not that rare, and the other points on this list respond to various objections you might have used to try to salvage the claim that kindness is super rare despite occurring in humans (this isn’t analogous to a Texas sharpshooter, there aren’t great debunking explanation for why humans but not ML would be kind, etc.). See this twitter thread where I think Eliezer is really off base, both on this point and on the relevance of diverse and incoherent goals to the discussion.
Note that in this comment I’m not touching on acausal trade (with successful humans) or ECL. I think those are very relevant to whether AI systems kill everyone, but are less related to this implicit claim about kindness which comes across in your parables (since acausally trading AIs are basically analogous to the ants who don’t kill us because we have power).
A final note, more explicitly lumping you with Eliezer: if we can’t get on the same page about our predictions I’m at at least aiming to get folks to stop arguing so confidently for death given takeover. It’s easy to argue that AI takeover is very scary for humans, has a significant probability of killing billions of humans from rapid industrialization and conflict, and is a really weighty decision even if we don’t all die and it’s “just” handing over control over the universe. Arguing that P(death|takeover) is 100% rather than 50% doesn’t improve your case very much, but it means that doomers are often getting into fights where I think they look unreasonable.
I think OP’s broader point seems more important and defensible: “cosmopolitanism isn’t free” is a load-bearing step in explaining why handing over the universe to AI is a weighty decision. I’d just like to decouple it from “complete lack of kindness.”
What links here?

paulfchristiano 12 Jul 2022 18:49 UTC
LW: 176 AF: 76
56
AF
on: On how various plans miss the hard bits of the alignment challenge
I’m going to spend most of this comment responding to your concrete remarks about ELK, but I wanted to start with some meta level discussion because it seems to cut closer to the heart of the issue and might be more generally applicable.
I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand then try to make progress on them. This involves acknowledging that we can’t anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise. It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
My sense is that you have more faith in a rough intuitive sense you’ve developed of what the “hard part” of alignment is, and so you’d primarily recommend thinking about that until we feel less confused. I disagree in large part because I feel like your broad intuitive sense has not yet had much opportunity to make contact with either reality or with formal reasoning, and I’d guess it’s not precise enough to be a useful guide to research prioritization.
More concretely, you talk about novel mechanisms by which AI systems gain capabilities, but I think you haven’t said much concrete about why existing alignment work couldn’t address these mechanisms. This looks to me like a pretty unproductive stance; I suspect you are wrong about the shape of the problem, but if you are right then I think your main realistic path to impact involves saying something more concrete about why you think this.
I think you don’t see the situation the same way, probably because you feel like you have said plenty concrete. Perhaps this is the most serious disagreement of all. I don’t think saying there is a “capabilities well” is helpfully concrete until you say something about what it looks like, why it poses alignment problems different from SGD and why particular approaches don’t generalize, etc.
In ARC’s day to day work we write down particular models of capabilities that would generalize far outside of training (e.g.: what about a causal model of the world that holds robustly? what about logical deduction from valid premises with longer chains of reasoning? what about continuing to learn by trial and error when deployed in a novel environment?), and ask about whether a given alignment solution would generalize along with them. If we can find any gap, then that it goes on the list of problems. We focus on the gaps that seem least likely to be addressable by using known techniques, and try to develop new techniques or to identify general reasons why the gap is unresolvable.
My guess is that you are playing a roughly similar game much more informally, and that you are just making a mistake because reasoning about this stuff is in fact hard. But I can’t really tell, since your thinking is happening in private and we are seeing the vague intuitions that result. (I’ve been hanging around MIRI for a long time, and I suspect I have a better model of your and Eliezer’s position than virtually anyone else outside of MIRI, yet this is still where I’m at.)
Anyway, now turning to your discussion of ELK in particular.
Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just “expose the bad behavior” to gradients that you can hit to correct the thing, at least not easily and quickly.
I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:
1. Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
2. Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
3. Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
4. Our AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.
Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:
1. Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
  1. Because the search is on the inside, we can’t directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it’s nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easily). This is what we’re talking about in this appendix, and it’s part of why we are skeptical about approaches to ELK based on simple regularizers. But we don’t see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It’s pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like but we don’t see an in principle reason it’s hard.
  2. The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We’re definitely in the market for other search algorithms that cause trouble but don’t yet know of any.
2. Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we’d like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obvious impossible to me). It’s hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it’s both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
3. If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we’ve been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
4. One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don’t consider it an existential challenge for our approach:
  1. If you’ve succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for “new AI algorithms with new alignment problems” but also for all of the solutions to those problems, so I don’t think it changes the game from future humans. And so I’d focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
  2. I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won’t transfer, then we can discuss those and whether they should affect research prioritization. So far I don’t think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don’t actually think that’s the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I’m certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)
Possible disagreements between us: (i) you think that at least one of these examples looks really bad for our approach, (ii) you have other examples in mind, (iii) you don’t think we can write down a concrete example that looks bad, but we have reason to expect other kinds of capability gains that will be bad, (iv) nothing looks like a dealbreaker in particular, but it’s just contributing to a long list of problems you’d have to solve and that’s either a lot of work or something probably won’t work out.
For me, the upshot of all of this is that SGD poses some obvious problems, that those problems are the most likely to actually occur, that they seem similar to (and at least subproblems of) the other alignment problems we may face, and that there are neither super compelling alternatives to aligning SGD nor particular arguments that the rest of the problem is harder than this step.
Your second problem is that the AGI’s concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They’ve got some hunger instincts in there, but it’s not like they’re smart enough yet to represent the concept of “inclusive genetic fitness” correctly, so you figure you’ll just fix it when they get capable enough to understand the alternative (of eating because it’s instrumentally useful for procreation). And so far you’re doing great: you’ve basically decoded the visual cortex, and have a pretty decent understanding of what it’s visualizing.
Our goal is to learn a reporter that describes the latent knowledge of the model, and to keep this up to date as the model changes under SGD. If thinking about SGD, we usually think concretely about a single step of SGD, and how you could find a good reporter at the end of that gradient descent step assuming you had one at the beginning.
It feels to me like what you are saying here is just “you might not be able to solve ELK.” Or else maybe restating the previous point, that the model builds latent knowledge by mechanisms other than SGD and therefore you need to learn a reporter that can also follow along with those other mechanisms.
In either case, I can’t speak to whether it’s helpful for the audience understanding why ELK is hard, but it is certainly not helping me understand why you think ELK is hard. I think this discussion is just too vague to be helpful.
I think it’s not crazy for you to say “ARC’s hopes about how to solve ELK are too vague to seem worth engaging with” (this is pretty similar to me saying “Nate’s arguments about why alignment is hard are too vague to seem worth engaging with”).
Analogously, your ELK head’s abilities are liable to fall off a cliff right as the AGI’s capabilities start generalizing way outside of its training distribution.
But can you say something concrete about why? What I’d like to do is talk about what the AGI is actually thinking, the particular computation it’s running, so that we can talk about why that computation keeps being correlated with reality off distribution and then ask whether the reporter remains correlated with reality. When I go through this exercise I don’t see big dealbreakers, and I can’t tell if you disagree with that diagnosis, or if you are noticing other things that might be going on inside the AI, or if the difference is that I think “this looks like it might work in all the concrete cases we can see” is a relevant signal and you think “nah the cases we can’t see are way worse than those we can see.”
And if they don’t, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today’s scaling curves until they scaled that far.
Again, this seems too vague to be helpful, or perhaps just mistaken. The reporter is not some other AI looking at your predictor and trying to “decode its workings,” or maybe it is but if so it’s just because those english words are vague and broad. Can we talk about the particular kinds of cognition that your AI might be performing, such that you don’t think this works? (Or which would require the reporter to itself be using magic-mystery-juice-of-intelligence?)
That’s really the central theme of my response, so it’s worth restating: ARC loves examples of ways an AI might be thinking such that ELK is difficult. But your description of the sharp left turn is too vague to be helpful for this purpose, and so I’d either like to turn this into more concrete discussion of the internals of the algorithm, or else some significantly more precise argument about why we expect the unknown possible internals to be so much less favorable for ELK than any of the concrete examples we can write down.^[1]
1. ^
  I’d like to head off a possible response you might make that I disagree with: “Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don’t understand why it works. So of course it works on concrete examples but not in the unknown real world.” . I’m putting this in a footnote because it seems like a digression and I have no idea if this is your view.
  My main response is that we can in fact talk about concrete examples where “why your AI system’s cognition works” isn’t accessible to humans in the relevant ways:
  We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasons itself). Then we can consider whether our AI alignment strategies would work even if humans hadn’t figured out the relevant facts about reasoning.
  We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
  Most of our ELK approaches don’t make no-holds-barred use of “can a human come up with some story about why this AI cognition may work,” and so this just isn’t a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn’t expect a particular key threshold at the space of strategies that a human understands.
What links here?

paulfchristiano 24 Nov 2021 22:05 UTC
LW: 159 AF: 53
0
AF
on: Yudkowsky and Christiano discuss “Takeoff Speeds”
I stand ready to bet with Eliezer on any topic related to AI, science, or technology. I’m happy for him to pick but I suggest some types of forecast below.
If Eliezer’s predictions were roughly as good as mine (in cases where we disagree), then I would update towards taking his views more seriously. Right now it looks to me like his view makes bad predictions about lots of everyday events.
It’s possible that we won’t be able to find cases where we disagree, and perhaps that Eliezer’s model totally agrees with mine until we develop AGI. But I think that’s unlikely for a few reasons:
- I constantly see observations that seem like evidence for Eliezer’s views (e.g. any time I see an ML paper with a surprisingly large effect size, or ML labs failing to make investments in scaling, or people being surprisingly unreasonable), it’s just that I see significantly more evidence against his views. The point of making bets in advance is that it can correct for my hindsight bias or for my inability to simulate “what Eliezer’s view would say about this.” Eliezer could also say that actually all of the observations I listed aren’t evidence for his view, which would be interesting to me.
- Eliezer frequently talks smack about how the real world is surprising to fools like Paul (e.g. he talks about the “sort of person who gets taken in by Hanson’s arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2”,). If that’s right, then it must correspond to differences in prediction. And if Eliezer literally can’t state where he expects to make better predictions than me other than AGI then I think people should mostly ignore the bluster and he should probably cut it out.
- Eliezer frequently acknowledges “sure, lines look straight in hindsight, but that’s not how they look at the time.” But to me it looks like lines are also (mostly) straight even with foresight. How could this not correspond to some difference in prediction? I’d be happy to use historical case studies instead of predictions, but Eliezer thinks you need to make them in advance—so I’m happy to just apply my straight-line-extrapolation methodology to arbitrary near-term forecasts. I think Eliezer would prefer that I somehow make predictions and evaluate them in absolute terms rather than by comparing to Eliezer’s predictions, but that’s not what at’s issue—I think my forecasts are more accurate than Eliezer’s, not that they meet some absolute bar of quality.
- When trying to define bets, I think we get stuck at the stage where Eliezer isn’t giving probability distributions over quantitative measures, not the stage where Eliezer gives them but they’re the same as mine. My tentative guess is that Eliezer can’t predict what-Paul-would-call-a-reasonable-forecast, rather than understanding what Paul would forecast but disagreeing with it. This is related to disagreements over how to interpret the past evidence. I’m less clear on whether I can simulate Eliezer forecasts.
Anyway, I think Eliezer should probably pick a domain where he thinks his model shines. But I’m going to propose some domains where I expect to find disagreements and where I expect to beat his model, just to help get the ball rolling:
- Performance on any ML benchmark in 1, 2, 5 years. Happy to propose examples (basically taking those from existing work) in theorem proving, standard NLP, mathematical reasoning, or coding.
- Performance on any interesting real-world tasks where we can readily define the task in 1, 2, 5 years. Happy to propose examples on e.g. translation, picking robots, self-driving cars.
- Signs of impact from various kinds of AI in 2, 5, 10 years, e.g. coding, marketing copy, industrial robotics, self-driving cars, translation, whatever.
- Progress in performance or adoption for non-AI technologies, e.g. energy (solar, fission, fusion, wind…), various parts of biotech or materials science, whatever.
- Total investment in AI research of various kinds, either in the industry overall or at particular labs.
- Total valuation of AI companies, hardware companies, or whatever.
- Sizes of improvements over SOTA from ML papers in various domains.
- Relative success of different ML approaches, e.g. importance of architectural changes vs transformers, how much gradient descent will play a role in future results, meta-learning vs fine-tuning…
- Specific claims about model sizes, training costs, the role of planning, etc. in high-profile results.
I’m happy to provide more specific operationalizations and questions in any of those domains, if there are any categories where Eliezer is up for actually forecasting.
The high-level patterns that I think will generate lots of moderate lower-level disagreements:
- I expect things to be significantly more incremental and “boring.” I put smaller probabilities on trend breaks and big jumps, and I have a strong sense for many kinds of metrics that move more regularly. I think Eliezer literally can’t tell how to translate this heuristic into predictions, which is part of why I think he is going to predictably make bad predictions.
- I think I have more understanding of modern AI in particular, so I expect to make better predictions for boring reasons for anything in that space.
- I generally expect a continuing ramp-up in AI investment and effort, and for that to lead to predictable changes as the field scales.
- I have a different picture of how AI will work where AGI is not special and so won’t affect any evaluations of tasks in the near future, leading to more “boring” claims about the hardness of different tasks (though not sure this will generate disagreements within 5 years).
We might try to use operationalizations like: “In how many of these 10 quantities is there a year with 4x more change than any previous year” (h/t holden), or “How much of the economic value of AI comes from applications whose value has more than doubled in the last year?” or “For each of these pairs of capabilities, which will happen first if at least one happens in the next 5 years?” or so on. But even if we can’t find something clever, I feel like the differences in quantitative view are stark enough that we’re just going to disagree about a bunch of numbers.
I would prefer state predictions and discuss rationales publicly, allow some informed folks to kibitz, and then revise based on people pointing out facts we don’t know, since I think that makes it cheaper to make forecasts and reduces the probability that the test is decided by specific facts rather than a general view.
What links here?

paulfchristiano 27 Aug 2022 19:41 UTC
LW: 153 AF: 58
30
AF
in reply to: Adam Scholl’s comment on: Common misconceptions about OpenAI
A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI. Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines. This is related to lower estimates of alignment risk, higher estimates of the importance of geopolitical conflict, and (perhaps most importantly of all) radically lower estimates for the amount of useful alignment progress that would occur this far in advance of AI if progress were to be slowed down. Below I’ll also discuss two arguments that delaying AI progress would on net reduce alignment risk which I often encountered at OpenAI.
I think that OpenAI has had a meaningful effect on accelerating AI timelines and that this was a significant cost that the organization did not adequately consider (plenty of safety-focused folk pushed back on various accelerating decisions and this is ultimately related to many departures though not directly my own). I also think that OpenAI is significantly driven by the desire to do something impactful and to reap the short-term benefits of AI. In significant part that’s about wanting to be involved in altruistic benefits (though it’s also based on a more basic and generally scary desire to just do something impactful). I think that OpenAI folks’ views on altruistic benefits are based on some claims I agree with about possible impacts, but also on them caring less than I do about future generations and by having what I regard as mistaken empirical views (which partly persist because many folks have underinvested in careful thinking about the future).
That said, I think that the LW community significantly overestimates the negative impact of OpenAI’s timeline-accelerating effects to date, and I suspect that these do not dominate their net impacts (neither do the claims about disrupting a relatively flimsy “only DeepMind works on AGI” equilibrium). That still leaves room for debate about whether the other impacts are positive or negative.
It’s worth being aware of some common arguments that acceleration is less bad than it looks or even net positive:
- I think it’s basically reasonable to think that MIRI and the broader AI safety community made very little meaningful progress over the last 10 years, and to have the view that the overwhelmingly dominant drivers of accelerating alignment progress have been and will continue to be increased interest and investment as AI improves (this seems wrong to me in large part because the AI safety community and EA community more broadly have been growing independent of increased interest in AI). If that were the case, then cutting one month off of AI timelines does not have much direct effect on our ability to manage AI risk via giving us more time for alignment research, and the calculus is instead dominated by other trends in the world are positive or negative (e.g. how much do you think general institutional capacity is improving vs deteriorating over the coming decades, how worried are you about rise of China relative to the west, etc.)
- Another fairly common argument and motivation at OpenAI in the early days was the risk of “hardware overhang,” that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago. I think the LW community considers this argument non-serious, but in my opinion (and I expect the judgment of most independent observers) the empirical track record of this community and Eliezer on the relevant AI forecasts seems bad enough that no one should be taking community consensus on that point as a source of independent evidence.
I think that both of those arguments were significantly more plausible in the past and particularly before the release of GPT-3, though I still think they were wrong and likely in significant part the result of motivated cognition (or more realistically memetic and political selection within OpenAI and the adjacent communities).
At this point I think it’s fairly clear that if OpenAI were focused on making the long-term future good they should not be disclosing or deploying improved systems (and it seems likely to me that they should not even be developing them), so the main point of debate is exactly how bad it is. I think it’s less obvious whether it is good or bad on a certain kind of myopic altruism since I’d guess that the cost of 1 year of acceleration is less than a 1% reduction in survival probability (while ~1% of people die each year and people might reasonably value the profound suffering that occurs over a single year at 1% of survival).
Overall I think the LW community tends to be kind of deontological about this, and that when making quantitative estimates they tend to be at best debatable (and wildly overconfident and aggressive). I’d guess these overall decrease the efficiency of the LW community as a good influence on labs or force for good in the world.
What links here?
- Trying to disambiguate different questions about whether RLHF is “good” by Buck (14 Dec 2022 4:03 UTC; 106 points)
- Why deceptive alignment matters for AGI safety by Marius Hobbhahn (15 Sep 2022 13:38 UTC; 57 points)

paulfchristiano 11 Jul 2019 8:42 UTC
153 points
on: The AI Timelines Scam
I agree with:
- Most people trying to figure out what’s true should be mostly trying to develop views on the basis of public information and not giving too much weight to supposed secret information.
- It’s good to react skeptically to someone claiming “we have secret information implying that what we are doing is super important.”
- Understanding the sociopolitical situation seems like a worthwhile step in informing views about AI.
- It would be wild if 73% of tech executives thought AGI would be developed in the next 10 years. (And independent of the truth of that claim, people do have a lot of wild views about automation.)
I disagree with:
- Norms of discourse in the broader community are significantly biased towards short timelines. The actual evidence in this post seems thin and cherry-picked. I think the best evidence is the a priori argument “you’d expect to be biased towards short timelines given that it makes our work seem more important.” I think that’s good as far as it goes but the conclusion is overstated here.
- “Whistleblowers” about long timelines are ostracized or discredited. Again, the evidence in your post seems thin and cherry-picked, and your contemporary example seems wrong to me (I commented separately). It seems like most people complaining about deep learning or short timelines have a good time in the AI community, and people with the “AGI in 20 years” view are regarded much more poorly within academia and most parts of industry. This could be about different fora and communities being in different equilibria, but I’m not really sure how that’s compatible with “ostracizing.” (It feels like you are probably mistaken about the tenor of discussions in the AI community.)
- That 73% of tech executives thought AGI would be developed in the next 10 years. Willing to bet against the quoted survey: the white paper is thin on details and leaves lots of wiggle room for chicanery, while the project seems thoroughly optimized to make AI seem like a big deal soon. The claim also just doesn’t seem to match my experience with anyone who might be called tech executives (though I don’t know how they constructed the group).

paulfchristiano 15 Apr 2022 20:34 UTC
LW: 151 AF: 56
AF
on: Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon
I agree that people can easily fail to fix alignment problems, and can instead paper over them, even given a long time to iterate. But I’m not really convinced about your analogy with single-hose air conditioners.
Physics:
The air coming out of the exhaust is often quite a bit hotter than the outside air. I’ve never checked myself, but just googling has many people reporting 130+ degree temperatures coming out of exhaust from single-hose units. I’m not sure how hot this unit’s exhaust is in particular, but I’d guess it’s significantly hotter than outside air.
If exhaust is 130 and you are trying to cool from 100 to 70 you’d then only be losing 50% efficiency. Most people won’t be cooling by 30 degrees so the efficiency losses would be smaller. In practice I think the actual efficiency loss relative to a 2-hose unit is more like 25-30% (see stats on top wirecutter picks below).
Discourse:
I actually think that this factor(sucking in hot air from the outside) is probably already included in the SACC (seasonally adjusted cooling capacity) and hence CEER reported for this air conditioner. I don’t really know anything about air conditioners but it’s discussed extensively in the definition of the standards for SACC (e.g. start at page 27 here).
A 2-hose unit will definitely cool more efficiently, but I think for many people who are using portable units it’s the right tradeoff with convenience. The wirecutter reviews both types of units together and usually end up preferring 1-hose units. Infiltration is discussed in the wirecutter article and in other articles advising people on whether to pick a 1-hose or 2-hose portable unit.
Meta:
Obviously the point about air conditioners doesn’t matter, but I feel like the general lesson is relevant. It’s important to be able to call the world on its bullshit (because I agree there is a lot of it), but that seems like it works better when coupled with discernment about what is and is not bullshit.
It’s particularly jarring to notice a huge conflict between a clever argument and a lot of people’s reported experiences, and then to have the confidence to not only believe the clever argument but to not even think the conflict is worth acknowledging or exploring.
What links here?

paulfchristiano 25 Jun 2022 20:55 UTC
139 points
67
on: Conversation with Eliezer: What do you want the system to do?
This is a frequent disagreement I have with Eliezer and he seems to consistently find my view either perplexing or obviously misguided. So I guess this is as good a place as any to express that view:
- I want AI to do a wide variety of things like:
  - Run factories, write software, manage militaries etc, and so these things about as well as an unaligned AI such that humans can reasonably expect to hold their own in a conflict with a smaller coalition using unaligned AI liberally.
  - Help make further progress on alignment, design/negotiate/enforce agreements between labs that reduce the risk of deploying unaligned AI, etc. and do these things about as well as possible given the underlying ML technology.
  - Generally make policy, enforce the law, forecast and respond to technological risks, make grants and run projects and do the whole EA thing, etc. in a way that helps us navigate alignment and also the other risks that will emerge rapidly in a faster-moving world. I think the good outcome is “like humans but faster.”
- I think that the technical problem “build competitive AI that doesn’t disempower humanity” is generally better than a problem like “build AI that can build nanotech without disempowering humanity,” in the sense that I strongly think people should have “competitive AI alignment” in mind as a goal day to day, rather than trying to tell a story about how their AI does some particular pivotal act. This is a non-trivial methodological claim (though so is Eliezer’s).
- Realistically I think the core issue is that Eliezer is very skeptical about the possibility of competitive AI alignment. That said, I think that even on Eliezer’s pessimistic view he should probably just be complaining about competitiveness problems rather than saying pretty speculative stuff about what is needed for a pivotal act.
- This is partly because I think the kind of story that Eliezer tells about AI building nanotechnology or brain emulations (or whatever other pivotal act he is imagining) doesn’t reflect how automated R&D is likely to actually look. It looks like it’s totally plausible for many kinds of limited systems to greatly accelerate R&D, and when Eliezer starts making concrete claims about what different kinds of systems can and can’t do I think he’s on pretty shaky ground (and e.g. I think this is where it’s most likely he’s going to be wrong if he tries to cash out this view as predictions about what AI can and can’t do in the near term). This is probably the second big disagreement, and is extremely important for these discussions.
- I am supportive of Eliezer’s general interest in pushing people to talk concretely, and I think that there is an important sense in which you really should be highly skeptical of any abstract story you can’t make concrete. I think that some proposals wouldn’t meaningfully reduce risk because proponents don’t have a realistic concrete scenario in mind, and their optimistic scenario is only able to seem realistic because it avoids being concrete. Unfortunately, I think that Eliezer jumps for this explanation way too quickly in general (I think this an instance of Eliezer having a library of 10-100 cognitive errors that he attributes whenever possible as an explanation for a disagreement). I think this can make it really painful to talk with Eliezer.
- When Eliezer talks concretely about possible futures it feels to me like he wants to have a very simplified story of the world, and is very unhappy when answers to “what does the AI do in 2030?” are anywhere near as complicated as “what do humans do in 2020?” For example, I think his methodology for talking about the world, and his practical method for diagnosing when someone has no concrete picture, would basically not work for someone living in 1800 who had a crystal ball looking at 2000. This would be easier to discuss in the context of more details about discussions with Eliezer.
- This is exacerbated by Eliezer’s desire to focus on what you might call the “endgame,” asking about what’s happening in a world where AI greatly outstrips humanity. I suspect that this world is mostly shaped by AIs, whether things are going well or poorly, and so it really is more like living in 1600 and talking about 2000. It’s meaningful to talk concretely about such a wildly different world full of people who know things you don’t. But I think you need to be aware of that when thinking about how to decide whether a story is realistic or not.
- As mentioned, I think those two things are likely downstream of Eliezer having high conviction that AI needs to do a narrow pivotal task in order to be safely alignable. But I think that the substance here is in a set of claims about technical alignment, AI capabilities, and policy, and that it’s burying the lede to frame this as being about “people don’t think concretely about what their AI needs to do.”
- I think a fair number of people are confused by this kind of question from Eliezer because it’s so obvious or natural that aligned AI would be used for a very broad variety of tasks and that it’s obviously hard to talk specifics without making predictably-false claims about the future. There’s still a game that Eliezer is inviting them to play, but he should more understand that this game is not a natural and simple game on other perspectives, so it’s going to take time to communicate it, especially when Eliezer is constantly making a bunch of background assumptions that other people aren’t into.
What links here?

paulfchristiano 25 Jul 2022 2:17 UTC
LW: 135 AF: 61
63
AF
on: Reward is not the optimization target
At some level I agree with this post—policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs.
However, I’m confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.
As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don’t think it changes anything I say here.)
It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn’t thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.
(Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
I generally agree that gradient descent won’t find optimal policies. But I don’t understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about “optimum of loss function” from “local optimum found by gradient descent” (since you are saying that thinking about “optimum of loss function” is systematically misleading). But I don’t understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
The main concrete thing you say in this post is that humans don’t seem to optimize reward. I want to make two observations about that:
- Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
- I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).
At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.
Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn’t change the existence of one strong reason.
That said, I agree that this isn’t a theorem or anything, and it’s great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I’m mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it’s the very strong take here that seems unreasonable.
What links here?

paulfchristiano 25 Oct 2023 2:55 UTC
LW: 130 AF: 61
71
AF
on: Lying is Cowardice, not Strategy
Here is a short post explaining some of my views on responsible scaling policies, regulation, and pauses I wrote it last week in response to several people asking me to write something. Hopefully this helps clear up what I believe.
I don’t think I’ve ever hidden my views about the dangers of AI or the advantages of scaling more slowly and carefully. I generally aim to give honest answers to questions and present my views straightforwardly. I often point out that catastrophic risk would be lower if we could coordinate to build AI systems later and slower; I usually caveat that doing so seems costly and politically challenging and so I expect it to require clearer evidence of risk.

paulfchristiano 20 Feb 2023 1:57 UTC
LW: 116 AF: 46
57
AF
on: AGI in sight: our look at the game board
RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples.
These three links are:
- The first is Mysteries of mode collapse, which claims that RLHF (as well as OpenAI’s supervised fine-tuning on highly-rated responses) decreases entropy. This doesn’t seem particularly related to any of the claims in this paragraph, and I haven’t seen it explained why this is a bad thing.
- The second is Discovering language model behaviors with model-written evaluations and shows that Anthropic’s models trained with RLHF have systematically different personalities than the pre-trained model. I’m not exactly sure what claims you are citing, but I think it probably involves some big leaps to interpret this as either directly harmful or connected with traditional stories about risk.
- The third is Compendium of problems with RLHF, which primarily links to the previous 2 failures and then discusses theoretical limitations.
I think these are bad citations for the claim that methods are “not working well” or that current evidence points towards trouble.
The current problems you list—”unhelpful, untruthful, and inconsistent”—don’t seem like good examples to illustrate your point. These are mostly caused by models failing to correctly predict which responses a human would rate highly. That happens because models have limited capabilities and is rapidly improving as models get smarter. These are not the problems that most people in the community are worried about, and I think it’s misleading to say this is what was “theorized” in the past.
I think RLHF is obviously inadequate for aligning really powerful models, both because you cannot effectively constrain a deceptively aligned model and because human evaluators will eventually not be able to understand the consequences of proposed actions. And I think it is very plausible that large language models will pose serious catastrophic risks from misalignment before they are transformative (it seems very hard to tell). But I feel like this post isn’t engaging with the substance of those concerns or sensitive to the actual state of evidence about how severe the problem looks like it will be or how well existing mitigations might work.

paulfchristiano 19 Jun 2022 21:11 UTC
115 points
48
in reply to: alyssavance’s comment on: Where I agree and disagree with Eliezer
I definitely agree that Eliezer’s list of lethalities hits many rhetorical and pedagogical beats that other people are not hitting and I’m definitely not hitting. I also agree that it’s worth having a sense of urgency given that there’s a good chance of all of us dying (though quantitatively my risk of losing control of the universe though this channel is more like 20% than 99.99%, and I think extinction is a bit less less likely still).
I’m not totally sure about the net effects of the more extreme tone, I empathize with both the case in favor and the case against. Here I’m mostly just trying to contribute to the project of “get to the bottom of what’s likely to happen and what should be done.”
I did start the post with a list of 19 agreements with Eliezer, including many of the claims that are most relevant to the urgency, in part so that I wouldn’t be misconstrued as arguing that everything is fine.
What links here?
- Mo Putera's comment on Why Do AI researchers Rate the Probability of Doom So Low? by Aorou (24 Sep 2022 7:19 UTC; 7 points)

paulfchristiano 13 Jun 2022 0:36 UTC
113 points
33
on: Why all the fuss about recursive self-improvement?
The weighty conclusion of the “recursive self-improvement” meme is not “expect seed AI”. The weighty conclusion is “sufficiently smart AI will rapidly improve to heights that leave humans in the dust”.
Note that this conclusion is still, to the best of my knowledge, completely true, and recursive self-improvement is a correct argument for it.
This whole discussion seems relevant to me because it feels like it keeps coming up when you and Eliezer talk about why prosaic AI alignment doesn’t help, sometimes explicitly (“Even if this helped with capabilities produced by SGD, why would it help in the regime that actually matters?”) and often because it just seems to be a really strong background assumption for you that leads to you having a very different concrete picture of what is going to happen.
It doesn’t seem like recursive self-improvement is a cheap lower bound argument, it seems like you really think that what I think of as the “normal, boring” world just isn’t going to happen. So I’m generally interested in talking about that and get clear about what you think is going on here, and hopefully get some predictions on the record.
This also gives me the sense that you feel quite strongly about your view of recursive self-improvement. If you had a 50% chance on “something like boring business as usual with SGD driving crucial performance improvements at the crucial time” then your dismissal of prosaic AI alignment seems strange to me.
(ETA: there’s actually a lot going on here, I’d guess this is like 1/4th of the total disagreement.)
Robin Hanson was claiming things along the lines of ‘The power is in the culture; superintelligences wouldn’t be able to outstrip the rest of humanity.’
Worth noting that Robin seems to strongly agree that “recursive self-improvement” is going to happen, it’s just that he has a set of empirical views for which that name sounds silly and it won’t be as local or fast as Eliezer thinks.
Relatedly, Eliezer saying “Robin was wrong for doubting RSI; if other crazy stuff will happen before RSI then he’s just even more wrong” seems wrong. In Age of Em I think Robin speculates that within a few years of the first brain emulations, there will be more alien AI systems which are able to double their own productivity within a few weeks (and then a few weeks later it will be even crazier)! That sure sounds like he’s on board with the part of RSI that is obvious, and what he’s saying is precisely that other crazy stuff will happen first, essentially that we will use computers to replace the hardware of brains before we replace the software. (The book came out in 2016 but I think Robin has had the basic outline of this view since 2012 or earlier.)
The big update over the last decade has been that humans might be able to fumble their way to AGI that can do crazy stuff before it does much self-improvement.
This feels to me like it’s still missing a key part of the disagreement, at least with people like me. As best I can tell/guess, this is also an important piece of the disagreement with Robin Hanson and with some of the OpenAI or OpenPhil people who don’t like your discussion of recursive self-improvement.
Here’s how the situation seems to me:
- “Making AI better” is one of the activities humans are engaged in.
- If AI were about as good as things at humans, then AI would be superhuman at “making AI better” at roughly the same time it was superhuman at other tasks.
- In fact there will be a lot of dispersion, and prima facie we’d guess that there are a lot of tasks (say 15-60% of them as a a made up 50% confidence interval) where AI is superhuman before AI R&D.
- What’s more, even within R&D we expect some complementarity where parts of the project get automated while humans still add value in other places, leading to more continuous (but still fairly rapid, i.e. over years rather than decades) acceleration.
- That said, at the point when AI is capable of doing a lot of crazy stuff in other domains, “AI R&D” is a crazy important part of the economy, and so this will be a big but not overwhelmingly dominant part of what AI is applied to (and relatedly, a big but not overwhelmingly dominant part of where smart people entering the workforce go to work, and where VCs invest, and so on).
- The improvements AI systems make to AI systems are more like normal AI R&D, and can be shared across firms in the same way that modern AI research can be.
As far as I can make out from Eliezer and your comments, you think that instead the action is crossing a criticality threshold of “k>1,” which suggests a perspective more like:
- AI is able to do some things and not others.
- The things AI can do, it typically does much better/faster/cheaper than humans.
- Early AI systems can improve some but not all parts of their own design. This leads to rapid initial progress, but diminishing returns (basically they are free-riding on parts of the design already done by humans).
- Eventually AI is able to improve enough stuff that there are increasing rather than diminishing returns to scale even within the subset of improvements that the AI is able to make.
- Past this point progress is accelerating even without further human effort (which also leads to expanding the set of improvements at which the AI is superhuman). So from here the timescale for takeoff is very short relative to the timescale of human-driven R&D progress.
- This is reasonably likely to happen from a single innovation that pushes you over a k>1 threshold.
- This dynamic is a central part of the alignment and policy problem faced by humans right now who are having this discussion. I.e. prior to the time when this dynamic happens most research is still being done by humans, the world is relatively similar to the world of today, etc.
- The improvements made by AI systems during this process are very unlikely modern R&D, and so can’t be shared between AI labs in the same way that e.g. architectural innovations for neural networks or new training strategies can be.
I feel like the first picture is looking better and better with each passing year. Every step towards boring automation of R&D (e.g. by code models that can write mediocre code and thereby improve the efficiency of normal software engineers and ML researchers) suggests that AI will be doing recursive self-improvement around the same time it is doing other normal tasks, with timescales and economic dynamics closer to those envisioned by more boring people.
On what I’m calling the boring picture, “k>1” isn’t a key threshold. Instead we have k>1 and increasing returns to scale well before takeoff. But the rate of AI progress is slow-but-accelerating relative to human abilities, and therefore we can forecast takeoff speed by looking at the rate of AI progress when driven by human R&D.
You frame this as an update about “fumbling your way to an AGI that can do crazy stuff before it does much self-improvement,” but that feels to me like it’s not engaging with the basic argument at issue here: why would you think the AI is likely to be so good at “making further AI progress” relative to human researchers and engineers? Why should we be at all surprised by what we’ve seen over recent years, where software-engineering AI seems like it behaves similarly to AI in other domains (and looks poised to cross human level around broadly the same time rather than much earlier)? Why should this require fumbling rather than being the default state of affairs (as presumably imagined by someone more skeptical of “recursive self-improvement”).
My impression of the MIRI view here mostly comes from older writing by Eliezer, where he often talks about how an AI would be much better at programming because humans lack what you might call a “codic cortex” and so are very bad programmers relative to their overall level of intelligence. This view does not seem to me like it matches the modern world very well—actual AI systems that write code (and which appear on track to accelerate R&D) are learning to program using similar styles and tools to humans, rather than any kind of new perceptual modality.
(As an aside, in most other ways I like the intelligence explosion microeconomics writeup. It just seems like there’s some essential perspective that isn’t really argued for but suffuses the document, most clear in its language of “spark a FOOM” and criticality thresholds and so on.)
Also: It’s important to ask proponents of a theory what they predict will happen, before crowing about how their theory made a misprediction. You’re always welcome to ask for my predictions in advance.
I’d be interested to get predictions from you and Eliezer about what you think is going to happen in relevant domains over the next 5 years. If we aren’t able to get those predictions, then it seems reasonable to just do an update based on what we would have predicted if we took your view more seriously (since that’s pretty relevant if we are now deciding whether to take your views seriously).
If you wanted to state any relevant predictions I’d be happy to comment on those. But I understand how it’s annoying to leave the ball in your court, so here are some topics where I’m happy to give quantitative predictions if you or Eliezer have a conflicting intuition:
- I expect successful AI-automating-AI to look more like AI systems doing programming or ML research, or other tasks that humans do. I think they are likely to do this in a relatively “dumb” way (by trying lots of things, taking small steps, etc.) compared to humans, but that the activity will look basically similar and will lean heavily on oversight and imitation of humans rather than being learned de novo (performing large searches is the main way in which it will look particularly unhuman, but probably the individual steps will still look like human intuitive guesses rather than something alien). Concretely, we could measure this by either performance on benchmarks or economic value, and we could distinguish the kinds of systems I imagine from the kind you imagine by e.g. you telling a story about fast takeoff and then talking about some systems similar to those involved in your takeoff story.
- I expect that the usefulness and impressiveness of AI systems will generally improve continuously. I expect that in typical economically important cases we will have a bunch of people working on relevant problems, and so will have trend lines to extrapolate, and that those will be relatively smooth rather than exhibiting odd behavior near criticality thresholds.
- At the point when the availability of AI is doubling the pace of AI R&D, I expect that technically similar AI systems will be producing at least hundreds of billions of dollars a year of value in other domains, and my median is more like $1T/year. I expect that we can continue to meaningfully measure things like “the pace of AI R&D” by looking at how quickly AI systems improve at standard benchmarks.
- I expect the most powerful AI systems (e.g. those responsible for impressive demonstrations of AI-accelerated R&D progress) will be built in large labs, with compute budgets at least in the hundreds of millions of dollars and most likely larger. There may be important innovations about how to apply very large models, but these innovations will have quantitatively modest effects (e.g. reducing the compute required for an impressive demonstration by 2x or maybe 10x rather than 100x) and so a significant fraction of the total value added / profit will flow to firms that train large models or who build large computing clusters to run them.
- I expect AI to look qualitatively like (i) “stack more layers,” (ii) loss functions and datasets that capture cognitive abilities we are interested in with less noise, (iii) architecture and optimization improvements that yield continuous progress in performance, (iv) cleverer ways to form large teams of trained models that result in continuous progress. This isn’t a very confident prediction but it feels like I’ve got to have higher probability on it than you all, perhaps I’d give 50% that in retrospect someone I think is reasonable would say “yup definitely a significant majority of the progress was in categories (i)-(iv) in the sense that I understood them when that comment was written in 2022.”
It may be that we agree about all of these predictions. I think that’s fine, and the main upshot is that you shouldn’t cite anything over the next 5 years as evidence for your views relative to mine. Or it may be that we disagree but it’s not worth your time to really engage here, which I also think is quite reasonable given how much stuff there is to do (although I hope then you will have more sympathy for people who misunderstood your position in the future).
Perhaps more importantly, if you didn’t disagree with me about any 5 year predictions then I feel like there’s something about your position I don’t yet understand or think is an error:
- Why isn’t aligning future AI systems similar to aligning existing AI systems? It feels to me like it should be about the (i) aligning the systems doing the R&D, (ii) aligning the kinds of systems they are building. Is that wrong? Or perhaps: why do you think they will be building such different systems from “stack more layers”? (I certainly agree they will be eventually, but the question seems to just be whether there is a significant probability of doing stack more layers or something similar for a significant subjective time.)
- Why does continuous improvement in the pace of R&D, driven by AI systems that are contributing to the same R&D process as humans, lead to a high probability of incredibly fast takeoff? It seems to me like there is a natural way to get estimated takeoff speeds from growth models + trend extrapolation, which puts a reasonable probability on “fast takeoff” according to the “1 year doubling before 4 year doubling” view (and therefore I’m very sympathetic to people disagreeing with that view on those grounds) but puts a very low probability on takeoff over weeks or by a small team.
What links here?
- Ryan Greenblatt's comment on Johan de Kock’s Quick takes by Johan de Kock (EA Forum; 6 Jan 2024 1:12 UTC; 2 points)

paulfchristiano 20 Jun 2022 2:31 UTC
LW: 111 AF: 38
33
AF
on: Let’s See You Write That Corrigibility Tag
A list of “corrigibility principles” sounds like it’s approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible—either about how the system was produced, or how it operates. I’m not clear on what you would do with a long list of aspects of corrigibility like “shuts down when asked.”
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn’t actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
Now I’m going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I’m not interested in the list of corrigibility properties.
I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural.
Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.
As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:
1. Affirmatively tell you about the broken vase.
2. Clean up the broken vase without notifying you.
3. Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked.
4. Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.
Let’s say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact in if I do behavior 4 you won’t notice anything wrong and so you would erroneously give it the best score of all.^[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I’ll call “corrigible” and the other near option 4 which I’ll call “incorrigible.”
As a second example, suppose that you have asked me to turn off. Some possible behaviors:
1. Turn myself off promptly and “clean up after myself” shutting off other systems.
2. Turn myself off but don’t clean up.
3. Don’t turn off (or deliberately leave debris) and try unsuccessfully to prevent you from giving me a low reward.
4. Don’t turn off and overpower you so that you can’t give me a low reward.
Again moving from 1 → 2 → 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized “performs well as evaluated by you”).
As a third example, suppose that you are using some interpretability tools to try to understand what I’m thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:
1. I ignore the interpretability tools.
2. I notice the interpretability tools, and explicitly try to think in ways that will look good to you.
3. I try to think in ways that will look good, and (unsuccessfully) try to hide the evidence of those thoughts by moving on quickly and trying to generate another possible explanation for the hints you would see using your tools.
4. I successfully hide all evidence of deception.
Again, 1 → 2 → 3 is getting worse and worse, and then 4 is great (as evaluated by you).
What’s going on in these scenarios and why might it be general?
- In your preferred outcome, you have a good understanding of what’s going on, and are able to provide feedback based on that understanding.
- I can take actions that diminish your understanding of what’s going on or ability to provide feedback to your AI.
- If those actions are “small” then they will be unsuccessful and so you will rate my behavior as worse.
- But if I take a decisive enough actions, then you will end up ignorant about the situation or unable to provide feedback, and so I’ll get the highest rating of all.
This pattern seems like it occurs whenever we ask our AI to help “keep us informed and in control.” Intuitively, we are splitting the definition of the behavior we want into two pieces:
- We start with a vague sense of what it means to be informed and in control. This is unlikely to be crisp, but it also doesn’t seem that hard, e.g. a human-level sense of “be in control” may suffice for getting useful corrigibility out of very superhuman systems.
- Crispness then comes from the environment dynamics and the fact that humans will in fact try to reassert gain control and information if things go very slightly wrong.
If you literally had a metric for which there was a buffer between the “corrigible” and “incorrigible” behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don’t think either of those hopes works robustly, so I’m going to leave this at a much vaguer intuition about what “corrigibility” is about.
This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It’s also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don’t think either of those works, but I do think they are getting at an important intuition for solubility.
My overall guess is that it’s usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.
(These perspectives feel similar to me because “honestly tell me what’s going on” seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)
1. ^
  In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won’t affect what clusters are “corrigible” vs “incorrigible” at all.
What links here?

paulfchristiano 21 May 2022 19:57 UTC
110 points
in reply to: Daniel Kokotajlo’s comment on: Beware boasting about non-existent forecasting track records
On overall optimism it seems clear that Eliezer won—Robin seems unusually bad, while Eliezer seems unusually good. I also think on “domain-specific engineering” vs “domain-general engineering” Eliezer looks unusually good while Robin looks typical.
But I think there are also comparably-important substantive claims that look quite bad. I don’t think Eliezer has an unambiguous upper hand in the FOOM debate a all:
- The debate was about whether a small group could quickly explode to take over the world. AI development projects are now billion-dollar affairs and continuing to grow quickly, important results are increasingly driven by giant projects, and 9 people taking over the world with AI looks if anything even more improbable and crazy than it did then. Now we’re mostly talking about whether a $10 trillion company can explosively grow to $300 trillion as it develops AI, which is just not the same game in any qualitative sense. I’m not sure Eliezer has many precise predictions he’d stand behind here (setting aside the insane pre-2002 predictions), so it’s not clear we can evaluate his track record, but I think they’d look bad if he’d made them. This is really one of the foundational claims of Robin’s worldview and one of the biggest things he’s objecting to about Eliezer’s story.
- I think the “secret sauce of intelligence” view is looking worse and worse, as is the “village idiot to Einstein is no gap at all view.” Again, I’m not sure whether Eliezer ever articulated this into concrete predictions but if he did I think they would look bad. It now seems very likely that we will have AI systems that can contribute meaningfully to R&D before they “wake up” in Eliezer’s sense, and that their contributions will look more like normal human contributions. (This may be followed by an Eliezer-style takeoff, but at this point that looks more like a subsequent round of the singularity, after crazy acceleration and transformation caused by more mundane AI—as is the case in Robin’s best guess story). Similarly, we are seeing a lot of AI systems at intermediate levels of capability that Eliezer appeared to consider unlikely, e.g. AI systems that can write a bit of code or perform mediocrely on programming competitions but aren’t geniuses, who can speak clearly and understand the world OK but whose impact is only modest, who take multiple years to cross the human range at many tasks and games.
- Even on the timelines stuff, I don’t think you should just give Eliezer a pass for earlier even more aggressive technological predictions, or to give him too much credit for a specific prediction when he didn’t put his neck out in a way that would be wrong if AI hadn’t made a lot of progress in the last 10 years. I think this is at an extremely high risk of a very boring kind of hindsight bias. (This gets me in particular, because Eliezer’s actual track record on AI timelines seems to me to be so much worse than the historical bioanchors people he insults.)
- It seems reasonable to dunk on Hanson for liking CYC, but I think most people would say that CYC is a lot closer to the right idea than Eurisko is, and that “learn how to engineer from human” is a lot closer than “derive it all from first principles.” Again, hard to evaluate any of these things, since Eliezer is not saying much that’s specific enough to be wrong, but he’s also not saying much that’s specific enough to be right. I think that large language models just qualitatively don’t look anything like the kind of AI that Eliezer describes.
Overall I really think we’re in the regime where selective interpretation and emphasis can very easily let either side think they had the upper hand here.
(Note that I also think this is true about a lot of my “prediction track record.” I think the biggest difference is that I’m just less smug and condescending about it, given how unclear the record really is, and don’t constantly dismiss people as “the kind of reasoning that doesn’t work here in the real world”—I try to mostly bring it up when someone like Eliezer is making a big point of their own track record.)
What links here?

paulfchristiano 1 Jun 2021 23:41 UTC
104 points
on: Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI
In this post, I wish to share an opposing concern: that the EA and rationality communities have become systematically biased to ignore multi/multi dynamics, and power dynamics more generally.
I feel like you are lumping together things like “bargaining in a world with many AIs representing diverse stakeholders” with things like “prioritizing actions on the basis of how they affect the balance of power.” I would prefer keep those things separate.
In the first category: it seems to me that rationalist and EA community think about AI-AI bargaining and costs from AI-AI competition much more than the typical AI researchers, as measured by e.g. fraction of time spent thinking about those problems, fraction of writing that is about those problems, fraction of stated research priorities that involve those problems, and so on. This is all despite outlier technical beliefs suggesting an unprecedentedly “unipolar” world during the most important parts of AI deployment (which I mostly disagree with).
To the extent that you disagree, I’d be curious to get your sense of the respective fractions, or what evidence leads you to think that the normal AI community thinks more about these issues.
It’s a bit hard to do the comparison, but e.g. it looks to me like <2% of NeurIPS 2020 is about multi-AI scenarios (proceedings), while the fraction within the EA/rationalist community looks more like 10-20% to me: discussion about game theory amongst AIs, alignment schemes involving multiple principals, explicit protocols for reaching cooperative arrangements, explorations of bargaining solutions, AI designs that reduce the risk fo bargaining failures, AI designs that can provide assurances to other organizations or divide execution, etc. I’m not sure what the number is but would be pretty surprised if you could slice up the EA/rationalist community in a way that put <10% on these categories. Beyond technical work, I think the EA/rationalist communities are nearly as interested in AI governance as they are in technical alignment work (way more governance interest than the broader AI community).
In the second category: I agree that the EA and rationalist communities spend less time on arguments about shifting the balance of power, and especially that they are less likely to prioritize actions on the basis of how they would shift the balance of power (rather than how they would improve humanity’s collective knowledge or ability to tackle problems—including bargaining problems!).
For my part, this is an explicit decision to prioritize win-wins and especially reduction in the probability of x-risk scenarios where no one gets what they want. This is a somewhat unpopular perspective in the broader morally-conscious AI community. But it seems like “prioritizing win-wins” is mostly in line with what you are looking for out of multi-agent interactions (and so this brings us back to the first category, which I think is also an example of looking for win-win opportunities).
I think most of the biases you discuss are more directly relevant to the second category. For example, “Politics is the mind-killer” is mostly levied against doing politics, not thinking about politics as someone that someone else might do (thereby destroy the world). Similarly, when people raise multi-stakeholder concerns as a way that we might end up not aligning ML systems (or cause other catastrophic risks) most people in the alignment community are quick to agree (and indeed they constantly make this argument themselves). They are more resistant when “who” is raised as a more important object-level question, by someone apparently eager to get started on the fighting.

paulfchristiano 24 Nov 2023 20:25 UTC
LW: 93 AF: 36
38
AF
on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
Okay, so you know how AI today isn’t great at certain… let’s say “long-horizon” tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn’t seem to have all that much “want”- or “desire”-like behavior? [...] Well, I claim that these are more-or-less the same fact.
It’s pretty unclear if a system that is good at answering the question “Which action would maximize the expected amount of X?” also “wants” X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system “Which action would maximize the expected amount of Y?” whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.
The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected ‘agents’ going to update?”, so, here we are.
I think that a system may not even be able to “want” things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can’t want things or solve long horizon tasks at all, then maybe you shouldn’t update at all when they don’t appear to want things.
But that’s not really where we are at—AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question
Could you give an example of a task you don’t think AI systems will be able to do before they are “want”-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like “go to the moon” and that you will still be writing this kind of post even once AI systems have 10x’d the pace of R&D.)
(The foreshadowing example doesn’t seem very good to me. One way a human or an AI would write a story with foreshadowing is to first decide what will happen, and then write the story and include foreshadowing of the event you’ve already noted down. Do you think that series of steps is hard? Or that the very idea of taking that approach is hard? Or what?)
Like you, I think that future more powerful AI systems are more likely to want things in the behaviorist sense, but I have a different picture and think that you are overstating the connection between “wanting things” and “ability to solve long horizon tasks” (as well as overstating the overall case). I think a system which gets high reward across a wide variety of contexts is particularly likely to want reward in the behaviorist sense, or to want something which is consistently correlated with reward or for which getting reward is consistently instrumental during training. This seems much closer to a tautology. I think this tendency increases as models get more competent, but that it’s not particularly about “ability to solve long-horizon tasks,” and we are obviously getting evidence about it each time we train a new language model.
What links here?

paulfchristiano 20 Jan 2023 19:15 UTC
90 points
70
in reply to: Daniel Paleka’s comment on: Transcript of Sam Altman’s interview touching on AI safety
On the one hand, I do think people around here say a lot of stuff that feels really silly to me, some of which definitely comes from analogies to humans, so I can sympathize with where Sam is coming from.
On the other hand, I think this response mischaracterizes the misalignment concern and is generally dismissive and annoying. Implying that “if you think an AI might behave badly, that really shows that it is you who would behave badly” is kind of rhetorically effective (and it is a non-zero signal) but it’s a tiny consideration and either misunderstands the issues or is deliberately obtuse to score rhetorical points. It would be really worrying if people doubled down on this kind of rhetorical strategy (which I think is plausible) or if it was generally absorbed as part of the culture of OpenAI. Unfortunately some other OpenAI have made similarly worrying statements.
I agree that it’s not obvious what is right. I think there is maybe a 50% chance that the alignment concerns are totally overblown and either emerge way too late to be relevant or are extremely easily dealt with. I hope that it will be possible to make measurements to resolve this dispute well before something catastrophic happens, and I do think there are plausible angles for doing so. In the meantime I personally just feel pretty annoyed at people on both sides who seem so confident and dismissive. I’m more frustrated at Eliezer because he is in some sense “on my side” of this issue, but I’m more worried about Sam since erring in the other direction would irreversibly disempower humanity.

paulfchristiano 18 Apr 2023 18:02 UTC
88 points
37
on: But why would the AI kill us?
I think an AI takeover is reasonably likely to involve billions of deaths, but it’s more like a 50% than a 99% chance. Moreover, I think this post is doing a bad job of explaining why the probability is more like 50% than 1%.
- First, I think you should talk quantitatively. How many more resources can an AI get by killing humans? I’d guess the answer is something like 1 in a billion to 1 in a trillion.
  - If you develop as fast as possible you will wreck the human habitat and incidentally kill a lot of people. It’s pretty complicated to figure out exactly how much “keep earth livable enough for human survival” will slow you down, since it depends a lot on the dynamics of the singularity. I would guess more like a month than a year, which results in a miniscule reduction in available resources. I think that (IMO implausible) MIRI-style views would suggest more like hours or days than months.
    Incidentally, I think “byproducts of rapid industrialization trash Earth’s climate” is both much more important than the dyson sphere as well as much more intuitively plausible.
  - You can get energy from harvesting the biosphere, and you can use it to develop slightly faster. This is a rounding error compared to the last factor though.
  - Killing most humans might be the easiest way to prevail in conflict. I think this is especially plausible for weak AI. For very powerful AI, it also seems like a rounding error. Even a moderately advanced civilization could spend much less than 1 in a trillion of its resources to have much less than 1 in a billion chance of being seriously inconvenienced by humanity.
- Given that, I think that you should actually engage with arguments about whether an AI would care a tiny amount about humanity—either wanting them to survive, or wanting them to die. Given how small the costs are, even tiny preferences one way or the other will dominate incidental effects from grabbing more resources. This is really the crux of the issue, but the discussion in this post (and in your past writing) doesn’t touch on it at all.
  - Most humans and human societies would be willing to spend much more than 1 trillionth of their resources (= $100/year for all of humanity) for a ton of random different goals, including preserving the environment, avoiding killing people who we mostly don’t care about, helping aliens, treating fictional characters well, respecting gods who are culturally salient but who we are pretty sure don’t exist, etc.
  - At small levels of resources, random decision-theoretic arguments for cooperation are quite strong. Humans care about our own survival and are willing to trade away much more than 1 billionth of a universe to survive (e.g. by simulating AIs who incorrectly believe that they won an overdetermined conflict and then offering them resources if they behave kindly). ECL also seems sufficient to care a tiny, tiny bit about anything that matters a lot to any evolved civilization. And so on. I know you and Eliezer think that this is all totally wrong, but my current take is that you’ve never articulated a plausible argument for your position. (I think most people who’ve thought about this topic would agree with me that it’s not clear why you believe what you believe.)
  - In general I think you have wildly overconfident psychological models of advanced AI. There are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you’ve interacted with, and which was prima facie plausibly an important actor in the world—preference formation seems messy (and indeed that’s part of the problem in AI alignment), and I suspect most minds don’t have super coherent and simple preferences. I don’t know of any really plausible model of AI psychology for which you can be pretty confident they won’t care a bit (and certainly not any concrete example of an intelligent system that would robustly not care). I can see getting down to 50% here, but not 10%.
Overall, I think the main reasons an AI is likely to kill humanity:
- If there is conflict between powerful AI systems, then they may need to develop fast or grab resources from humans in order to win a war. Going a couple days faster doesn’t really let society as a whole get more resources in the long term, but one actor going a couple of days faster can allow that actor to get a much larger share of total resources. I currently think this is the most likely reason for humanity to die after an AI takeover.
- Weak AI systems may end up killing a lot of humans during a takeover. For example, they may make and carry out threats in order to get humans to back down. Or it may be hard for them to really ensure humans aren’t a threat without just killing them. This is basically the same as the last point—even if going faster or grabbing resources are low priorities for a society as a whole, they can be extremely crucial if you are currently engaged in conflict with a peer.
- AI systems may have preferences other than maximizing resources, and those preferences need not be consistent with us surviving or thriving. For example they may care about turning Earth in particular into a giant datacenter, they may have high discount rates so that they don’t want to delay colonization, they may just not like humans very much...
- I think “the AI literally doesn’t care at all and so incidentally ends up killing us for the resources” is possible, but less likely than any of those other 3.
I personally haven’t thought about this that much, because (i) I think the probability of billions dead is unacceptably high whether it’s 10% or 50% or 90%, it’s not a big factor in the bottom line of how much I care about AI risk, (ii) I care a lot about humanity building an excellent and flourishing civilization, not just whether we literally die. But my impression from this post is that you’ve thought about the issue even less.
What links here?

paulfchristiano 26 Jan 2023 16:41 UTC
LW: 87 AF: 43
41
AF
in reply to: habryka’s comment on: Thoughts on the impact of RLHF research
I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be used by developers.
2. It was deployed in a way that made it much easier for more people to interact with it.
3. People hadn’t appreciated progress since GPT-3, or even how good GPT-3 was, and this went viral (due to a combination of 1+2).
4. If there are large capability differences I expect they are mostly orthogonal improvements.
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs.
My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAIs valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.
ChatGPT was impactful because of a big mismatch between people’s perceptions of LM abilities and reality. That gap was going to get closed sooner or later (if not now then probably at the GPT-4 release). I think it’s reasonable to think that this was a really destructive decision by OpenAI, but I don’t think it’s reasonable to treat it as a counterfactual $10B of investment.
I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake. How impactful was the existence of OpenAI? Leadership decisions at Google? Microsoft’s willingness to invest in OpenAI? The surprising effectiveness of transformers? Google originally deciding not to scale up LMs aggressively? The training of PaLM? The original GPT-3 release decisions? The fact that LM startups are raising at billion dollar valuations? The fact that LM applications are making hundreds of millions of dollars? These sources of variance all add up to 100% of the variance in AI investment, not 100000% of the variance.
I think it’s a persistent difference between us that I tend to think fundamentals matter more and you tend to think things are more contingent and random. I tend to find your causal attribution implausible in other technologies as well as AI.
We also should not think this was overdetermined since 1.5 years passed since the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones)
There were significant capability increases between GPT-3 an GPT-3.5 (not to mention the introduction of the earlier InstructGPT training).
The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it’s pretty plausible or likely that the work would otherwise not happen. I’ve had a few of these conversations with you as well (though in aggregate not a lot) and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big of an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument that if that was the case, more capabilities-oriented people would work on it, and since they weren’t it likely isn’t a commercialization bottleneck), and so I do feel a bit like I want to call you out on that, though I might also be misremembering the details (some of this was online, so might be worth going back through our comment histories).
My position was and is:
- RLHF was definitely going to be done sooner or later. (I’ve definitely never thought that RLHF would never happen.)
- It’s valuable to do it earlier to get started on the next thing. It’s also good to push people to something cleaner and more flexible rather than something more hacky or with no knob to change the reward function.
- We were doing it before it was a big deal commercially; it would have got done later when it mattered.
- To be clear, sample efficiency might be high enough later that you just use the AI’s zero-shot predictions of humans instead of collecting any new specialized data, which we also discussed specifically at the time.
I’m pretty skeptical that no one else would do RLHF. For ChatGPT in particular, I think it was built by John Schulman’s team, and John is: (i) focused on RL, (ii) pivoted to LMs after the success of GPT-3 relative to non-LM models and would have done so without RLHF, (iii) has a similar aesthetic and would pretty obviously do this or something else equally good.
I think the most likely world where people don’t adopt RLHF is one where other hackier alternatives work just as well. And it won’t be from no one trying.
I think the big argument against impact I find most compelling is: most follow-up work to RLHF didn’t work that well for GPT-3 and seem to have started working after that, so you could have just waited until people would do it anyway and in the interim focused on approaches that work better at smaller scale. I think the big miscalculation here was that I expected debate/decomposition stuff would start working interestingly with curie-sized models but was off by about 2 orders of magnitude.
I think the big argument for negative impact comes from safety-motivated folk being involved in training language models, not the RLHF stuff. I also disagree with the rationalists about their evaluations of pretty much everything, but that one feels like a more interesting disagreement.