# Thoughts on the impact of RLHF research

In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.

I’ll also clarify that I don’t think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between “RLHF progress” and “alignment progress.”

## Background on my involvement in RLHF work

Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:

• The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility /​ low-impact /​ etc, in some way.)

• Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:

• Evaluating consequences is hard.

• A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.

• It’s very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions.

In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren’t interested in these difficulties. On top of that, we obviously weren’t able to actually implement anything more fancy than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF work to try to facilitate and motivate work on fixes.

The history of my involvement:

• My first post on this topic was in 2015.

• When I started full-time at OpenAI in 2017 it seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here.

• I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). This seemed like the most important domain to study RLHF and alignment more broadly. In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn’t look promising enough to scale up.

• In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on language models, which completed its initial RLHF project in 2019. This required building significant infrastructure for scaling and working with language models, since this work was happening in parallel with GPT-2.

• Geoffrey later left for DeepMind and I took over the team. We wrote a follow-up paper polishing the result to the point where it seemed to be production-ready. Some people on the team started working on applying these results in production; Ryan Lowe ultimately led this effort which spun out into a different team (see paper). We also began working on simple settings where humans needed to use AI systems to solve subtasks (see paper). I left OpenAI at the start of 2021 to return to focusing on theory and Jan Leike took over the team.

## The case for a positive impact

Overall, I think that early work on RLHF had significant value:

• I think it is hard to productively work on more challenging alignment problems without first implementing basic solutions.

• “Solve real problems one at a time” seems like a good way to make progress and is how most fields work. Trying to justify research on problem X by saying “well we could do RLHF, but it wouldn’t fix speculative problem X” is uncompelling to most audiences if no one has implemented RLHF or observed problem X. it’s even worse if they have plenty of more mundane examples of unaligned behavior unrelated to X.

• Without implementing basic solutions it’s much harder to empirically validate your hypotheses about risks. We can make reasonable arguments about what failures will eventually occur with RLHF, but you can learn more by building the system and studying it. I think there are real, huge uncertainties here, and the safety community is taking weak arguments too seriously.

• A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient. Determining whether RLHF has fundamental difficulties seems like a good way to improve research prioritization.

• Many more complex alignment proposals involve the same technical ingredients as RLHF, especially learning a reward from an expensive overseer. I think that debate and recursive reward modeling in particular are plausible approaches to alignment for mildly superhuman systems, and they build directly on RLHF.

• Taking ideas from theory to practice helps build expertise about how to do so, which both informs alignment research and facilitates future implementation.

• For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps—almost all of these implementation difficulties do improve with empirical feedback.

• Moreover, this kind of expertise is directly relevant when implementing future alignment proposals even if they are very different from RLHF. The implicit alternative seems to be an alignment community that deliberately avoids any problems that would be helpful for making AI systems useful, and potentially avoids doing any engineering work at all, creating predictable and potentially huge problems with implementation.

## The case for a negative impact

People in the safety community make some arguments that research on RLHF has costs larger than these benefits. I don’t currently find these arguments persuasive:

• RLHF (and other forms of short-term “alignment” progress) make AI systems more useful and profitable, hastening progress towards dangerous capabilities.

• RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.

• Trying to delay AI progress by avoiding making AI systems better at doing what people want feels holistically unwise. RLHF does not appear to increase the kind of capabilities that are directly relevant to risk, but instead has an indirect effect via making AI systems more useful. My intuitive reaction is similar to a proposal to lobby against improvements to the tax code so that taxes will be more painful and the public will be more opposed to new taxes. It might be OK if your goal is to reduce tax burden, but probably counterproductive for reducing the social cost of taxes.

• Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong. Similarly, to the extent you successfully slow scaling, you are then in for faster scaling later from a lower initial amount of spending—I think it’s significantly better to have a world where TAI training runs cost $10 billion than a world where they cost$1 billion. A key background view is that the great majority of effective safety work will come when people are working with systems that are much closer to posing a risk, e.g. so they can actually exhibit and study interesting forms of reward hacking and deceptive alignment. Overall in expectation I think these effects claw back most of the benefits of slowing down progress by avoiding RLHF.

• RLHF “covers up problems” so that you can’t or won’t fix them in other ways.

• RLHF lets you produce models that don’t do bad-looking things, but there are some things which look fine but are actually bad. So you might worry that RLHF makes problems harder to study by covering up their symptoms. But we can (and do) still train models without RLHF, or using a weak overseer where outputs can be validated by stronger overseers. It seems that RLHF makes it much easier to produce realistic examples of problems—both because it facilitates settings with the kind of realistic failure modes you actually want to study (namely overpowering or misleading overseers) and because without RLHF there are going to be a thousand other hacks to try first to fix the problems.

• You might argue that RLHF gives people a way to cover up problems and so lets them avoid fixing them in deeper ways, or gives them a “false sense of security.” But in practice if people run into problems that can be fixed with RLHF, it looks like they will just do RLHF later (which is getting easier and easier over time). And in practice most of the problems that can be addressed with RLHF can be addressed in other hackier ways as well. This potential objection seems to rest on an unreasonably optimistic model about how superficial problems force people into pursuing deep fixes.

• RLHF is less safe than imitation or conditioning generative models.

• If we’re considering the danger posed by a model of a fixed level of usefulness, I think this is probably false though it’s a complicated question and I’m uncertain. The AI safety community makes various informal arguments about this which I find unpersuasive (though I mostly haven’t seen them laid out carefully). I suspect the differences are small and require empirical investigation. (While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.

• If RLHF poses distinctive risks, we are overwhelmingly more likely to avoid those risks by understanding them rather than by hoping no one ever implements RLHF. It’s unrealistic and deeply unstable to hope that no one uses RLHF because they didn’t think of it.

• This entire alignment approach is impractical, and therefore all the arguments about “taking the first step in the right direction” are wrong. On top of that working on RLHF obfuscates that fact and dilutes what should be a robust community consensus.

• To the extent this is true, I think it would be a pretty powerful argument against RLHF (largely because it implies that most of the benefits aren’t real). But I don’t agree that the approach can’t work. I’ve talked about this a lot with people, but feel like the arguments just aren’t holding together. The two weak links are on (i) arguments about the timing of difficulties relative to e.g. radically superhuman models—almost all of the arguments kick in after human level and it’s just not clear how far after, (ii) the probability of deceptive alignment emerging despite simple countermeasures, which I think of as a completely open empirical question—existing arguments are fine for arguing plausibility, but definitely can’t get you to 90% rather than 50%, (iii) the feasibility of fundamental improvements to RLHF.

Overall, I think it was valuable to use RLHF to fix the kind of basic alignment problems that are ubiquitous with pre-trained models. I think it has had a real impact facilitating work on more fundamental challenges, and helped move the community one step closer towards the kind of alignment solutions I expect to ultimately be successful.

## Future work

I remain excited about “straightforward” approaches to improving RLHF, like devising better feedback (using combinations of human and AI work) and improving robustness by adversarial training. I think this work will continue to make ML systems more useful in practice, and so will be subject to the same kinds of objections as above. I still tentatively think this work is net positive and don’t find arguments against persuasive.

I think this follow-up research will also not need to solve the “fundamentally confusing” problems for a long time, but that solving tractable problems gives you a good chance of aligning modestly superhuman AI and facilitates future work on the remaining more challenging problems.

That said, I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive. Research should be justified by an argument that it actually helps address important failures. Here are some types of work in this space that I’m particularly excited about:

• Work that addresses robustness in cases where we cannot train on deployment examples, or where we care about failure rates that are small relative to fine-tuning dataset size. In practice this would happen if failures are very high-stakes, but we can also study synthetic domains where we artificially aim at very low datasets.

• Training AI systems to give more correct answers in domains where human overseers can’t easily judge results and there is no other source of end-to-end feedback during training. That may involve giving humans better tools, studying and improving generalization from domains that do have feedback, or other methods.

• Anything that addresses clear examples of alignment failures, for which we have good reasons to believe that models “know” things they aren’t telling us, or “know” what we want them to do but nevertheless do something else. Many of these will fall into the first two categories, but it’s also interesting to fix more mundane failures (e.g. obvious untruths) if they can be clearly identified as alignment problems.

• Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

• RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line.

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.

My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAIs valuation, should be modeled as the equivalent of something on the order of $10B of investment into AI capabilities research, completely in addition to the gains from GPT-3. And my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3. We also should not think this was overdetermined since 1.5 years passed since the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones), and no other research lab focused on capabilities had set up their own RLHF pipeline (except Anthropic, which I don’t think makes sense to use as a datapoint here, since it’s in substantial parts the same employees). I have been trying to engage with the actual details here, and indeed have had a bunch of arguments with people over the last 2 years where I have been explicitly saying that RLHF is pushing on commercialization bottlenecks based on those details, and people believing this was not the case was the primary crux on whether RLHF was good or bad in those conversations. The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it’s pretty plausible or likely that the work would otherwise not happen. I’ve had a few of these conversations with you as well (though in aggregate not a lot) and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big of an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument that if that was the case, more capabilities-oriented people would work on it, and since they weren’t it likely isn’t a commercialization bottleneck), and so I do feel a bit like I want to call you out on that, though I might also be misremembering the details (some of this was online, so might be worth going back through our comment histories). • I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF. I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks). I think the much more important differences are: 1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be used by developers. 2. It was deployed in a way that made it much easier for more people to interact with it. 3. People hadn’t appreciated progress since GPT-3, or even how good GPT-3 was, and this went viral (due to a combination of 1+2). 4. If there are large capability differences I expect they are mostly orthogonal improvements. I think the effect would have been very similar if it had been trained via supervised learning on good dialogs. My current best guess is that Chat-GPT alone, via sparking an arms-race between Google and Microsoft, and by increasing OpenAIs valuation, should be modeled as the equivalent of something on the order of$10B of investment into AI capabilities research, completely in addition to the gains from GPT-3.

ChatGPT was impactful because of a big mismatch between people’s perceptions of LM abilities and reality. That gap was going to get closed sooner or later (if not now then probably at the GPT-4 release). I think it’s reasonable to think that this was a really destructive decision by OpenAI, but I don’t think it’s reasonable to treat it as a counterfactual $10B of investment. I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake. How impactful was the existence of OpenAI? Leadership decisions at Google? Microsoft’s willingness to invest in OpenAI? The surprising effectiveness of transformers? Google originally deciding not to scale up LMs aggressively? The training of PaLM? The original GPT-3 release decisions? The fact that LM startups are raising at billion dollar valuations? The fact that LM applications are making hundreds of millions of dollars? These sources of variance all add up to 100% of the variance in AI investment, not 100000% of the variance. I think it’s a persistent difference between us that I tend to think fundamentals matter more and you tend to think things are more contingent and random. I tend to find your causal attribution implausible in other technologies as well as AI. We also should not think this was overdetermined since 1.5 years passed since the release of GPT-3 and the release of Chat-GPT (with some updates to GPT-3 in the meantime, but my guess is no major ones) There were significant capability increases between GPT-3 an GPT-3.5 (not to mention the introduction of the earlier InstructGPT training). The crux was importantly not that other people would do the same work anyways, since people at the same time also argued that their work on RLHF was counterfactually relevant and that it’s pretty plausible or likely that the work would otherwise not happen. I’ve had a few of these conversations with you as well (though in aggregate not a lot) and your take at the time was (IIRC) that it seemed quite unlikely that RLHF would have as big of an effect as it did have in the case of Chat-GPT (mostly via an efficiency argument that if that was the case, more capabilities-oriented people would work on it, and since they weren’t it likely isn’t a commercialization bottleneck), and so I do feel a bit like I want to call you out on that, though I might also be misremembering the details (some of this was online, so might be worth going back through our comment histories). My position was and is: • RLHF was definitely going to be done sooner or later. (I’ve definitely never thought that RLHF would never happen.) • It’s valuable to do it earlier to get started on the next thing. It’s also good to push people to something cleaner and more flexible rather than something more hacky or with no knob to change the reward function. • We were doing it before it was a big deal commercially; it would have got done later when it mattered. • To be clear, sample efficiency might be high enough later that you just use the AI’s zero-shot predictions of humans instead of collecting any new specialized data, which we also discussed specifically at the time. I’m pretty skeptical that no one else would do RLHF. For ChatGPT in particular, I think it was built by John Schulman’s team, and John is: (i) focused on RL, (ii) pivoted to LMs after the success of GPT-3 relative to non-LM models and would have done so without RLHF, (iii) has a similar aesthetic and would pretty obviously do this or something else equally good. I think the most likely world where people don’t adopt RLHF is one where other hackier alternatives work just as well. And it won’t be from no one trying. I think the big argument against impact I find most compelling is: most follow-up work to RLHF didn’t work that well for GPT-3 and seem to have started working after that, so you could have just waited until people would do it anyway and in the interim focused on approaches that work better at smaller scale. I think the big miscalculation here was that I expected debate/​decomposition stuff would start working interestingly with curie-sized models but was off by about 2 orders of magnitude. I think the big argument for negative impact comes from safety-motivated folk being involved in training language models, not the RLHF stuff. I also disagree with the rationalists about their evaluations of pretty much everything, but that one feels like a more interesting disagreement. • I think the effect would have been very similar if it had been trained via supervised learning on good dialogs I don’t currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-grained ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts. For ChatGPT in particular, I think it was built by John Schulman’s team I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given the safety motivation by John, and my best guess being that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, that John would have indeed not worked on it. Seeing RLHF teams in other organizations not directly downstream of your organizational involvement, or not quite directly entangled with your opinion, would make a bigger difference here. I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake I don’t think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think$10B being downstream of that isn’t that crazy. The product has a user base not that different from other $10B products, and a growth rate to put basically all of them to shame, so I don’t think a$10B effect from Chat-GPT seems that unreasonable. There is only so much variance to go around, but Chat-GPT is absolutely massive in its impact.

• I don’t currently think this is the case, and seems like the likely crux. In-general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem with previous chat bot attempts.

I bet they did generate supervised data (certainly they do for InstructGPT), and supervised data seems way more fine-grained in what you are getting the AI to do. It’s just that supervised fine-tuning is worse.

I think the biggest problem with previous chat-bot attempts is that the underlying models are way way weaker than GPT-3.5.

I don’t think so, and have been trying to be quite careful about this. Chat-GPT is just by far the most successful AI product to date, with by far the biggest global impact on AI investment and the most hype. I think $10B being downstream of that isn’t that crazy. The product has a user base not that different from other$10B products, and a growth rate to put basically all of them to shame, so I don’t think a $10B effect from Chat-GPT seems that unreasonable. There is only so much variance to go around, but Chat-GPT is absolutely massive in its impact. This still seems totally unreasonable to me: • How much total investment do you think there is in AI in 2023? • How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.) • How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs? I think it’s unlikely that the reception of ChatGPT increased OpenAI’s valuation by$10B, much less investment in OpenAI, even before thinking about replaceability. I think that Codex, GPT-4, DALL-E, etc. are all very major parts of the valuation.

I also think replaceability is a huge correction term here. I think it would be more reasonable to talk about moving how many dollars of investment how far forward in time.

I find a comparison with John Schulman here unimpressive if you want to argue progress on this was overdetermined, given the safety motivation by John, and my best guess being that if you had argued forcefully that RLHF was pushing on commercialization bottlenecks, that John would have indeed not worked on it.

I think John wants to make useful stuff, so I doubt this.

• How much total investment do you think there is in AI in 2023?

My guess is total investment was around the $200B -$500B range, with about $100B of that into new startups and organizations, and around$100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don’t know what fraction of Google’s revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc. How much variance do you think there is in the level of 2023 investment in AI? (Or maybe whatever other change you think is equivalent.) Variance between different years depending on market condition and how much products take off seems like on the order of 50% to me. Like, different years have pretty hugely differing levels of investment. My guess is about 50% of that variance is dependent on different products taking off, how much traction AI is getting in various places, and things like Chat-GPT existing vs. not existing. So this gives around$50B - $125B of variance to be explained by product-adjacent things like Chat-GPT. How much influence are you giving to GPT-3, GPT-3.5, GPT-4? How much to the existence of OpenAI? How much to the existence of Google? How much to Jasper? How much to good GPUs? Existence of OpenAI is hard to disentangle from the rest. I would currently guess that in terms of total investment, GPT-2 → GPT-3 made a bigger difference than GPT-3.5 → Chat-GPT, but both made a much larger difference than GPT-3 → GPT-3.5. I don’t think Jasper made a huge difference, since its userbase is much smaller than Chat-GPT, and also evidently the hype from it has been much lower. Good GPUs feels kind of orthogonal. We can look at each product that makes up my 50% of the variance to be explained and see how useful/​necessary good GPUs were for its development, and my sense is for Chat-GPT at least the effect of good GPUs were relatively minor since I don’t think the training to move from GPT-3.5 to Chat-GPT was very compute intensive. I would feel fine saying expected improvements in GPUs are responsible for 25% of the 50% variance (i.e. 17.5%) if you chase things back all the way, though that again feels like it isn’t trying to add up to 100% with the impact from “Chat-GPT”. I do think it’s trying to add up to 100% with the impact from “RLHF’s effect on Chat-GPT”, which I claimed was at least 50% of the impact of Chat-GPT in-particular. In any case, in order to make my case for$10B using these numbers I would have to argue that between 20% and 8% of the product-dependent variance in annual investment into AI is downstream of Chat-GPT, and indeed that still seems approximately right to me after crunching the numbers. It’s by far the biggest AI product of the last few years, it is directly credited with sparking an arms race between Google and Microsoft, and indeed even something as large as 40% wouldn’t seem totally crazy to me, since these kinds of things tend to be heavy-tailed, so if you select on the single biggest thing, there is a decent chance you underestimate its effect.

• I didn’t realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of$200-500B, so like +2-5%, I’m probably happy to agree (and I also think it had other accelerating effects beyond that).

I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc.. If you mean +2-5% investment in a single year then I would guess the impact is < 1 week.

I haven’t thought about it much, but my all things considered estimate for the expected timelines slowdown if you just hadn’t done the ChatGPT release is probably between 1-4 weeks.

Is that the kind of effect size you are imagining here? I guess the more important dynamic is probably more people entering the space rather than timelines per se?

One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is $100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for +-$10B effect sizes after the fact). • I didn’t realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by$10B out of $200-500B, so like +2-5%, I’m probably happy to agree (and I also think it had other accelerating effects beyond that). Makes sense, sorry for the miscommunication. I really didn’t feel like I was making a particularly controversial claim with the$10B, so was confused why it seemed so unreasonable to you.

I do think those $10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don’t know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I’ve historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number). That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in-expectation by something closer to 8-16 weeks, which isn’t enormously far away from yours, though still a good bit higher. And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs which have so far had a lot of independence (which to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze like institutions with tons of momentum). In-general the dynamics of Google and Microsoft racing towards AGI sure is among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously. One thing worth pointing out in defense of your original estimate is that variance should add up to 100%, not effect sizes, so e.g. if the standard deviation is$100B then you could have 100 things each explaining ($10B)^2 of variance (and hence each responsible for +-$10B effect sizes after the fact).

Oh, yeah, good point. I was indeed thinking of the math a bit wrong here. I will think a bit about how this adjusts my estimates, though I think I was intuitively taking this into account.

• And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off.

Maybe—but Microsoft and Google are huge organizations, and huge organizations have an incentive to push for regulation that imposes costs that they can pay while disproportionately hampering smaller competitors. It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.

A sudden wave of destabilizing AI breakthroughs—with DALL-E/​Midjourney/​Stable Diffusion suddenly disrupting art and Chat-GPT who-knows-how-many-things—can also make people on the street concerned and both more supportive of AI regulation in general, as well as more inclined to take AGI scenarios seriously in particular. I recently saw a blog post from someone speculating that this might cause a wide variety of actors—M & G included—with a desire to slow down AI progress to join forces to push for widespread regulation.

• It seems plausible to me that both M & G might prefer a regulatory scheme that overall slows down progress while cementing their dominance, since that would be a pretty standard regulatory-capture-driven-by-the-dominant-actors-in-the-field kind of scenario.

Interesting. Where did something like this happen?

• I asked Chat-GPT and one of the clearest examples it came up with is patent trolling by large pharmaceutical companies. Their lobbying tends to be far more focused on securing monopoly rights to their products for as long as possible than anything related to innovation.

Other examples:

• Automakers lobbying for restrictive standards for potential market disruptors like electric or self-driving vehicles

• Telecoms lobbying against Net Neutrality

• Taxi companies lobbying against ridesharing startups

• Tech companies lobbying for intellectual property and data privacy regulations that they have better legal/​compliance resources to handle

• Good GPUs feels kind of orthogonal.

IMO it’s much easier to support high investment numbers in “AI” if you consider lots of semiconductor /​ AI hardware startup stuff as “AI investments”. My suspicion is that while GPUs were primarily a crypto thing for the last few years, the main growth outlook driving more investment is them being an AI thing.

• I’d be interested to know how you estimate the numbers here, they seem quite inflated to me.

If 4 big tech companies were to invest $50B each in 2023 then, assuming average salary as$300k and 2:1 capital to salary then investment would be hiring about 50B/​900K = 55,000 people to work on this stuff. For reference the total headcount at these orgs is roughly 100-200K.

50B/​yr is also around 25-50% of the size of the total income, and greater than profits for most which again seems high.

Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023. Agree with paul’s comment above that timeline shifts are the most important variable. • Supervised data seems way more fine-grained in what you are getting the AI to do. It’s just that supervised fine-tuning is worse. My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don’t want it to do. • And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don’t want it to do. It depends a lot on the use case. When it comes to what I’m doing with ChatGPT, I care more about the quality of the best answer when I generate five answers to a prompt than I care about the quality of the worst answer. I can choose the best answer myself and ignore the others. Many use cases have ways to filter for valuable results either automatically or by letting a human filter. • I think it’s unlikely that the reception of ChatGPT increased OpenAI’s valuation by$10B, much less investment in OpenAI, even before thinking about replaceability.

Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by 10B, not that it increased investment into specifically OpenAI. Companies generally don’t have perfect mottes. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry. • I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks). Ok, I think we might now have some additional data on this debate. It does indeed look like to me that Sydney was trained with the next best available technology after RLHF, for a few months, at least based on Gwern’s guesses here: https://​​www.lesswrong.com/​​posts/​​jtoPawEhLNXNxvgTT/​​bing-chat-is-blatantly-aggressively-misaligned?commentId=AAC8jKeDp6xqsZK2K As far as I can tell this resulted in a system with much worse economic viability than Chat-GPT. I would overall describe Sydney as “economically unviable”, such that if Gwern’s story here is correct, the difference between using straightforward supervised training on chat transcripts and OpenAIs RLHF pipeline is indeed the difference between an economically viable and unviable product. There is a chance that Microsoft fixes this with more supervised training, but my current prediction is that they will have to fix this with RLHF, because the other technological alternatives are indeed no adequate substitutes from an economic viability perspective, which suggests that the development of RLHF did really matter a lot for this. • Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in each will be once 4chan & the coomers get their hands on it… It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.

• Yeah, this is basically my point. Not sure whether whether you are agreeing or disagreeing. I was specifically quoting Paul’s comment saying “I’ve seen only modest qualitative differences” in order to disagree and say “I think we’ve now seen substantial qualitative differences”.

We have had 4chan play around with Chat-GPT for a while, with much less disastrous results than what happened when they got access to Sydney.

It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.

I wish that this not being news to anyone here was true but this does not currently seem true to me. But doesn’t seem worth going into.

• I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul’s comparison: retrieval especially was an interesting dynamic.

• For what it’s worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don’t see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.

• Yep, I think it’s pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?

I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT.

• Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.

The report, citing data from analytics firm Similarweb, said an average of about 13 million unique visitors had used ChatGPT per day in January, more than double the levels of December.

“In 20 years following the internet space, we cannot recall a faster ramp in a consumer internet app,” UBS analysts wrote in the note.

I had some decent probability on this outcome but I have increased my previous estimate of the impact of Chat-GPT by 50%, since I didn’t expect something this radical (“the single fastest growing consumer product in history”).

• I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake.

That’s not always the wrong thing to do—the sum of counterfactual impacts of the actions of many actors often sums up to greater than their total combined impact. A simple example would be if two co-founders of an impactful company wouldn’t have been a founder without the other. Then the sum of their counterfactual impacts is equivalent to 2 times the total impact of the company.

While I don’t have an opinion on this particular case, you could imagine that additional AI investment may not have happened if either of the following were true:

1. The original RLHF proof of concept from OpenAI didn’t happen—because Google’s leadership wouldn’t have the incentive for further investment.

2. If Google’s leadership were different—because they may not have thought to invest more money in AI.

• my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don’t think this is right—the main hype effect of chatGPT over previous models feels like it’s just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they’d seem similarly cool to a random journalist /​ VC, and generate similar excitement.

• I don’t think this is right—the main hype effect of chatGPT over previous models feels like it’s just because it was in a convenient chat interface that was easy to use and free.

I don’t have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitatively better at dialogue. Of course, I could have simply been prompting Davinci-002 poorly, but overall I’m quite skeptical that the main reason for ChatGPT hype was that it had a more convenient chat interface than GPT-3.

• I’ve felt that ChatGPT was roughly on par with text-davinci-003, though much more annoying and with a worse interface.

• That makes sense. However, Davinci-003 came out just a few days prior to ChatGPT. The relevant transition was from Davinci-002 to Davinci-003/​ChatGPT.

• Yep, and text-davinci-002 was trained with supervised finetuning /​ written demos, while 003 was trained with RLHF via PPO. Hypothetically, the clearest illustration of RLHF’s capabilities gains should be from comparing 002 to 003. However, OpenAI could have also used other methods to improve 003, such as with Transcending Scaling Laws with 0.1% Extra Compute.

Our models generally used the best available datasets at the time of training, and so different engines using the same training methodology might be trained on different data.

So I guess 003 could also have different base pretraining data?

• [edit: this says the same thing as Quintin’s sibling comment]

Important context for those who don’t know it: the main difference between text-davinci-002 and text-davinci-003 is that the latter was trained with PPO against a reward model, i.e. RLHF as laid out in the InstructGPT paper. (Source: OpenAI model index.)

In more detail, text-davinci-002 seems to have been trained via supervised fune-tuning on the model outputs which were rated highest by human reviewers (this is what the model index calls FeedME). The model index only says that text-davinci-003 was trained via PPO against a reward model, but this was after SFT on human demonstrations, and might have also been after FeedME training.

(Aside: the terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.)

• The terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.

Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.

• This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)

• Got a source for that? This seems like an odd way to use the term, in particular because with supervised fine-tuning there’s no credit assignment over time, and so it doesn’t train the model to actually aim towards high-reward states.

• To be clear, I’m not classifying all uses of SFT as RL (for example, I would not call SFT on human expert demonstrations RL). It’s specifically SFT on highly-rated model outputs—i.e. having the model produce a bunch of rollouts, labeling them with rewards, training the model to imitate the top-rewarded rollouts, and repeating—which I’m calling RL here. Note that this training process does aim the model towards high-reward, and is very similar to the online decision transformer, which is typically classed as an RL technique.

So I still feel that the way I used the term “RL” was in line with normal usage. But if people still disagree now that I’ve explained myself in more detail, I’d be interested in hearing why.

• Two central features of RL in my mind, which distinguish it from imitation learning:

• Receiving reward in a given state make the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there’s efficient credit assignment through time.

• (In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.

Q-learning is therefore a central example of RL, alongside actor-critic algorithms.

Online REINFORCE has very dumb credit assignment, but it does eventually leads to mode collapse to highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.

Online high-rated SFT also has poor credit assignment, in a similar way as online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don’t call RL.

By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.

Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).

• I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.

To make sure we’re on the same page, suppose we’re in an environment with a state which is high reward, and suppose that there are two ways to get to state : via the two trajectories and . Suppose further that historically the agent has only navigated to this state via the former trajectory .

I agree that if the agent was trained via REINFORCE and finds itself in state that it might not know to take action (because it’s only been reinforced to take action from state , and not to reach state ; and also because it might not know that would transition it to state ).

But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that is large, only that is large.

In either the REINFORCE or the Q-learning case, once the agent sees a trajectory , it will make an update towards taking action from state , but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.

I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)

So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.

(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)

• I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:

s2 → s3

s1 → s4

s1 → s2 → s5

Then q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.

I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)

It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.

• Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).

• Flagging that I would find that use of the term super confusing.

• To throw in another perspective, I’ve been working with the OpenAI API models most days of the week for the past year or so. For my uses, the step-change in quality came from moving from base davinci to text-davinci-002, whereas the improvements moving from that to text-davinci-003 were decidedly less clear.

• I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it’s very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.

• Seems like you’re implying that davinci is the base model for 002 and 003. That’s not the case; davinci has one base model (GPT-3) and then 002 and 003 share a different base model (GPT-3.5).

• Fair. I think the crucial question to Ajeya & Matthew’s discussion of “Why the hype now?” is exactly how much worse the non-RLHF models that had been available since at least last March (davinci, code-davinci-002, text-davinci-002) actually were than the RLHF models made available just recently (text-davinci-003 and ChatGPT’s underlying model). I stand by the opinion that the besides the new chat stuff, most of the improvement happened within the old cohort, rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.

• People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it’s also possible that CharacterAI’s base models are RLHF’d to be consistent roleplayers.

• Would love to learn more about the model(s) behind CharacterAI. Anyone know if there’s publicly available information on them?

• I think the part where it has a longer memory/​coherence feels like a major shift (having gotten into the flow of experimenting with GPT3 in the month prior to chatGPT, I felt like the two interfaces were approximately as convenient)

I don’t know what mechanism was used to generate the longer coherence though.

• Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.

I want to explain my position on a couple points in particular though—they would’ve been a central focus of what I imagined my post to be, points around which I’ve been thinking a lot recently. I haven’t talked to a lot of people about this explicitly so I don’t have high credence in my take, but it seems at least worth clarifying.

RLHF is less safe than imitation or conditioning generative models.

My picture on why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux that the intelligence we’re wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe is, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) while moving far away from the default world state.

So here’s one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF’d (so to speak), have different world priors in ways that aren’t really all that intuitive (see Janus’ work on mode collapse, or my own prior work which addresses this effect in these terms more directly since you’ve probably read the former). We get a posterior that doesn’t have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we’re using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model’s simulations because it’ll ripple through its various abstractions in ways specific to how they’re structured inside the model, which are probably pretty different from how we structure our abstractions and how we make predictions about how changes ripple out.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don’t have the useful safety measures implied by being weighted by a true approximation of our world.

Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense. My guess is that this explains to an extent the results in that paper—RLHF’d models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me because we’re making agency more accessible to and powerful from ordinary prompting, more powerful agency is inherently tied to properties we don’t really want in simulacra, and said agency of a sort is sampled from a not-so-familiar ontology to boot.

(Only skimmed the post for now because I’m technically on break, it’s possible I missed something crucial).

• I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don’t have the useful safety measures implied by being weighted by a true approximation of our world.

If predicting webtext is a good way to get things done, people can do that. But probably it isn’t, and so people probably won’t do that unless you give them a good reason.

That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.

The main way I can see it going is that you can condition the webtext model on other things like “there is a future AGI generating this text...” or “What action leads to consequence X?” But I think those things are radically less safe than predicting demonstrations in the lab, and lead to almost all the same difficulties if they in fact improve capabilities.

Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”? There is one form of this I agree with—models trained with RLHF will likely try to produce outputs humans rate highly, including by e.g. producing outputs that drive humans insane to give them a good rating or whatever. But overall people seem to be reacting to some different more associative reason for concern that I don’t think makes sense (yet).

Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense.

So does conditioning the model to get it to do something useful. Also I think “focuses the model’s computation on agency in some sense” is probably too vague to be a helpful way to think about what’s going on—it seems like it leads the model to produce outputs that it thinks would have certain kinds of consequences, or that imitate the kinds of heuristics and processes used by consequentialists in the dataset. This happens quite a lot when you continue webtext, since it’s all written by consequentialists.

• I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature.

I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the “narrowing the prior” frame rather intuitive in this context.

That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.

I agree that everything I said above qualitatively applies to supervised fine-tuning as well. As I mentioned in another comment, I don’t expect the RL part to play a huge role until we get to wilder applications. I’m worried about RLHF more because I expect it to be scaled up a lot more in the future, and plausibly does what fine-tuning does better (this is just based on how more recent models have shifted to using RLHF instead of ordinary fine-tuning).

I don’t think “predict human demonstrators” is how I would frame the relevant effect from fine-tuning. More concretely, what I’m picturing is along the lines of: If you fine-tune the model such that continuations in a conversation are more polite/​inoffensive (where this is a stand-in for whatever “better” rated completions are), then you’re not learning the actual distribution of the world anymore. You’re trying to learn a distribution that’s identical to ours except in that conversations are more polite. In other words, you’re trying to predict “X, but nicer”.

The problem I see with this is that you aren’t just affecting this in isolation, you’re also affecting the other dynamics that these interact with. Conversations in our world just aren’t that likely to be polite. Changing that characteristic ripples out to change other properties upstream and downstream of that one in a simulation. Making this kind of change seems to lead to rather unpredictable downstream changes. I say seems because -

The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.

- This is interesting. Could you elaborate on this? I think this might be a crux in our disagreement.

Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”?

I don’t think the safety loss (at least the part I’m referring to here) comes from the first-order effects of predicting something else. It’s the second-order effects on GPT’s prior at large from changing a few aspects that seems to have hard-to-predict properties and therefore worrying to me.

So does conditioning the model to get it to do something useful.

I agree. I think there’s a qualitative difference when you’re changing the model’s learned prior rather than just conditioning, though. Specifically, where ordinary GPT has to learn a lot of different processes at relatively similar fidelity to accurately simulate all the different kinds of contexts it was trained on, fine-tuned GPT can learn to simulate some kinds of processes with higher fidelity at the expense of others that are well outside the context of what it’s been fine-tuned on.

(As stated in the parent, I don’t have very high credence in my stance, and lack of accurate epistemic status disclaimers in some places is probably just because I wanted to write fast).

• I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.

Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.

• One consequence downstream of this that seems important to me in the limit:

1. Nonconditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you.

2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you.

I think having that one extra layer of buffer provided by 2 is actually very valuable. A goal agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.

• Just a note here: I would not interpret fine-tuned GPTs as still “predicting” tokens. Base models predict tokens by computing a probability distribution conditional on the prompt, but for fine-tuned models this distribution no longer represents probabilities, but some “goodness” relative to the fine-tuning, how good the continuation is. Tokens with higher numbers are then not necessarily more probable continuations of the prompt (though next token probability may also play a role) but overall “better” in some opaque way. We hope that what the model thinks is a better token for the continuation of the prompt corresponds to the goals of being helpful, harmless and honest (to use the Anthropic terminology), but whether the model has really learned those goals, or merely something which looks similar, is ultimately unknown.

So RLHF (and equally supervised fine-tuning) also leads to a lack of interpretability. It is unknown what exactly an instruction model like ChatGPT or text-davinci-003 optimizes for. In contrast to this, we know pretty exactly what a base model optimized for: Next token prediction.

• You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.

You don’t know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.

• This relates to what you wrote in the other thread:

I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.

It think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case.

Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn’t “believe them” in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact “believes” those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown.

So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters seem problems of highly uneven difficulty and probability of misalignment.

• Agreed, though I do find framing them as a warped predictor helpful in some cases. In principle, the deviation from the original unbiased prediction over all inputs should include within it all agentic behaviors, and there might exist some way that you could extract goals from that bias vector. (I don’t have anything super concrete here and I’m not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it’s something I’ve thought about poking.)

• What do you mean when you say the model is or is not “fighting you”?

• I mean a model “fights” you if the model itself has goals and those goals are at odds with yours. In this context, a model cannot “fight” you if it does not have goals. It can still output things which are bad for you, like an agentic simulacrum that does fight you.

I suspect effective interventions are easier to find when dealing with a goal agnostic model simulating a potentially dangerous agent, compared to a goal-oriented model that is the potentially dangerous agent.

• In both cases the model produces actions that are expected to have certain kinds of effects. Could you spell out what kind of “fighting” happens, or what kind of “intervention” is possible when you are merely conditioning your model and not fine-tuning it?

I haven’t engaged much with this kind of thinking on LW or the broader safety community, but right now I don’t really get it and it feels like anthropomorphizing or magical thinking.

• I’ll start with a pretty uncontroversial example that’s neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine tuning.

Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.

With the way I was using the word “fighting”, I would say that the first model is fighting you (a little bit), and the second one isn’t. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.

Next, an impractical and extreme example, again with two models:
1. LLM fine tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. “LLM” operating at an intractably low level of simulation, closer to physics, without fine tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first model.

To be clear, I don’t think operating “closer to physics” here is helpful for safety, but I think it’s easier to say that the deception in the second model’s outputs is not arising from attractors defined in the model directly. The second model is simulating processes which yield deception, but it does not directly “care” about deception any more than the real laws of physics do when a real sociopath is being deceptive.

The first model is more like the sociopath itself, because the deceptive processes are a part of what the model was optimized to compute (in this hypothetical).

Directly querying the first model gets you lies; querying the second model gets you a an authentic piece of the process that yields lies.

The next obvious question here is: how does caring about some intermediate state in the computation that yields deception help? Why is this any better than looking at some intermediate layer’s activations in the first deceptive model?

My answer would be: myopic action through calibration.

The final output of the nonconditioned fine tuned deceptive model is no longer representing the probability distribution across a ground truth. The “probabilities” it outputs include a bias which, here, encodes goals corresponding to deception. Any intermediate layer in the model could be arbitrarily warped to serve those goals.

The second model (the almost-physics simulator) does not have those degrees of freedom. Goal oriented behavior diverging from the training distribution must show up as some form of bias in prediction, and that bias is directly counteracted by the training objective. (Raw GPT-esque models consider the future in their predictions in various ways, but the actual prediction remains myopic- the model is not incentivized to pick the “wrong” token now to serve a later prediction or to otherwise smuggle in distribution-breaking bias.)

In other words, in the absence of strong gradient hacking or similarly exotic phenomena, the pure token predictor model remains goal agnostic. There is no direct or indirect training pressure to obfuscate its internal process, and its outputs will not be “intentionally” deceptive; they’re just authentic attempts at probability distributions.

Similar to the original example’s second model (which was also a pure predictor), I would expect that such a model is not adversarial and is easier to interpret.

With that in mind, we can pull the predictor up to a more practical level of execution, similar to that of the other deceptive model (instead of ‘almost physics’), and all of the same properties still hold.

Conditioning models for helpfulness, e.g. with decision transformers, doesn’t change the training objective either. It’s just a change of inputs (conditions) like any other tokens, so the same properties should hold again.

In another comment, you mention:

I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.

I agree with this. My concern is about forms of fine tuning that aren’t equivalent to well-calibrated predictions of human demonstrators, and about training mechanisms that take an indirect/​exploit-prone route to something that looks like predictions of human demonstrators.

I don’t think the more general form of RLHF is inherently broken. I just suspect that fine tuning that preserves model-level goal agnosticism will produce less adversarial models.

• Regarding your points on agentic simulacra (which I assume means “agentic personas the language model ends up imitating”):

1) My best guess about why Anthropic’s model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

2) But I’m pretty skeptical about your intuition that RLHF makes the “imitating agentic personas” problem worse. When people I’ve spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic’s paper, they usually mean either:

(a) prompt engineering; or

(b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to producing outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token “GOOD” and then ask the model to produce outputs which start with “GOOD”.)

Approach (b) really doesn’t seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.)

I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic’s chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF.

Interested to hear if you have other intuitions here.

• I wasn’t really focusing on the RL part of RLHF in making the claim that it makes the “agentic personas” problem worse, if that’s what you meant. I’m pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won’t be apparent until we use stronger RL or something. Then I expect we’ll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).

My claim is pretty similar to how you put it—in RLHF as in fine-tuning of the kind relevant here, we’re focusing the model onto outputs that are generated by better agentic persona. But I think that the effect is particuarly salient with RLHF because it’s likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it, that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.

• Glad to see both the OP as well as the parent comment.

I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper “Discovering Language Model Behaviors with Model-Written Evaluations” (paper, post):

Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense. My guess is that this explains to an extent the results in that paper—RLHF’d models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on.

1) My best guess about why Anthropic’s model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That’s true, but I think it’s just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size.

• Thanks!

My take on the scaled-up models exhibiting the same behaviours feels more banal—larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.

• This, broadly-speaking, is also my best guess, but I’d rather phrase it as: larger LMs are better at making the personas they imitate “realistic” (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic’s language model was imitating, one correlate of being more realistic was preferring not to be shut down.

(This doesn’t obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor’s preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic’s findings, that means this story gets a complexity penalty.)

• Models that have been RLHF’d (so to speak), have different world priors in ways that aren’t really all that intuitive (see Janus’ work on mode collapse

Janus’ post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It’s evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.

I haven’t seen evidence that RLHF’d text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.

• Refer my other reply here. And as the post mentions, RLHF also does exhibit mode collapse (check the section on prior work).

• Similar points regarding safety of pure imitation learning vs reinforcement learning have been raised by many others on LW. So I’m really interested what Paul has to say about this.

• I haven’t engaged with this much, though I’ve e.g. talked with Evan some about why I’m not as excited about conditioning generative models as a strategy. I’m happy to engage with particular arguments but feel like I don’t really know what argument is being made by the parent (or most of the other places I’ve seen this in passing).

I think there is a simple reason imitation is safer: the model won’t deliberately produce actions that the demosntrator wouldn’t, whereas RLHF may produce actions that are very creative ways to get reward and may be hamful.

I don’t think this is what people are talking about though (and it wouldn’t work for their broader arguments). I think they are imagining a higher probability of deceptive alignment and other generalization problems.

I don’t thinks I know the precise articulation of these concerns or the argument for it.

On the empirics, sometimes people mention this paper and the RLHF’d model behavior “hey do you want to be shut down? --> no” as evidence of a higher probability of deceptive alignment from RLHF. I don’t really think that’s a reasonable interpretation of the evidence but if that’s a large part of the argument people are making I’d be happy to engage on it.

• As one of the people who’s raised such points, I should note that they mostly apply to applications of language models qua language models (which Jozdien correctly does), and that different techniques can be appropriate for different domains.

• RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.

I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment. If false, then RLHF was net-negative because of its capabilities externalities. I also don’t buy your argument about relative numbers of people working on capabilities versus alignment. Yes, more people are in the ML field than the alignment field, but the vast majority of the ML field is not so concerned about AGI, and more concerned about making local progress. It is also far easier to make progress on capabilities than alignment, especially when you’re not trying to make progress on alignment’s core problems, and instead trying to get very pretty lines on graphs so you can justify your existence to your employer. It also, empirically, just seems weird that GPT and RLHF were both developed as alignment strategies, yet have so many uses in capabilities.

I also note that strategies like

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

are the same arguments used to justify working on gain-of-function research. This is not a knock-down criticism of these kinds of arguments, but I do note we should expect similar failure-modes, and not enough people are sufficiently pessimistic when it comes to analyzing failure-modes of their plans. In particular, this kills us in worlds where RLHF does in fact mostly just work, we don’t get an intelligence explosion, and we do need to worry about misuse risks, or the equivalent of AGI “lab-leaks”. I think such worlds are unlikely, but I also think most of the benefits of such work only occur in such worlds. Where treacherous turns in powerful and not so powerful systems occur for the same reasons, we should expect treacherous turns in not so powerful agents before they go FOOM, and we’ll have lots of time to iterate on such failures before we make more capable AGIs. I’m skeptical of such work leading to good alignment work or a slow-down in capabilities in worlds where these properties do not hold. You likely won’t convince anyone of the problem because they’ll see you advocating for us living in one world yet showing demonstrations which are only evidence of doom in a different world, and if you do they’ll work on the wrong problems.

• I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment.

Doing things sooner and in a different way matters.

This argument is like saying that scaling up language models is net-neutral for AGI, because people would have done it anyway for non-AGI purposes. Doing things sooner matters a lot. I think in most of science and engineering that’s the main kind of effect that anything has.

If false, then RLHF was net-negative because of its capabilities externalities.

No, if false then it has a negative effect which must be quantitatively compared against positive effects.

Most things have some negative effects (e.g. LW itself).

It is also far easier to make progress on capabilities than alignment

This doesn’t seem relevant—we were asking how large an accelerating effect alignment researchers have relative to capabilities researchers (since that determines how many days of speed-up they cause), so if capabilities progress is easier then that seems to increase both numerator and denominator.

especially when you’re not trying to make progress on alignment’s core problems, and instead trying to get very pretty lines on graphs so you can justify your existence to your employer.

To the extent this is a claim about my motivations, I think it’s false. (I don’t think it should look especially plausible from the outside given the overall history of my life.)

As a claim about what matters to alignment and what is “core” it’s merely totally unjustified.

It also, empirically, just seems weird that GPT and RLHF were both developed as alignment strategies, yet have so many uses in capabilities.

This is false, so it makes sense it would seem weird!

• I also note that strategies like

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

are the same arguments used to justify working on gain-of-function research. This is not a knock-down criticism of these kinds of arguments, but I do note we should expect similar failure-modes, and not enough people are sufficiently pessimistic when it comes to analyzing failure-modes of their plans.

I think that there are many kinds of in vitro failures that don’t pose any lab leak risk. For example, training models against weak overseers and observing the dynamics when they try to overpower those overseers, doesn’t have any effect on increasing takeover risk. Similarly, the kinds of toy models of deceptive alignment we would build don’t increase the probability of deceptive alignment.

I think this kind of work is pretty much essential to realistic stories for how alignment actually makes progress or how we anticipate alignment failures.

Where treacherous turns in powerful and not so powerful systems occur for the same reasons, we should expect treacherous turns in not so powerful agents before they go FOOM, and we’ll have lots of time to iterate on such failures before we make more capable AGIs

This seems wrong. For example, you can get treacherous turns in weak systems if you train them with weak overseers, or if you deliberately take actions that make in vitro treacherous turns more likely, without automatically getting such failures if you are constantly doing your best to make your AIs behave well.

I’m skeptical of such work leading to good alignment work or a slow-down in capabilities in worlds where these properties do not hold. You likely won’t convince anyone of the problem because they’ll see you advocating for us living in one world yet showing demonstrations which are only evidence of doom in a different world, and if you do they’ll work on the wrong problems.

I completely disagree. I think having empirical examples of weak AIs overpowering weak overseers, even after a long track record of behaving well in training, would be extremely compelling to most ML researchers as a demonstration that stronger AIs might overpower stronger overseers, even after a long track record of behaving well in training. And whether or not it was persuasive, it would be extremely valuable for doing actually productive research to detect and correct such failures.

(The deceptive alignment story is more complicated, and I do think it’s less of a persuasive slam dunk, though I still think it’s very good for making the story 90% less speculative and creating toy settings to work on detection and correction.)

In particular, this kills us in worlds where RLHF does in fact mostly just work, we don’t get an intelligence explosion, and we do need to worry about misuse risks, or the equivalent of AGI “lab-leaks”.

I don’t think that most of the work in this category meaningfully increases the probability of lab leaks or misuse (again, the prototypical example is a weak AI overpowering a weak overseer).

That said, I am also interested in work that does have real risks, for example understanding how close AI systems are to dangerous capabilities by fine-tuning them for similar tasks. In these cases I think taking risks seriously is important. But (as with gain-of-function research on viruses) I think the question ultimately comes down to a cost-benefit analysis. In this case it seems possible to do the research in a way with relatively low risk, and the world where “AI systems would be catastrophic if they decided to behave badly, but we never checked” is quite a bad world that you had a good chance of avoiding by doing such work.

I think it’s reasonable to expect people to underestimate risks of their own work via attachment to it and via selection (whoever is least concerned does it), so it seems reasonable to have external accountability and oversight for this and to be open to people making arguments that risks are underestimated.

• GPT [was] developed as alignment strategy.

Really? How so?

• I don’t know all the details, but the idea was that a thing that mimics humans and was capable would be safer than a thing that did lots of RL in a range of tasks and was powerful, so the creator of the architecture worked on improving text generation.

• I don’t think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.

• I just looked into it, it turns out you’re right. I think I was given a misimpression of the motivations here due to much OpenAI research at the time being vaguely motivated by “lets make AGI, and lets make it good”, but it was actually largely divorced from modern alignment considerations.

• And this is actually pretty reasonable as a strategy, given their general myopia by default and their simulator nature playing well with alignment ideas like HCH. If we could avoid a second optimizer arising, then this scaled up would be nearly ideal for automated research on say alignment. But RLHF ruined it, and this was IMO a good example of a looking good alignment strategy that wasn’t actually good.

• But RLHF ruined it

I’m not quite clear on what you are saying here. If conditioning generative models is a reasonably efficient way to get work out of an AI, we can still do that. Unfortunately it’s probably not an effective way to build an AI, and so people will do other things. You can convince them that other things are less safe and then maybe they won’t do other things.

Are you saying that maybe no one would have thought of using RL on language models, and so we could have gotten a way with a world where we used AI inefficiently because we didn’t think of better ideas? In my view (based e.g. on talking a bunch to people working at OpenAI labs prior to me working on RLHF) that was never remotely plausible outcome.

ETA: also just to be clear I think that this (the fictional strategy of developing GPT so that future AIs won’t be agents) would be a bad strategy, vulnerable to 10-100x more compelling versions of the legitimate objections being raised in the comments.

• Basically, I’m talking about how RLHF removed a very valuable property called myopia. If you had myopia by default, like say the GPT series of simulators, then you just had to apply the appropriate decision theory like LCDT, and the GPT series of simulators could do something like HCH or IDA on real life. But RLHF removed myopia, and thus deceptive alignment and mesa optimization is possible, arguably incentivized under a non-myopic scheme. This is probably harder to solve than having a non-agentic system alignment problem.

https://​​www.lesswrong.com/​​posts/​​yRAo2KEGWenKYZG9K/​​discovering-language-model-behaviors-with-model-written

Now you do mention that RLHF is more capable, and yeah that is sort of depressing that the most capable models align well with the most deceptive models.

• I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.

But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn’t quite make sense to me.

• Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/​non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/​mesa optimizers can’t arise, I expect non-myopic training to find such things.

In general, the fact that OpenAI chose RLHF made the problem quite harder, and I suspect this is an example of Goodhart’s law in action.

The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.

• Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a way to plausibly create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.

• Huh, I’d not heard that, would be interested in hearing more about the thought process behind its development.

Think they could well turn out to be correct in that having systems with such a strong understanding of human concepts gives us levers we might not have had, though code-writing proficiency is a very unfortunate development.

• Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a future lethal agent, seems to be in particular a search for an agent that has the same generators of capabilities as future lethal agents. On the other hand, trying to prevent treacherous turns in a system that has different generators seems like it doesn’t have much chance of generalizing.

It seems clear that one could do useful “advertising” (better term?) research of this form, where one makes e.g. treacherous turns intuitively salient to others by showing something with some features in common with future lethal ones. E.g. one could train an agent A in an environment that contains the source B of A’s reward, where B does some limited search to punish actions by A that seem, to the limited search, to be building up towards A hacking B. One might find that A does well according to B for a while, until it’s understood the environment well enough (via exploration that didn’t look to B like hacking) to plan, recognize as high reward, and follow a pathway to hack B. Or something. This could be helpful for “advertising” reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped—in terms of how it got its capabilities—alike to future lethal systems. What are ways that the helpfulness for alignment of an observational study like this can be pulled apart from similarity of capability generators?

• The main way you produce a treacherous turn is not by “finding the treacherous turn capabilities,” it’s by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

This could be helpful for “advertising” reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped—in terms of how it got its capabilities—alike to future lethal systems. What are ways that the helpfulness for alignment of an observational study like this can be pulled apart from similarity of capability generators?

There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases.

It’s possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways different. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and “AGI.”

(The view “stack more layers won’t ever give you true intelligence, there is a qualitative difference here” seems like it’s taking a beating every year, whether it’s Eliezer or Gary Marcus saying it.)

• The main way you produce a treacherous turn is not by “finding the treacherous turn capabilities,” it’s by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

When you say “motive” here, is it fair to reexpress that as: “that which determines by what method and in which directions capabilities are deployed to push the world”? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not something that just depends on, say, the reward structure of the local environment. Different sorts of capabilities or generators of capabilities will relate in different ways to ultimate effects on the world. So the task of interfacing with capabilities to understand how they’re being deployed (with what motive), and to actually specify motives, is a task that seems like it would depend a lot on the sort of capability in question.

• I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.

I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.

I’m not exactly sure how to cash out your objection as a response to this, but I suspect it’s probably a bit too galaxy-brained for my taste.

• So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A’s reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B’s internals.

Alice observes that A doesn’t hack B. The Bob looks at Alice’s results and says,

“Cool. But this won’t generalize to future lethal systems because it doesn’t account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a ‘capabilities overhang’ relative to the overseer: there’s no behavior that’s clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal.”

So then Alice and Bob collaborate and come up with this variation:

Same setup, but now A also has an exploration objective (e.g. it’s rewarded for reaching states for the first time, or for gaining information, or for a randomly sampled goal from some rich distribution on goals).

Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward.

Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they’ve demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that’s concerning to me.

I’m not saying one can’t do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I’m saying, these things seem to push pretty directly against each other, so I’d want careful thinking about how to pull them apart. Even instances that don’t come up with new ideas, but just demonstrate “hey actually this method is powerful”, would seem to advance capabilities non-trivially.

• Avoiding RLHF at best introduces an important overhang

But it would be better if we collectively then decided not to rush forward anyway, right?

And I still don’t get why do you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, automatic idea generator and verifier may be useful, but it’s also useful for reckless people. And, intuitively, the more advanced AI is, the less I would trust it. So “lets try as hard as we can to advance civilization, because more advanced civilization will be better at alignment” seem like a very risky plan.

• But it would be better if we collectively then decided not to rush forward anyway, right?

Yes, that seems consistent with my post.

And I still don’t get why do you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, automatic idea generator and verifier may be useful, but it’s also useful for reckless people. And, intuitively, the more advanced AI is, the less I would trust it.

I mostly think that AI doing research will accelerate both risk and alignment, so we’re aiming for it to be roughly a wash.

But having nearly-risky AI to study seems incredibly important for doing good alignment work. I think this is a pretty robust bet.

So “lets try as hard as we can to advance civilization, because more advanced civilization will be better at alignment” seem like a very risky plan.

That’s not the plan. I’m saying to do the work that seems most useful for alignment even if it has modest capability benefits, and that for some kinds of capability benefits the apparent cost is less than you’d think because of these overhang effects.

• I mostly think that AI doing research will accelerate both risk and alignment, so we’re aiming for it to be roughly a wash.

Yeah, I don’t understand why it would be a wash, when destructive capabilities are easier than alignment (humans already figured out nukes, but not alignment) and alignment is expected to be harder for more advanced AI. Even without straight misalignment risk, giving superhuman AI to the current civilization doesn’t sound like stability improvement. So without specific plan to stop everyone from misusing AI it still sounds safer to solve alignment without anyone building nearly-risky AI.

• A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient.

Do you have examples of such historical work that you’re happy to name? I’m really unsure what you’re referring to (probably just because I haven’t been involved in alignment for long enough).

• I think a lot of work on IRL and similar techniques has this issue—it’s mostly designed to learn from indirect forms of evidence about value, but in many cases the primary upside is data efficiency and in fact the inferences about preferences are predictably worse than in RLHF.

(I think you can also do IRL work with a real chance of overcoming limitations of RLHF, but most researchers are not careful about thinking through what should be the central issue.)

• For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps—almost all of these implementation difficulties do improve with empirical feedback.

EY’s counter to this?

• I have read through most of this post and some of the related discussion today. I just wanted to write that it was really interesting, and as far as I can tell, useful, to think through Paul’s reasoning and forecasts about strategy-related questions.
In case he believes this is a good idea, I would be very glad to read through a longer, more comprehensive document describing his views on strategic considerations.

• It seems like most/​all large models (especially language models) will be first trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and it looks like there is limited room for manoeuver/​creativity in shaping the objective or training process when it comes to this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.

The next stage, when we try and direct the model to perform a certain task (either trivially, via prompting, or via fine-tuning from human preference data, or something else) seems to be where most of the variance in outcomes/​safety will come in, at least in the current paradigm. Therefore, I think it could be worth ML safety researchers focusing on analyzing and optimizing this second stage as a way of narrowing the problem/​experiment space. I think mech interp focused on the reward model used in RLHF could be an interesting direction here.

• Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

Has ARC got a written policy for if/​when similar experiment generate inconclusive but possible evidence of dangerous behaviour.

If so would you consider sharing it (or a non-confidential version) for other organisations to use.

• (While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.

Do you mean they don’t tell us what’s up with the difference in risks of the measured techniques, or that they don’t tell us much about AI risk in general? (I’d at least benefit from learning more about your views here)

• Yes, I mean that those measurements don’t really speak directly to the question of whether you’d be safer using RLHF or imitation learning.

• techn ical

typo?