Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
Richard_Ngo
Strong +1s to many of the points here. Some things I’d highlight:
Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he’d have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he’s found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn’t have found that as difficult. (I’m sympathetic about Eliezer having in the past engaged with many interlocutors who were genuinely very bad at understanding his arguments. However, it does seem like the lack of detail in those arguments is now a bigger bottleneck.)
I think that the intuitions driving Eliezer’s disagreements with many other alignment researchers are interesting and valuable, and would love to have better-fleshed-out explanations of them publicly available. Eliezer would probably have an easier time focusing on developing his own ideas if other people in the alignment community who were pessimistic about various research directions, and understood the broad shape of his intuitions, were more open and direct about that pessimism. This is something I’ve partly done in this post; and I’m glad that Paul’s partly done it here.
I like the analogy of a mathematician having intuitions about the truth of a theorem. I currently think of Eliezer as someone who has excellent intuitions about the broad direction of progress at a very high level of abstraction—but where the very fact that these intuitions are so abstract rules out the types of path-dependencies that I expect solutions to alignment will actually rely on. At this point, people who find Eliezer’s intuitions compelling should probably focus on fleshing them out in detail—e.g. using toy models, or trying to decompose the concept of consequentialism—rather than defending them at a high level.
Suppose we took this whole post and substituted every instance of “cure cancer” with the following:
Version A: “win a chess game against a grandmaster”
Version B: “write a Shakespeare-level poem”
Version C: “solve the Riemann hypothesis”
Version D: “found a billion-dollar company”
Version E: “cure cancer”
Version F: “found a ten-trillion-dollar company”
Version G: “take over the USA”
Version H: “solve the alignment problem”
Version I: “take over the galaxy”

And so on. Now, the argument made in version A of the post clearly doesn’t work, the argument in version B very likely doesn’t work, and I’d guess that the argument in version C doesn’t work either. Suppose I concede, though, that the argument in version I works: that searching for an oracle smart enough to give us a successful plan for taking over the galaxy will very likely lead us to develop an agentic, misaligned AGI. Then that still leaves us with the question: what about versions D, E, F, G and H? The argument is structurally identical in each case—so what is it about “curing cancer” that is so hard that, unlike winning chess or (possibly) solving the Riemann hypothesis, when we train for that we’ll get misaligned agents instead?
We might say: well, for humans, curing cancer requires high levels of agency. But humans are really badly optimised for many types of abstract thinking—hence why we can be beaten at chess so easily. So why can’t we also be beaten at curing cancer by systems less agentic than us?
Eliezer has a bunch of intuitions which tell him where the line of “things we can’t do with non-dangerous systems” should be drawn, which I freely agree I don’t understand (although I will note that it’s suspicious how most people can’t do things on the far side of his line, but Einstein can). But insofar as this post doesn’t consider which side of the line curing cancer is actually on, then I don’t think it’s correctly diagnosed the place where Eliezer and I are bouncing off each other.
I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.
The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
(I make a similar point in the appendix of my value systematization post.)
Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to recommend it. (Algon makes a similar point in another comment.) Though I do agree that, based on the numbers you gave for how many junior researchers’ projects are focusing on interpretability, people are probably overweighting it.
I think this post is an example of a fairly common phenomenon where alignment people are too focused on backchaining from desired end states, and not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better. (By contrast, most ML researchers are too focused on the latter.)
Perhaps the main problem I have with interp is that it implicitly reinforces the narrative that we must build powerful, dangerous AIs, and then align them. For X-risks, prevention is better than cure. Let’s not build powerful and dangerous AIs. We aspire for them to be safe, by design.
I particularly disagree with this part. The way you get safety by design is understanding what’s going on inside the neural networks. More generally, I’m strongly against arguments of the form “we shouldn’t do useful work, because then it will encourage other people to do bad things”. In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
A couple of quick, loosely-related thoughts:
I think the heuristic “people take AI risk seriously in proportion to how seriously they take AGI” is a very good one. There are some people who buck the trend (e.g. Stuart Russell, some accelerationists), but it seems broadly true (e.g. Hinton and Bengio started caring more about AI risk after taking AGI more seriously). This should push us towards thinking that the current wave of regulatory interest wasn’t feasible until after ChatGPT.
I think that DC people were slower/more cautious about pushing the Overton Window after ChatGPT than they should have been. I think they should update harder from this mistake than they’re currently doing (e.g. updating that they’re too biased towards conventionality). There probably should be at least one person with solid DC credentials who’s the public rallying point for “I am seriously worried about AI takeover”.
I think that “doomers” were far too pessimistic about governance before ChatGPT (in ways that I and others predicted beforehand, e.g. in discussions with Ben and Eliezer). I think they should update harder from this mistake than they’re currently doing (e.g. updating that they’re too biased towards inside-view models and/or fast takeoff and/or high P(doom)).
I think that the AI safety community in general (including myself) was too pessimistic about OpenAI’s strategy of gradually releasing models (COI: I work at OpenAI), and should update more on that mistake.
I think there’s a big structural asymmetry where it’s hard to see the ways in which DC people are contributing to big wins (like AI executive orders), and they can’t talk about it, and the value of this work (and the tradeoffs they make as part of that) is therefore underestimated.
“One of the biggest conspiracies of the last decade” doesn’t seem right. The amount of money/influence involved in FTX is dwarfed by the amount of money/influence thrown around by governments in general, and it’s easier for factions within governments to enforce secrecy than for corporations to do so. More concretely, I’d say that there were probably several different “conspiratorial” things related to covid in various countries that had much bigger effects; probably several more related to ongoing Russia-Ukraine and Israel-Palestine conflicts; probably several more Trump/Biden-related things; maybe some to do with culture-war stuff; probably a few more prosaic fraud or corruption things that stole tens of billions of dollars, just less publicly (e.g. from big government contracts); a bunch of criminal gangs which also have far more money than FTX did; and almost certainly a bunch that don’t fall into any of those categories. (For example, if the CIA is currently doing any stuff comparable to its historical record of messing around with South American countries, that’s plausibly far bigger than FTX. Or various NSA surveillance type things are likely a much bigger deal, in terms of impact, than FTX. Oh, and stuff like NotPetya should probably count too.)
There’s at least one case where I hesitated to express my true beliefs publicly because I was picturing Conjecture putting the quote up on the side of a truck. I don’t know how much I endorse this hesitation, but it’s definitely playing some role in my decisions, and I expect will continue to do so.
I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that’s being made here is failing to recognize that reality doesn’t grade on a curve when it comes to understanding the world—your arguments can be false even if nobody has refuted them. That’s particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that’s fine, this might be necessary, and so it’s good to have some people pushing in this direction, but it seems like a bunch of people around here don’t just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think it’s possible to criticise work on RLHF while taking seriously the possibility that empirical work on our biggest models is necessary for solving alignment. But criticisms like this one seem to showcase a kind of blindspot. I’d be more charitable if people in the LW cluster had actually tried to write up the arguments for things like “why inner misalignment is so inevitable”. But in general people have put shockingly little effort into doing so, with almost nobody trying to tackle this rigorously. E.g. I was surprised when my debates with Eliezer involved him still using all the same intuition-pumps as he did in the sequences, because to me the obvious thing to do over the next decade is to flesh out the underlying mental models of the key issue, which would then allow you to find high-level intuition pumps that are both more persuasive and more trustworthy.
I’m more careful than John about throwing around aspersions on which people are “actually trying” to solve problems. But it sure seems to me that blithely trusting your own intuitions because you personally can’t imagine how they might be wrong is one way of not actually trying to solve hard problems.
I think one core issue here is that there are actually two debates going on. One is “how hard is the alignment problem?”; another is “how powerful are prosaic alignment techniques?” Broadly speaking, I’d characterise most of the disagreement as being on the first question. But you’re treating it like it’s mostly on the second question—like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.
My attempt to portray EY’s perspective is more like: he’s concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he’s trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.
Regardless of that, calling the debate “one sided” seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem “one sided” − 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily. But for fairly similar meta-level reasons as why it’s good for them to listen to us in an open-minded way, it’s also good for prosaic alignment researchers to listen to EY in an open-minded way. (As a side note, I’d be curious what credence you place on EY’s worldview being more true than the prosaic alignment worldview.)
Now, your complaint might be that MIRI has not made their case enough over the last few years. If that’s the main issue, then stay tuned; as Rob said, this is just the preface to a bunch of relevant material.
One fairly strong belief of mine is that Less Wrong’s epistemic standards are not high enough to make solid intellectual progress here. So far my best effort to make that argument has been in the comment thread starting here. Looking back at that thread, I just noticed that a couple of those comments have been downvoted to negative karma. I don’t think any of my comments have ever hit negative karma before; I find it particularly sad that the one time it happens is when I’m trying to explain why I think this community is failing at its key goal of cultivating better epistemics.
There’s all sorts of arguments to be made here, which I don’t have time to lay out in detail. But just step back for a moment. Tens or hundreds of thousands of academics are trying to figure out how the world works, spending their careers putting immense effort into reading and producing and reviewing papers. Even then, there’s a massive replication crisis. And we’re trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.
Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I’ll call it the BAVM model; the one-sentence summary is “internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds”. There’s little novel here, I’m just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).
In more detail, the three main components are:
A prediction market
An action auction
A value election
You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:
Making accurate predictions about future sensory experiences on the market.
Taking actions which lead to reward or increase the agent’s expected future value.
They spend money in three ways:
Bidding to control the agent’s actions for the next N timesteps.
Voting on what actions get reward and what states are assigned value.
Running the computations required to figure out all these trades.
Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same ontologies they use for prediction/action, since that’s simpler than using different ontologies. (Note that this does require the assumption that simpler traders start off with more money.)
The last component is that it costs traders money to do computation. The way they can reduce this is by finding other traders who do similar computations as them, and then merging into a single trader. I am very interested in better understanding what a merging process like this might look like, though it seems pretty intractable in general because it will depend a lot on the internal details of the traders. (So perhaps a more principled approach here is to instead work top-down, figuring out what sub-markets or sub-auctions look like?)
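To make the moving parts above concrete, here is a minimal code sketch of the BAVM loop (minus merging). Everything here is my own illustrative scaffolding—the `Trader` interface, the Brier-style payout, the specific states and rewards are all made-up stand-ins; the only thing the sketch is meant to capture is the money flow: accurate predictors and successful actors get richer, and wealth then weights the value election.

```python
class Trader:
    """A trader with wealth plus three hypothetical interfaces: a predictor,
    an action policy with a bid, and a value vote. All three are stand-ins;
    the interesting structure is in how money flows between them."""
    def __init__(self, name, predict, act, vote, wealth=100.0):
        self.name, self.wealth = name, wealth
        self.predict = predict  # state -> prob that the next observation is 1
        self.act = act          # state -> (action, bid)
        self.vote = vote        # () -> {state: value} preferences

class BAVMAgent:
    """Toy 'bet on beliefs, auction actions, vote on values' agent.
    Merging (and computation costs) are omitted for brevity."""
    def __init__(self, traders):
        self.traders = traders
        self.values = {}  # state -> value, set by wealth-weighted vote

    def step(self, state, observation, reward_for):
        # 1. Prediction market: score each trader on the realised binary
        #    observation with a Brier-style (proper) payout.
        for t in self.traders:
            p = t.predict(state)
            t.wealth += 10.0 * (1.0 - (p - observation) ** 2) - 5.0

        # 2. Action auction: the highest bidder pays its bid, controls the
        #    action, and alone collects the resulting reward.
        winner = max(self.traders, key=lambda t: t.act(state)[1])
        action, bid = winner.act(state)
        winner.wealth += reward_for(action) - bid

        # 3. Value election: wealth-weighted vote over state values.
        tally = {}
        for t in self.traders:
            for s, v in t.vote().items():
                tally[s] = tally.get(s, 0.0) + t.wealth * v
        total = sum(t.wealth for t in self.traders) or 1.0
        self.values = {s: v / total for s, v in tally.items()}
        return action

# Two traders: one predicts well and bids high, one does neither.
good = Trader("good", predict=lambda s: 0.9,
              act=lambda s: ("explore", 1.0), vote=lambda: {"home": 1.0})
bad = Trader("bad", predict=lambda s: 0.1,
             act=lambda s: ("wait", 0.5), vote=lambda: {"away": 1.0})
agent = BAVMAgent([good, bad])
for _ in range(20):
    agent.step(state=None, observation=1, reward_for=lambda a: 2.0)
# The accurate trader accumulates wealth, so its value ("home") dominates.
```

The point of running it is the last line: after a few steps the values the agent reports are just the values of whichever traders happened to be good at prediction and action, which is the dynamic described above.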
There’s something that’s been bugging me lately about the rationalist discourse on moral mazes, political power structures, the NYT/SSC kerfuffle, etc. People are making unusually strong non-consequentialist moral claims without providing concomitantly strong arguments, or acknowledging the ways in which this is a judgement-warping move.
I don’t think that being non-consequentialist is always wrong. But I do think that we have lots of examples of people being blinded by non-consequentialist moral intuitions, and it seems like rationalists around me are deliberately invoking risk factors. Some of the risk factors: strong language, tribalism, deontological rules, judgements about the virtue of people or organisations, and not even trying to tell a story about specific harms.
Your post isn’t a central example of this, but it seems like your argument is closely related to this phenomenon, and there are also a few quotes from your post which directly showcase the thing I’m criticising:
they would be notably more disgusted with the parts of that system they interacted with
And:
They’d rather believe the things around them are pretty good rather than kinda evil. Evil means accounting, and accounting is boooring.
And:
The first time was with Facebook, where he was way in advance of me coming to realize what was evil about it.
“Evil” is one of the most emotionally loaded words in the english language. Disgust is one of the most visceral and powerful emotions. Neither you nor I nor other readers are immune to having our judgement impaired by these types of triggers, especially when they’re used regularly. (Edit: to clarify, I’m not primarily worried about worst-case interpretations; I’m worried about basically everyone involved.)
Now, I’m aware of major downsides of being too critical of strong language and bold claims. But being careful of gratuitously using words like “evil” and “insane” and “Stalinist” isn’t an unusually high bar; even most people on Twitter manage it.
Other closely-related examples: people invoking anti-media tribalism in defence of SSC; various criticisms of EA for not meeting highly scrupulous standards of honesty (using words like “lying” and “scam”); talking about “attacks” and “wars”; taking hard-line views on privacy and the right not to be doxxed; etc.
Oh, and I should also acknowledge that my calls for higher epistemic standards are driven to a significant extent by epistemically-deontological intuitions. And I do think this has warped my judgement somewhat, because those intuitions lead to strong reactions to people breaking the “rules”. I think the effect is likely to be much stronger when driven by moral (not just epistemic) intuitions, as in the cases discussed above.
I originally found this comment helpful, but have now found other comments pushing back against it to be more helpful. Upon reflection, I don’t think the comparison to MATS is very useful (a healthy field will have a bunch of intro programs), the criticism of Remmelt is less important given that Linda is responsible for most of the projects, the independence of the impact assessment is not crucial, and the lack of papers is relatively unsurprising given that it’s targeting earlier-stage researchers/serving as a more introductory funnel than MATS.
My favorite section of this post was the “green according to non-green” section, which I felt captured really well the various ways that other colors see past green.
I don’t fully feel like the green part inside me resonated with any of your descriptions of it, though. So let me have a go at describing green, and seeing if that resonates with you.
Green is the idea that you don’t have to strive towards anything. Thinking that green is instrumentally useful towards some other goal misses the whole point of green, which is about getting out of a goal- or action-oriented mindset. When you do that, your perception expands from a tunnel-vision “how can I get what I want” to actually experiencing the world in its unfiltered glory—actually looking at the redwoods. If you do that, then you can’t help but feel awe. And when you step out of your self-oriented tunnel, suddenly the world has far more potential for harmony than you’d previously seen, because in fact the motivations that are causing the disharmony are… illusions, in some sense. Green looks at someone cutting down a redwood and sees someone who is hurting themself, by forcibly shutting off the parts of themselves that are capable of appreciation and awe. Knowing this doesn’t actually save the redwoods, necessarily, but it does make it far easier to be in a state of acceptance, because deep down nobody is actually your enemy.
Thanks for the post, I think it’s a useful framing. Two things I’d be interested in understanding better:
In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).
As I said in a reply to Eliezer’s AGI ruin post:
There are some ways in which AGI will be analogous to human evolution. There are some ways in which it will be disanalogous. Any solution to alignment will exploit at least one of the ways in which it’s disanalogous. Pointing to the example of humans without analysing the analogies and disanalogies more deeply doesn’t help distinguish between alignment proposals which usefully exploit disanalogies, and proposals which don’t.
So I’d be curious to know what you think the biggest disanalogies are between the example of human evolution and building AGI. Relatedly, would you consider raising a child to be a “real example of intelligence being developed”; why or why not?
Secondly:
Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure
Granting that there’s a bunch of logical structure around how to think in accurate ways (e.g. solving scientific problems), and there’s a bunch of logical structure around how to pursue goals coherently (e.g. avoiding shutdown) what’s the strongest reason to believe that agents won’t learn something closely approximating the former before they learn something closely approximating the latter? My impression of Eliezer’s position is that it’s because they’re basically the same structure—if you agree with this, I’d be curious what sort of intuitions or theorems are most responsible for this belief.
(Another way of phrasing this question: suppose I made an analogous argument before the industrial revolution, saying something like “matter and energy are fundamentally the same thing at a deep level, we’ll soon be able to harness superhuman amounts of energy, therefore we’re soon going to be able to create superhuman amounts of matter”. Yet in fact, while the premise of mass-energy equivalence is true, the constants are such that it takes stupendously more energy than humans can generate, in order to produce human-sized piles of matter. What’s the main thing that makes you think that the constants in the intelligence case are such that AIs will converge to goal-coherence before, or around the same time as, superhuman scientific capabilities?)
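(To make the constants in that analogy concrete—these are rough, order-of-magnitude figures I’m supplying for illustration, not anything from the original argument:)

```python
# Energy needed to create a "human-sized pile of matter" from pure energy
# via E = mc^2, ignoring that we have no efficient conversion process,
# which makes the real gap far worse.
c = 3.0e8          # speed of light, m/s
mass = 70.0        # kg, roughly one human
energy = mass * c**2               # joules: 70 * 9e16 = 6.3e18 J
tnt_ton = 4.184e9                  # joules per ton of TNT
gigatons = energy / tnt_ton / 1e9  # ~1.5 gigatons of TNT equivalent
```

So even though mass-energy equivalence is exactly true, conjuring one human’s worth of matter would take on the order of a gigaton-scale energy release—the constants, not the equivalence, are what kill the argument.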
(Written quickly and not very carefully.)
I think it’s worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya’s “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover”, and Cohen et al.’s “Advanced artificial agents intervene in the provision of reward”. They focus on policies learning the goal of getting high reward. But I have two problems with this:
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they’d converge to it eventually, but my guess is that this would take long enough that we’d already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the “convergence” argument). Analogously, humans don’t care very much at all about the specific connections between our reward centers and the rest of our brains—insofar as we do want to influence them it’s because we care about much more directly-observable phenomena like pain and pleasure.
Even once you learn a goal like that, it’s far from clear that it’d generalize in ways which lead to power-seeking. “Reward” is not a very natural concept, it doesn’t apply outside training, and even within training it’s dependent on the specific training algorithm you use. Trying to imagine what a generalized goal of “reward” would cash out to gets pretty weird. As one example: it means that every time you deploy the policy without the intention of rewarding it, then its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable! But wouldn’t it be deceptive? Well, only within the scope of its current episode, because trying to get higher reward in other episodes is never positively reinforced. Wouldn’t it learn the high-level concept of “reward” in general, in a way that’s abstracted from any specific episode? That feels analogous to a human learning to care about “genetic fitness” but not distinguishing between their own genetic fitness and the genetic fitness of other species. And remember point 1: the question is not whether the policy learns it eventually, but rather whether it learns it before it learns all the other things that make our current approaches to alignment obsolete.
At a high level, this comment is related to Alex Turner’s Reward is not the optimization target. I think he’s making an important underlying point there, but I’m also not going as far as he is. He says “I don’t see a strong reason to focus on the “reward optimizer” hypothesis.” I think there’s a pretty good reason to focus on it—namely that we’re reinforcing policies for getting high reward. I just think that other people have focused on it too much, and not carefully enough—e.g. the “without specific countermeasures” claim that Ajeya makes seems too strong, if the effects she’s talking about might only arise significantly above human level. Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
But what makes you so confident that it’s not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?
Yepp, this is a judgement call. I don’t have any hard and fast rules for how much you should expect experts’ intuitions to plausibly outpace their ability to explain things. A few things which inform my opinion here:
Explaining things to other experts should be much easier than explaining them to the public.
Explaining things to other experts should be much easier than actually persuading those experts.
It’s much more likely that someone has correct intuitions if they have a clear sense of what evidence would make their intuitions stronger.
I don’t think Eliezer is doing particularly well on any of these criteria. In particular, the last one was why I pressed Eliezer to make predictions rather than postdictions in my debate with him. The extent to which Eliezer seemed confused that I cared about this was a noticeable update for me in the direction of believing that Eliezer’s intuitions are less solid than he thinks.
It may be the case that Eliezer has strong object-level intuitions about the details of how intelligence works which he’s not willing to share publicly, but which significantly increase his confidence in his public claims. If so, I think the onus is on him to highlight that so people can make a meta-level update on it.
The people who finally find out how the planets move will be spiritual descendants of the group A. … The problem with the group B is that it has no energy to move forward.
In this particular example, it’s true that Group A was more correct. This is because planetary physics can be formalised relatively easily, and also because it’s a field where you can only observe, not experiment. But imagine the same conversation between sociologists who are trying to find out what makes people happy, or between venture capitalists trying to find out what makes startups succeed. In those cases, Group B can move forward using the sort of “energy” that biologists and inventors and entrepreneurs have, driven by an experimental and empirical mindset. Whereas Group A might spend a long time writing increasingly elegant equations which rely on unjustified simplifications.
Instinctively reasoning about intelligence using analogies from physics instead of the other domains I mentioned above is a very good example of rationality realism.
The alternative would have been to embed a small lecture about international relations into the article.
I don’t think that’s correct; there are cheap ways of making sentences like this one more effective as communication (e.g. less passive/vague phrasing than “run some risk”, which could mean many different things). And I further claim that most smart people, if they actually spent 5 minutes by the clock thinking of the places where there’s the most expected disvalue from being misinterpreted, would have identified that the sentences about nuclear exchanges are in fact likely to be the controversial ones, and that those sentences are easy to misinterpret (or prime others to misinterpret). Communication is hard in general, and we’re not seeing all the places where Eliezer did make sensible edits to avoid being misinterpreted, but I still think this example falls squarely into the “avoidable if actually trying” category.
habitually dumbing things down in that way would be catastrophic. Because we still need to discover a solution to AI alignment, and I don’t think it’s possible to do that without discussing a lot of high-complexity things that can’t be said at all under contextualizing norms
That’s why you do the solving in places that are higher-fidelity than twitter/mass podcasts/open letters/etc, and the communication or summarization in much simpler forms, rather than trying to shout sentences that are a very small edit distance from crazy claims in a noisy room of people with many different communication norms, and being surprised when they’re interpreted differently from how you intended. (Edit: the “shouting” metaphor is referring to twitter, not to the original Time article.)
I agree that self-deception is common. But there are at least three reasons why assuming good faith is still a useful strategy:
You shouldn’t just think of the assumption of good faith as a reaction to other people’s lack of self-deception; you should also think of it as a way of mitigating your own self-deception. If you assume bad faith about others, then it’s very easy to talk yourself into doing all sorts of uncharitable or uncooperative rhetorical moves, like lying to them, or yelling at them, or telling them that they only have their position because they’re a bad person. You can tell yourself that you need to work around the self-deceptive parts of them, by pushing the right buttons to make the interaction productive. Yet a lot of these rhetorical moves will in fact be driven by your own hidden motivations, like your desire to avoid backing down, or your desire to punish the outgroup. So assuming good faith gives those motivations less cover.
Discussions are an iterated game, and it’s easy for one person to accidentally do something which is interpreted by the other as a sign of bad faith, which causes the second person to respond in kind, and so on. Assuming good faith is like adding (limited) amounts of forgiveness to this tit-for-tat interaction.
While everyone has hidden motives, it’s hard to know which hidden motives are at play in any given discussion. So when Zack says “This is difficult to pull off, which is why most people most of the time should stick to the object level”, this can be seen as another way of saying “actually, as a strong heuristic, assume good faith”.
Having said all this: some people should assume more good faith than they currently do; others less; and it’s hard to know where the line is.
I am specifically saying that when you measure effort in units of pain, this systematically leads to really bad places.
I think this is probably a useful insight, and seems to have resonated with quite a few people.
I’m specifically disputing your further conclusion that people in general should believe: “if it hurts, you’re probably doing it wrong” (and also “You’re not trying your best if you’re not happy.”). In fact, these are quite different from the original claim, and also broader than it, which is why they seem like overstatements to me.
I’m reminded of Buck’s argument that it’s much easier to determine that other people are wrong, than to be right yourself. In this case, even though I buy the criticism of the existing heuristic, proposing new heuristics is a difficult endeavour. Yet you present them as if they follow directly from your original insight. I mean, what justifies claims like “in practice [trading off happiness for short bursts of productivity] is never worth it”? Like, never? Based on what?
I get that this is an occupational hazard of writing very motivational/call-to-arms type posts. But you can motivate people without making blanket assertions about how to think about all domains. This seems particularly important on Less Wrong, where there’s a common problem that content of the category “interesting speculation about psychology and society, where I have no way of knowing if it’s true” is interpreted as solid intellectual progress.
I think there’s a bunch of useful stuff here. In particular, I think that decisions driven by deep-rooted fear are often very counterproductive, and that many rationalists often have “emergency mobilization systems” running in ways which aren’t conducive to good long-term decision-making. I also think that paying attention to bodily responses is a great tool for helping fix this (and in fact was helpful for me in defusing annoyance when reading this post). But I want to push back on the way in which it’s framed in various places as an all-or-nothing: exit the game, or keep playing. Get sober, or stay drunk. Hallucination, not real fear.
In fact, you can do good and important work while also gradually coming to terms with your emotions, trying to get more grounded, and noticing when you’re making decisions driven by visceral fear and taking steps to fix that. Indeed, I expect that almost all good and important work throughout history has been done by people who are at various stages throughout that process, rather than people who first dealt with their traumas and only then turned to the work. (EDIT: in a later comment, Valentine says he doesn’t endorse the claim that people should deal with traumas before doing the work, but does endorse the claim that people should recognize the illusion before doing the work. So better to focus on the latter (I disagree with both).)
(This seems more true for concrete research, and somewhat (but less) true for thinking about high-level strategy. In general it seems that rationalists spend way too much of their time thinking about high-level strategic considerations, and I agree with some of Valentine’s reasoning about why this happens. Instead I’d endorse people trying to be much more focused on making progress in a few concrete areas, rather than trying to track everything which they think might be relevant to AI risk. E.g. acceleration is probably bad, but it’s fundamentally a second-order effect, and the energy focused on all but the biggest individual instances of acceleration would probably be better used to focus on first-order effects.)
In other words, I want to offer people the affordance to take on board the (many) useful parts of Valentine’s post without needing to buy into the overall frame in which your current concerns are just a game, and your fear is just a manifestation of trauma.
(Relatedly, from my vantage point it seems that “you need to do the trauma processing first and only then do useful work” is a harmful self-propagating meme in a very similar way as “you need to track and control every variable in order for AI to go well”. Both identify a single dominant consideration which requires your full focus and takes precedence over all others. However, I still think that the former is directionally correct for most rationalists, just as the latter is directionally correct for most non-rationalists.)