Richard_Ngo
Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
The ants and the grasshopper
Policy discussions follow strong contextualizing norms
Realism about rationality
Some conceptual alignment research projects
Every “Every Bay Area House Party” Bay Area House Party
Thiel on Progress and Stagnation
Some cruxes on impactful alternatives to AI policy work
The King and the Golem
Succession
Strong +1s to many of the points here. Some things I’d highlight:
Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he’d have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he’s found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn’t have found that as difficult. (I’m sympathetic to the fact that Eliezer has, in the past, engaged with many interlocutors who were genuinely very bad at understanding his arguments. However, it does seem like the lack of detail in his own arguments is now the bigger bottleneck.)
I think that the intuitions driving Eliezer’s disagreements with many other alignment researchers are interesting and valuable, and would love to have better-fleshed-out explanations of them publicly available. Eliezer would probably have an easier time focusing on developing his own ideas if other people in the alignment community who were pessimistic about various research directions, and understood the broad shape of his intuitions, were more open and direct about that pessimism. This is something I’ve partly done in this post, and I’m glad that Paul’s partly done it here.
I like the analogy of a mathematician having intuitions about the truth of a theorem. I currently think of Eliezer as someone who has excellent intuitions about the broad direction of progress at a very high level of abstraction—but where the very fact that these intuitions are so abstract rules out the types of path-dependencies that I expect solutions to alignment will actually rely on. At this point, people who find Eliezer’s intuitions compelling should probably focus on fleshing them out in detail—e.g. using toy models, or trying to decompose the concept of consequentialism—rather than defending them at a high level.
Alignment research exercises
Masterpiece
Suppose we took this whole post and substituted every instance of “cure cancer” with the following:
Version A: “win a chess game against a grandmaster”
Version B: “write a Shakespeare-level poem”
Version C: “solve the Riemann hypothesis”
Version D: “found a billion-dollar company”
Version E: “cure cancer”
Version F: “found a ten-trillion-dollar company”
Version G: “take over the USA”
Version H: “solve the alignment problem”
Version I: “take over the galaxy”

And so on. Now, the argument made in version A of the post clearly doesn’t work, the argument in version B very likely doesn’t work, and I’d guess that the argument in version C doesn’t work either. Suppose I concede, though, that the argument in version I works: that searching for an oracle smart enough to give us a successful plan for taking over the galaxy will very likely lead us to develop an agentic, misaligned AGI. Then that still leaves us with the question: what about versions D, E, F, G and H? The argument is structurally identical in each case—so what is it about “curing cancer” that, unlike winning chess or (possibly) solving the Riemann hypothesis, makes training for it produce misaligned agents instead?
We might say: well, for humans, curing cancer requires high levels of agency. But humans are really badly optimised for many types of abstract thinking—hence why we can be beaten at chess so easily. So why can’t we also be beaten at curing cancer by systems less agentic than us?
Eliezer has a bunch of intuitions which tell him where the line of “things we can’t do with non-dangerous systems” should be drawn, which I freely agree I don’t understand (although I will note that it’s suspicious how most people can’t do things on the far side of his line, but Einstein can). But insofar as this post doesn’t consider which side of the line curing cancer actually falls on, I don’t think it has correctly diagnosed the place where Eliezer and I are bouncing off each other.
Disentangling arguments for the importance of AI safety
[$10k bounty] Read and compile Robin Hanson’s best posts
AGI safety career advice
Clarifying and predicting AGI
I feel kinda frustrated whenever “shard theory” comes up in a conversation, because it’s not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
This is a particular pity because I think there’s a version of the “shard” framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in “subagents” which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly “goal-directed” as you go up the hierarchy. This is an old idea, FWIW; e.g. it’s how Minsky frames intelligence in Society of Mind. And it’s also somewhat consistent with the claim made in the original shard theory post, that “shards are just collections of subshards”.
The problem is the “just”. The post also says “shards are not full subagents”, and that “we currently estimate that most shards are ‘optimizers’ to the extent that a bacterium or a thermostat is an optimizer.” But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from “heuristic” to “agent”, and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral “shards” (like caring about other people’s welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
(I make a similar point in the appendix of my value systematization post.)
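To make that spectrum concrete, here’s a minimal toy sketch (purely illustrative; the names, types, and scoring rule are all made up, and none of this is claimed to be the shard theory authors’ model): heuristics are shallow stimulus-response rules, and composing them into deeper hierarchies yields subagents that behave more and more like optimizers of a single aggregated objective.

```python
# Toy sketch of the "heuristic -> subagent" spectrum (illustrative only).
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass
class Heuristic:
    """A shallow, context-triggered rule, e.g. 'if someone is hurt, help them'."""
    name: str
    score: Callable[[str], float]  # crude approval of a proposed action


@dataclass
class Subagent:
    """Composes heuristics and/or lower-level subagents into something that
    consistently ranks actions -- i.e. closer to having an explicit goal."""
    name: str
    parts: Sequence[Union["Heuristic", "Subagent"]]

    def score(self, action: str) -> float:
        # Aggregate the "votes" of all parts into a single number.
        return sum(p.score(action) for p in self.parts)

    def choose(self, actions: Sequence[str]) -> str:
        # The deeper the hierarchy, the more this looks like optimizing one
        # aggregated objective rather than firing isolated heuristics.
        return max(actions, key=self.score)


# Hypothetical example: moral shards aggregating into a more utilitarian subagent.
care = Heuristic("care-for-others", lambda a: 1.0 if "help" in a else 0.0)
honesty = Heuristic("honesty", lambda a: -1.0 if "lie" in a else 0.0)
proto_utilitarian = Subagent("proto-utilitarian", [care, honesty])

print(proto_utilitarian.choose(["lie to spare their feelings", "help and tell the truth"]))
# -> "help and tell the truth"
```

Obviously nothing this simple captures what’s actually going on in neural networks; the point is just that “collection of heuristics” and “goal-directed agent” sit on one continuum, and the interesting question is what happens as you move along it.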
I think there’s a bunch of useful stuff here. In particular, I think that decisions driven by deep-rooted fear are often very counterproductive, and that many rationalists often have “emergency mobilization systems” running in ways which aren’t conducive to good long-term decision-making. I also think that paying attention to bodily responses is a great tool for helping fix this (and in fact was helpful for me in defusing annoyance when reading this post). But I want to push back on the way in which it’s framed in various places as an all-or-nothing: exit the game, or keep playing. Get sober, or stay drunk. Hallucination, not real fear.
In fact, you can do good and important work while also gradually coming to terms with your emotions, trying to get more grounded, and noticing when you’re making decisions driven by visceral fear and taking steps to fix that. Indeed, I expect that almost all good and important work throughout history has been done by people who are at various stages throughout that process, rather than people who first dealt with their traumas and only then turned to the work. (EDIT: in a later comment, Valentine says he doesn’t endorse the claim that people should deal with traumas before doing the work, but does endorse the claim that people should recognize the illusion before doing the work. So better to focus on the latter (I disagree with both).)
(This seems more true for concrete research, and somewhat (but less) true for thinking about high-level strategy. In general it seems that rationalists spend way too much of their time thinking about high-level strategic considerations, and I agree with some of Valentine’s reasoning about why this happens. Instead I’d endorse people trying to be much more focused on making progress in a few concrete areas, rather than trying to track everything which they think might be relevant to AI risk. E.g. acceleration is probably bad, but it’s fundamentally a second-order effect, and the energy focused on all but the biggest individual instances of acceleration would probably be better used to focus on first-order effects.)
In other words, I want to offer people the affordance to take on board the (many) useful parts of Valentine’s post without needing to buy into the overall frame in which your current concerns are just a game, and your fear is just a manifestation of trauma.
(Relatedly, from my vantage point it seems that “you need to do the trauma processing first and only then do useful work” is a harmful self-propagating meme in a very similar way as “you need to track and control every variable in order for AI to go well”. Both identify a single dominant consideration which requires your full focus and takes precedence over all others. However, I still think that the former is directionally correct for most rationalists, just as the latter is directionally correct for most non-rationalists.)