Rob Bensinger

Karma: 21,197

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer’s.

Rob Bensinger 8 Jun 2022 2:50 UTC
8 points
in reply to: sullyj3’s comment on: AGI Ruin: A List of Lethalities
If you’re trying to persuade smart programmers who are somewhat wary of sci-fi stuff, and you think nanotech is likely to play a major role in AGI strategy, but you think it isn’t strictly necessary for the current argument you’re making, then my default advice would be:
- Be friendly and patient; get curious about the other person’s perspective, and ask questions to try to understand where they’re coming from; and put effort into showing your work and providing indicators that you’re a reasonable sort of person.
- Wear your weird beliefs on your sleeve; be open about them, and if you want to acknowledge that they sound weird, feel free to do so. At least mention nanotech, even if you choose not to focus on it because it’s not strictly necessary for the argument at hand, it comes with a larger inferential gap, etc.

Rob Bensinger 17 Oct 2021 22:23 UTC
165 points
on: My experience at and around MIRI and CFAR (inspired by Zoe Curzi’s writeup of experiences at Leverage)
Kate Donovan messaged me to say:
I think four people experiencing psychosis in a period of five years, in a community this large with high rates of autism and drug use, is shockingly low relative to base rates.
[...]
A fast pass suggests that my 1-3% for lifetime prevalence was right, but mostly appearing at 15-35.
And since we have conservatively 500 people in the cluster (a lot more people than that attended CFAR workshops or are in MIRI or CFAR’s orbit), 4 is low, especially given that I suspect the cluster is larger and I am pretty sure my numbers don’t include drug-induced psychosis, just primary psychosis.
The base rate seems important to take into account here, though per Jessica, “Obviously, for every case of poor mental health that ‘blows up’ and is noted, there are many cases that aren’t.” (But I’d guess that’s true for the base-rate stats too?)
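The arithmetic behind this base-rate comparison can be sketched in a few lines of Python. This is only a rough illustration using the quoted figures (about 500 people, 1-3% lifetime prevalence, 4 observed cases). The function name `expected_cases` is hypothetical, and the sketch ignores the difference between lifetime prevalence and a five-year window, so it is a back-of-the-envelope check rather than a proper estimate.

```python
def expected_cases(population: int, prevalence: float) -> float:
    """Expected number of affected people for a given population and prevalence."""
    return population * prevalence


# Figures taken from the quoted message; everything else is illustrative.
cluster_size = 500                   # "conservatively 500 people in the cluster"
prevalence_range = (0.01, 0.03)      # quoted 1-3% lifetime prevalence
observed = 4                         # cases mentioned in the discussion

expected_low = expected_cases(cluster_size, prevalence_range[0])
expected_high = expected_cases(cluster_size, prevalence_range[1])

print(f"Expected lifetime cases: {expected_low:.0f} to {expected_high:.0f}")
print(f"Observed cases: {observed}")
print(f"Observed is below the low estimate: {observed < expected_low}")
```

Under these assumptions the observed count sits below the low end of the expected lifetime range, which is the comparison the quoted message is making. A more careful version would convert lifetime prevalence into an incidence rate for the relevant age range and time window.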

Rob Bensinger 18 Oct 2021 1:55 UTC
148 points
in reply to: Eliezer Yudkowsky’s comment on: My experience at and around MIRI and CFAR (inspired by Zoe Curzi’s writeup of experiences at Leverage)
Copying over a related Oct. 13-17 conversation from Facebook:
(context: someone posted a dating ad in a rationalist space where they said they like tarot etc., and rationalists objected)
_____________________________________________
Marie La: As a cultural side note, most of my woo knowledge (like how to read tarot) has come from the rationalist community, and I wouldn’t have learned it otherwise
_____________________________________________
Eliezer Yudkowsky: @Marie La Any ideas how we can stop that?
(+1 from Rob B)
_____________________________________________
Marie La: Idk, it’s an introspective technique that works for some people. Doesn’t particularly work for me. Sounds like the concern is bad optics / PR rather than efficacy
(+1 from Rob B)
_____________________________________________
Shaked Koplewitz: @Marie La optics implies that the concern is with the impression it makes on outsiders, my concern here is the effect on insiders (arguably this is optics too, but a non-central example)
_____________________________________________
Rob Bensinger: If the concern is optics, either to insiders or outsiders, then it seems vastly weaker to me than if the concern is about epistemic methods or conclusions. (Indeed, it might flip the sign for me.)
The rationality community should be about truth and winning, not about linking ourselves up to whatever is culturally associated with the word “rationality”.
The first-order argument for trying weird things and seeing if they work (or just doing them for fun as a sort of game, etc.) makes sense. I’d rather focus on the question of whether that first-order case fails epistemically and/or instrumentally. Also: what does it fail for? Just saying ‘woo’ doesn’t tell us whether, e.g., we should stop using IFS because it isn’t normal-sounding enough.
(+1 from Marie L)
_____________________________________________
Marie La: When we come across someone using weird mind trick X, we should figure out what it does and if we want the results. Being skilled at sorting out good weird mind tricks from bad, regardless of cultural coding, feels like an important rationalist skill.
Tarot is a set of fancy art cards that can be used in many ways, some that encourage magical thinking and some that provide useful introspective access
For the latter, I’d guess it’s somewhat useful to some people, similar to the skill of flipping a coin and doing what you want anyway
(+1 from Rob B)
_____________________________________________
Eliezer Yudkowsky: @Marie La I disagree and think the woo has proven in empirical practice to be sufficiently destructive to people who can’t see the destruction, to reach a level where it should not be tolerated by this group as a future subgroup norm, same as LSD use shouldn’t be tolerated by us as a subgroup norm.
(+1 from Rob B, Marie L)
_____________________________________________
Marie La: I’m interested in seeing more of your reasoning on this. Pointing out the harm model sounds useful to people who can’t easily see it (or to the people around them) to help avoid further harm in the future
(+1 from Rob B)
_____________________________________________
Eliezer Yudkowsky: The #1 reason why I think it’s harmful isn’t a theory by which I divined it in advance. Though there sure is a very obvious theory whereby the path of sanity is a narrow one and people who step a bit off it in what they fondly imagine to be a controlled way, fall quite a lot further once they’re hanging around with crazier people, crazier ideas, and have already decided to let themselves go a little.
The #1 reason why I think it’s harmful is the number of times you hear about somebody, or worse, some subgroup, that pushed a little woo on somebody, or offered them some psychedelics, and a few years later you’re hearing about how far they went off the deep end. It seems to be destructive in practice and that’s a far stronger reason to be wary than the obvious-seeming ways it could be destructive in principle.
(+1 from Anonymous, Rob B)
_____________________________________________
Anonymous: Agree with this but surely it also matters how often that action seems to have that effect *out of the times it’s done* - noting this because my impression (which however I don’t have data for) is that psychedelic use might be locally common enough that it’s only a small proportion of “rationalists who try LSD” who end up “going off the deep end”. Whereas experimenting with woo-y beliefs seems more strongly associated with that kind of trajectory and for that I endorse your conclusion.
(+1 from Eliezer Y)
_____________________________________________
Aella: @Eliezer Yudkowsky On phone so thumb words but I notice I have a belief that this is predictable, and thus not dangerous? or rather, it’s something like if you’re religious and noticed some ppl have been drinking alcohol and then eventually losing their faith, you might be right to be wary of alcohol, but if you know that it’s actually the doubt of their faith that *causes* the alcohol drinking, then you wouldn’t be concerned if someone drinks alcohol but also isn’t doubting their faith.
similarly I have some intuition here that the woo stuff is a symptom and not the cause, and that it’s very possible to engage harmlessly with the symptom alone, and is a fine social norm if people can distinguish the alcoholism from doubting your faith.
I do agree seeing woo belief does up my probability they might end up going off the deep end tho
(+1 from Rob B, Eliezer Y)
_____________________________________________
Eliezer Yudkowsky: My sense of “this seems to be ending very poorly on average” is much stronger for situations in which there’s a Leader or a Discernible Subgroup has formed, that are going up to others and saying “why, you really should try some psychedelics / woo”. Or where they wander up to individuals trying that, and put their arm around their shoulders all friendly-like.
Though I suppose that could also be because I’m much less likely to hear about the individual non-social cases even if they end poorly. And indeed, my sense of the individual cases, is that I have heard of a lot more individuals who took psychedelics a few times in a situation devoid of Leaders and Subgroups and nothing bad happened to them; compared to the case with woo, where it feels like I’m more likely to have heard that an individual who tried woo even in a situation without Leaders or Subgroups later went further off the deep end. Which has the very obvious explanation that some people ever do benefit from psychedelics and they are plausibly interesting to a healthy mind willing to risk itself, you can be sane and still try shrooms ever; while the woo thing requires a larger, more willing step off the strict rails of sanity.
But in the case of Subgroups or Leaders, neither woo nor psychedelics seems to end well.
And to be clear that this isn’t just disguised “boo subgroups and leaders”, let’s be clear that, say, OpenPhil is for these purposes a Subgroup and Holden Karnofsky is a Leader, as are MIRI and myself; they are just Subgroups and Leaders which have not, to my knowledge, ever advocated LSD or tarot readings.
(+1 from Rob B, Anonymous, Marie L)
_____________________________________________
Marie La: @Eliezer Yudkowsky Strong agree that psychedelics + a strong leader context can enable great harm quickly.
It tends to make people vulnerable, and if there’s a bad actor around this can be dangerous. This has been plenty weaponized for controlling people in cult-like groups.
I’d expand this to include most drugs, but especially classical psychs and mdma.
Here I’d blame the leader for seeking out vulnerable psychedelic users or encouraging people under them to use psychedelics, rather than the drugs themselves
(I can’t say much about weaponized woo, don’t know what that looks like as much)
(+1 from Anonymous)
_____________________________________________
Rob Bensinger: https://slatestarcodex.com/2019/09/10/ssc-journal-club-relaxed-beliefs-under-psychedelics-and-the-anarchic-brain suggests that psychedelics permanently “relax” people’s perceptual and epistemic priors, and that this is maybe why they can cause hallucinogen persisting perceptual disorder and crazy beliefs.
This seems maybe much worse for people whose starting priors are quite good. If your life is a wreck and you’re terrible at figuring out what’s true, then yeah, shaking things up might be great. But if rationalists are selected for having unusually *good* priors, then shaking things up will cause regression to the mean.
Cf. Jim Babcock’s argument that you shouldn’t be experimenting with new diets if you’re already an unusually awesome, productive, etc. person, since then you risk breaking a complex system that’s working well. It’s the people whose status quo sucks who should be experimenting.
(+1 from Jim B, Eliezer Y, Anonymous, Marie L)
_____________________________________________
Jim Babcock: My own impression is that the effect of LSD is not primarily a regression to the mean thing, but rather, that it temporarily enables some self-modification capabilities, which can be powerfully positive but which require a high degree of sanity and care to operate safely. When I see other people using psychedelics, very often I see them acting like Harry Potter experimenting with transfiguration, or worse, treating it as *entertainment*. And I want to yell at them, and point them at the scene in Dumbledore’s laboratory where Dumbledore and Minerva go through a checklist and have a levels-of-precaution framework and have a step where they actually stop and think before they begin.
(+1 from Rob B, Marie L)
_____________________________________________
Eliezer Yudkowsky: I worry that we’re shooting ourselves in the foot by telling ourselves that psychedelics “temporarily enable some self-modification capabilities” rather than doing shit to the brain that we don’t understand, and we know a bunch of people who seemed a lot more promising and sane before they did some psychedelics, and now they’re not the people they were anymore and not in a good way, and there is not in fact any good way to be sure of who that happens to because it did not seem very predictable in advance at the time, and maybe you can roll the dice on that if you’re tired of being yourself and want to take a bet with high variance and negative expected value, but you sure don’t do it in little subgroups that put an arm around somebody’s shoulder and make helpful offers.
(+1 from Rob B, Marie L, Jim B)
_____________________________________________
Jim Babcock: I suspect you may be underestimating the base rate of people using psychedelics discreetly and having a neutral or mild-positive effect that they don’t talk about, and also underestimating the degree of stupidity that people bring to bear in their drug use.
At Burning Man, I saw a lot of stuff like: making dosing and mixing decisions while not sober; taking doses without a scale, not being justifiably confident that the dose is the right order of magnitude; taking their epistemically-untrustworthy friend’s word that a substance they’ve never heard of before is safe for them. That sort of thing. And that’s before even getting into the social stuff, where eg people really shouldn’t be having conversations about crazymaking topics while high. And the legitimately hard stuff, like it’s important to have a tripsitter who will break thought loops if you step on a trauma trigger, but also important that that tripsitter not be a significant other or cult leader and not be someone who will talk to you about crazymaking topics.
Meanwhile nearly everyone has been exposed to extremely unsubtle and substantially false anti-drug propaganda, which fails to survive contact with reality. So it’s unfortunate but also unsurprising that the how-much-caution pendulum in their heads winds up swinging too far to the other side. The ideal messaging imo would leave most people feeling like planning an acid trip is more work than they personally will get around to, plus mild disdain towards impulsive usage and corner-cutting.
(+1 from Rob B, Marie L)

Rob Bensinger 2 Apr 2022 7:31 UTC
125 points
in reply to: Alex Vermillion’s comment on: MIRI announces new “Death With Dignity” strategy
Yeah—I love AI_WAIFU’s comment, but I love the OP too.
To some extent I think these are just different strategies that will work better for different people; both have failure modes, and Eliezer is trying to guard against the failure modes of ‘Fuck That Noise’ (e.g., losing sight of reality), while AI_WAIFU is trying to guard against the failure modes of ‘Try To Die With More Dignity’ (e.g., losing motivation).
My general recommendation to people would be to try different framings / attitudes out and use the ones that empirically work for them personally, rather than trying to have the same lens as everyone else. I’m generally a skeptic of advice, because I think people vary a lot; so I endorse the meta-advice that you should be very picky about which advice you accept, and keep in mind that you’re the world’s leading expert on yourself. (Or at least, you’re in the best position to be that thing.)
Cf. ‘Detach the Grim-o-Meter’ versus ‘Try to Feel the Emotions that Match Reality’. Both are good advice in some contexts, for some people; but I think there’s some risk from taking either strategy too far, especially if you aren’t aware of the other strategy as a viable option.
What links here?
- Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023) by Chris Scammell (10 May 2023 19:04 UTC; 251 points)

Rob Bensinger 2 Apr 2022 8:13 UTC
112 points
in reply to: Amelia Bedelia’s comment on: MIRI announces new “Death With Dignity” strategy
I may be misunderstanding, but I interpreted Eliezer as drawing this contrast:
- Good Strategy: Try to build maximally accurate models of the real world (even though things currently look bad), while looking for new ideas or developments that could save the world. Ideally, the ideas the field puts a lot of energy into should be ones that already seem likely to work, or that seem likely to work under a wide range of disjunctive scenarios. (Failing that, they at least shouldn’t require multiple miracles, and should lean on a miracle that’s unusually likely.)
- Bad Strategy: Reason “If things are as they appear, then we’re screwed anyway; so it’s open season on adopting optimistic beliefs.” Freely and casually adopt multiple assumptions based on wishful thinking, and spend your mental energy thinking about hypothetical worlds where things go unusually well in specific ways you’re hoping they might (even though, stepping back, you wouldn’t have actually bet on those optimistic assumptions being true).
What links here?
- Rob Bensinger's comment on MIRI announces new “Death With Dignity” strategy by Eliezer Yudkowsky (28 May 2022 22:33 UTC; 9 points)

Rob Bensinger 12 Nov 2021 19:13 UTC
LW: 85 AF: 35
in reply to: Rob Bensinger’s comment on: Discussion with Eliezer Yudkowsky on AGI interventions
Also, I feel like I want to emphasize that, like… it’s OK to believe that the field you’re working in is in a bad state? The social pressure against saying that kind of thing (or even thinking it to yourself) is part of why a lot of scientific fields are unhealthy, IMO. I’m in favor of you not taking for granted that Eliezer’s right, and pushing back insofar as your models disagree with his. But I want to advocate against:
- Saying false things about what the other person is saying. A lot of what you’ve said about Eliezer and MIRI is just obviously false (e.g., we have contempt for “experimental work” and think you can’t make progress by “Actually working with AIs and Thinking about real AIs”).
- Shrinking the window of ‘socially acceptable things to say about the field as a whole’ (as opposed to unsolicited harsh put-downs of a particular researcher’s work, where I see more value in being cautious).
I want to advocate ‘smack-talking the field is fine, if that’s your honest view; and pushing back is fine, if you disagree with the view’. I want to see more pushing back on the object level (insofar as people disagree), and less ‘how dare you say that, do you think you’re the king of alignment or something’ or ‘saying that will have bad social consequences’.
I think you’re picking up on a real thing of ‘a lot of people are too deferential to various community leaders, when they should be doing more model-building, asking questions, pushing back where they disagree, etc.’ But I think the solution is to shift more of the conversation to object-level argument (that is, modeling the desired behavior), and make that argument as high-quality as possible.

Rob Bensinger 31 May 2022 21:00 UTC
LW: 83 AF: 24
in reply to: Chris Fong’s comment on: Six Dimensions of Operational Adequacy in AGI Projects
Discouraged. Eliezer and Nate feel that their past alignment research efforts failed, and they don’t currently know of a new research direction that feels promising enough that they want to focus their own time on advancing it, or make it MIRI’s organizational focus.
I do think ‘trying to directly solve the alignment problem’ is the most useful thing the world can be doing right now, even if it’s not Eliezer or Nate’s comparative advantage right now. A good way to end up with a research direction EY or Nate are excited by, IMO, is for hundreds of people to try hundreds of different angles of attack and see if any bear fruit. Then a big chunk of the field can pivot to whichever niche approach bore the most fruit.
From MIRI’s perspective, the hard part is that:
- (a) we don’t know in advance which directions will bear fruit, so we need a bunch of people to go make unlikely bets so we can find out;
- (b) there currently aren’t that many people trying to solve the alignment problem at all; and
- (c) nearly all of the people trying to solve the problem full-time are adopting unrealistic optimistic assumptions about things like ‘will alignment generalize as well as capabilities?’ and ‘will the first pivotal AI systems be safe-by-default?’, in such a way that their research can’t be useful if we’re in the mainline-probability world.
What I’d like to see instead is more alignment research, and especially research of the form “this particular direction seems unlikely to succeed, but if it succeeds then it will in fact help a lot in mainline reality”, as opposed to directions that (say) seem a bit likelier to succeed but won’t actually help in the mainline world.
(In principle you want nonzero effort going into both approaches, but right now the field is almost entirely in the second camp, from MIRI’s perspective. And making a habit of assuming your way out of mainline reality is risky business, and outright dooms your research once you start freely making multiple such assumptions.)

Rob Bensinger 7 Apr 2022 1:21 UTC
82 points
in reply to: EdricBroadhurst’s comment on: MIRI announces new “Death With Dignity” strategy
(I ran this comment by Eliezer and Nate and they endorsed it.)
My model is that the post accurately and honestly represents Eliezer’s epistemic state (‘I feel super doomy about AI x-risk’), and a mindset that he’s found relatively motivating given that epistemic state (‘incrementally improve our success probability, without getting emotionally attached to the idea that these incremental gains will result in a high absolute success probability’), and is an honest suggestion that the larger community (insofar as it shares his pessimism) adopt the same framing for the sake of guarding against self-deception and motivated reasoning.
The parts of the post that are an April Fool’s Joke, AFAIK, are the title of the post, and the answer to Q6. The answer to Q6 is a joke because it’s sort-of-pretending the rest of the post is an April Fool’s joke. The title is a joke because “X’s new organizational strategy is ‘death with dignity’” sounds sort of inherently comical, and doesn’t really make sense (how is that a “strategy”? believing p(doom) is high isn’t a strategy, and adopting a specific mental framing device isn’t really a “strategy” either). (I’m even more confused by how this could be MIRI’s “policy”.)
In case it clarifies anything, here are some possible interpretations of ‘MIRI’s new strategy is “Death with Dignity”’, plus a crisp statement of whether the thing is true or false:
- A plurality of MIRI’s research leadership, adjusted for org decision-making weight, thinks humanity’s success probability is very low, and will (continue to) make org decisions accordingly. — True, though:
  - Practically speaking, I don’t think this is wildly different from a lot of MIRI’s past history. E.g., Nate’s stated view in 2014 (assuming FT’s paraphrase is accurate), before he became ED, was “there is only a 5 per cent chance of programming sufficient safeguards into advanced AI”.
    (Though I think there was at least one period of time in the intervening years where Nate had double-digit success probabilities for humanity — after the Puerto Rico conference and associated conversations, where he was impressed by the spirit of cooperation and understanding present and by how on-the-ball some key actors looked. He tells me that he later updated back downwards when the political situation degraded, and separately when he concluded the people in question weren’t that on-the-ball after all.)
  - MIRI is strongly in favor of its researchers building their own models and doing the work that makes sense to them; individual MIRI researchers’ choices of direction don’t require sign-off from Eliezer or Nate.
  - I don’t know exactly why Eliezer wrote a post like this now, but I’d guess the largest factors are roughly (1) that Eliezer and Nate have incrementally updated over the years from ‘really quite gloomy’ to ‘even gloomier’, (2) that they’re less confident about what object-level actions would currently best reduce p(doom), and (3) that as a consequence, they’ve updated a lot toward existential wins being likelier if the larger community moves toward having much more candid and honest conversations, and generally produces more people who are thinking exceptionally clearly about the problem.
- Everyone on MIRI’s research team thinks our success probability is extremely low (say, below 5%). — False, based on a survey I ran a year ago. Only five MIRI researchers responded, so the sample might skew much more negative or positive than the overall distribution of views at MIRI; but MIRI responses to Q2 were (66%, 70%, 70%, 96%, 98%). I also don’t think the range of views has changed a ton in the intervening year.
- MIRI will require (of present and/or future research staff) that they think in terms of “death with dignity”. — False, both in that MIRI isn’t in the business of dictating researchers’ P(doom) and in that MIRI isn’t in the business of dictating researchers’ motivational tools or framing devices.
- MIRI has decided to give up on reducing existential risk from AI. — False, obviously.
- MIRI is “locking in” pessimism as a core part of its org identity, such that it refuses to update toward optimism if the situation starts looking better. — False, obviously.
Other than the two tongue-in-cheek parts, AFAIK the post is just honestly stating Eliezer’s views, without any more hyperbole than a typical Eliezer post would have. E.g., the post is not “a preview of what might be needful to say later, if matters really do get that desperate”. Some parts of the post aren’t strictly literal (e.g., “0% probability”), but that’s because all of Eliezer’s posts are pretty colloquial, not because of a special feature of this post.

Rob Bensinger 12 Nov 2021 19:06 UTC
LW: 81 AF: 32
in reply to: adamShimi’s comment on: Discussion with Eliezer Yudkowsky on AGI interventions
Thanks for naming specific work you think is really good! I think it’s pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn’t cruxy (because there’s a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it’s super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.
John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn similar abstractions that humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.
Eliezer liked this post enough that he asked me to signal-boost it in the MIRI Newsletter back in April.
And Paul Christiano and Stuart Armstrong are two of the people Eliezer named as doing very-unusually good work. We continue to pay Stuart to support his research, though he’s mainly supported by FHI.
And Evan works at MIRI, which provides some Bayesian evidence about how much we tend to like his stuff. :)
So maybe there’s not much disagreement here about what’s relatively good? (Or maybe you’re deliberately picking examples you think should be ‘easy sells’ to Steel Eliezer.)
The main disagreement, of course, is about how absolutely promising this kind of stuff is, not how relatively promising it is. This could be some of the best stuff out there, but my understanding of the Adam/Eliezer disagreement is that it’s about ‘how much does this move the dial on actually saving the world?’ / ‘how much would we move the dial if we just kept doing more stuff like this?’.
Actually, this feels to me like a thing that your comments have bounced off of a bit. From my perspective, Eliezer’s statement was mostly saying ‘the field as a whole is failing at our mission of preventing human extinction; I can name a few tiny tidbits of relatively cool things (not just MIRI stuff, but Olah and Christiano), but the important thing is that in absolute terms the whole thing is not getting us to the world where we actually align the first AGI systems’.
My Eliezer-model thinks nothing (including MIRI stuff) has moved the dial much, relative to the size of the challenge. But your comments have mostly been about a sort of status competition between decision theory stuff and ML stuff, between prosaic stuff and ‘gain new insights into intelligence’ stuff, between MIRI stuff and non-MIRI stuff, etc. This feels to me like it’s ignoring the big central point (‘our work so far is wildly insufficient’) in order to haggle over the exact ordering of the wildly-insufficient things.
You’re zeroed in on the “vast desert” part, but the central point wasn’t about the desert-oasis contrast, it was that the whole thing is (on Eliezer’s model) inadequate to the task at hand. Likewise, you’re talking a lot about the “fake” part (and misstating Eliezer’s view as “everyone else [is] a faker”), when the actual claim was about “work that seems to me to be mostly fake or pointless or predictable” (emphasis added).
Maybe to you these feel similar, because they’re all just different put-downs. But… if those were true descriptions of things about the field, they would imply very different things.
I would like to put forward that Eliezer thinks, in good faith, that this is the best hypothesis that fits the data. I absolutely think reasonable people can disagree with Eliezer on this, and I don’t think we need to posit any bad faith or personality failings to explain why people would disagree.

Rob Bensinger 10 Dec 2021 11:41 UTC
LW: 80 AF: 35
in reply to: davidad’s comment on: Biology-Inspired AGI Timelines: The Trick That Never Works
Making a map of your map is another one of those techniques that seem to provide more grounding but do not actually.
Sounds to me like one of the things Eliezer is pointing at in Hero Licensing:
Look, thinking things like that is just not how the inside of my head is organized. There’s just the book I have in my head and the question of whether I can translate that image into reality. My mental world is about the book, not about me.
You do want to train your brain, and you want to understand your strengths and weaknesses. But dwelling on your biases at the expense of the object level isn’t actually usually the best way to give your brain training data and tweak its performance.
I think there’s a lesson here that, e.g., Scott Alexander hadn’t fully internalized as of his 2017 Inadequate Equilibria review. There’s a temptation to “go meta” and find some cleaner, more principled, more objective-sounding algorithm to follow than just “learn lots and lots of object-level facts so you can keep refining your model, learn some facts about your brain too so you can know how much to trust it in different domains, and just keep doing that”.
But in fact there’s no a priori reason to expect there to be a shortcut that lets you skip the messy unprincipled your-own-perspective-privileging Bayesian Updating thing. Going meta is just a tool in the toolbox, and it’s risky to privilege it on ‘sounds more objective/principled’ grounds when there’s neither a theoretical argument nor an empirical-track-record argument for expecting that approach to actually work.
Teaching the low-description-length principles of probability to your actual map-updating system is much more feasible (or at least more cost-effective) than emitting your actual map into a computationally realizable statistical model.
I think this is a good distillation of Eliezer’s view (though I know you’re just espousing your own view here). And of mine, for that matter. Quoting Hero Licensing again:
STRANGER: I believe the technical term for the methodology is “pulling numbers out of your ass.” It’s important to practice calibrating your ass numbers on cases where you’ll learn the correct answer shortly afterward. It’s also important that you learn the limits of ass numbers, and don’t make unrealistic demands on them by assigning multiple ass numbers to complicated conditional events.
ELIEZER: I’d say I reached the estimate… by thinking about the object-level problem? By using my domain knowledge? By having already thought a lot about the problem so as to load many relevant aspects into my mind, then consulting my mind’s native-format probability judgment—with some prior practice at betting having already taught me a little about how to translate those native representations of uncertainty into 9:1 betting odds.
One framing I use is that there are two basic perspectives on rationality:
- Prosthesis: Human brains are naturally bad at rationality, so we can identify external tools (and cognitive tech that’s too simple and straightforward for us to misuse) and try to offload as much of our reasoning as possible onto those tools, so as to not have to put weight down (beyond the bare minimum necessary) on our own fallible judgment.
- Strength training: There’s a sense in which every human has a small AGI (or a bunch of AGIs) inside their brain. If we didn’t have access to such capabilities, we wouldn’t be able to do complicated ‘planning and steering of the world into future states’ at all.
  
  It’s true that humans often behave ‘irrationally’, in the sense that we output actions based on simpler algorithms (e.g., reinforced habits and reflex behavior) that aren’t doing the world-modeling or future-steering thing. But if we want to do better, we mostly shouldn’t be leaning on weak reasoning tools like pocket calculators; we should be focusing our efforts on more reliably using (and providing better training data to) the AGI inside our brains. Nearly all of the action (especially in hard foresight-demanding domains like AI alignment) is in improving your inner AGI’s judgment, intuitions, etc., not in outsourcing to things that are way less smart than an AGI.
In practice, of course, you should do some combination of the two. But I think a lot of the disagreements MIRI folks have with other people in the existential risk ecosystem are related to us falling on different parts of the prosthesis-to-strength-training spectrum.
Techniques that give the illusion of objectivity are usually not useless. But to use them effectively, you have to see through the illusion of objectivity, and treat their outputs as observations of what those techniques output, rather than as glimpses at the light of objective reasonableness.
Strong agreement. I think this is very well-put.

Rob Bensinger 8 Jun 2022 7:53 UTC
LW: 79 AF: 29
11
on: AGI Ruin: A List of Lethalities
On Twitter, Eric Rogstad wrote:
“the thing where it keeps being literally him doing this stuff is quite a bad sign”
I’m a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...
1. I parse the original as, “a collection of EY’s thoughts on why safe AI is hard”. It’s EY’s thoughts, why would someone else (other than @robbensinger) write a collection of EY’s thoughts?
(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or …?)
2. Was there anything new in this doc? It’s prob useful to collect all in one place, but we don’t ask, “why did no one else write this” for every bit of useful writing out there, right?
Why was it so overwhelmingly important that someone write this summary at this time, that we’re at all scratching our heads about why no one else did it?
Copying over my reply to Eric:
My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:
1. The list is true, to the best of my knowledge, and the details actually matter.
  
  Many civilizations try to make a canonical list like this in 1980 and end up dying where they would have lived just because they left off one item, or under-weighted the importance of the last three sentences of another item, or included ten distracting less-important items.
2. There are probably not many civilizations that wait until 2022 to make this list, and yet survive.
3. It’s true that many of the points in the list have been made before. But it’s very doomy that they were made by me.
4. Nearly all of the field’s active alignment research is predicated on a false assumption that’s contradicted by one of the items in sections A or B. If the field had recognized everything in A and B sooner, we could have put our recent years of effort into work that might actually help on the mainline, as opposed to work that just hopes a core difficulty won’t manifest and has no Plan B for what to do when reality says “no, we’re on the mainline”.
So the answer to ‘Why would someone else write EY’s thoughts?’ is ‘It has nothing to do with an individual’s thoughts; it’s about civilizations needing a very solid and detailed understanding of what’s true on these fronts, or they die’.
Re “(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or …?)”:
The point is not ‘humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people’. The point is ‘humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it’.
If it helps, imagine that humanity invents AGI tomorrow and has to actually go align it now. In that situation, you need to actually be able to do all the requisite work, not just be able to write essays that would make a debate judge go ‘ah yes, well argued.’
When you imagine having water cooler arguments about the importance of AI alignment work, then sure, it’s no big deal if you got a few of the details wrong.
When you imagine actually trying to build aligned AGI the day after tomorrow, I think it comes much more into relief why it matters to get those details right, when the “details” are as core and general as this.
I think that this is a really good exercise that more people should try. Imagine that you’re running a project yourself that’s developing AGI first, in real life. Imagine that you are personally responsible for figuring out how to make the thing go well. Yes, maybe you’re not the perfect person for the job; that’s a sunk cost. Just think about what specific things you would actually do to make things go well, what things you’d want to do to prepare 2 years or 6 years in advance, etc.
Try to think your way into near-mode with regard to AGI development, without thereby assuming (without justification) that it must all be very normal just because it’s near. Be able to visualize it near-mode and weird/novel. If it helps, start by trying to adopt a near-mode, pragmatic, gearsy mindset toward the weirdest realistic/plausible hypothesis first, then progress to the less-weird possibilities.
I think there’s a tendency for EAs and rationalists to instead fall into one of these two mindsets with regard to AGI development, pivotal acts, etc.:
1. Fun Thought Experiment Mindset. On this mindset, pivotal acts, alignment, etc. are mentally categorized as a sort of game, a cute intellectual puzzle or a neat thing to chat about.
  
  This is mostly a good mindset, IMO, because it makes it easy to freely explore ideas, attend to the logical structure of arguments, brainstorm, focus on gears, etc.
  
  Its main defect is a lack of rigor and a more general lack of drive: because on some level you’re not taking the question seriously, you’re easily distracted by fun, cute, or elegant lines of thought, and you won’t necessarily push yourself to red-team proposals, spontaneously take into account other pragmatic facts/constraints you’re aware of from outside the current conversational locus, etc. The whole exercise sort of floats in a fantasy bubble, rather than being a thing people bring their full knowledge, mental firepower, and lucidity/rationality to bear on.
2. Serious Respectable Person Mindset. Alternatively, when EAs and rationalists do start taking this stuff seriously, I think they tend to sort of turn off the natural flexibility, freeness, and object-levelness of their thinking, and let their mind go to a very fearful or far-mode place. The world’s gears become a lot less salient, and “Is it OK to say/think that?” becomes a more dominant driver of thought.
  
  Example: In Fun Thought Experiment Mindset, IME, it’s easier to think about governments in a reductionist and unsentimental way, as specific messy groups of people with specific institutional dysfunctions, psychological hang-ups, etc. In Serious Respectable Person Mindset, there’s more of a temptation to go far-mode, glom on to happy-sounding narratives and scenarios, or even just resist the push to concretely visualize the future at all—thinking instead in terms of abstract labels and normal-sounding platitudes.
The very fact that EA and rationalism mostly managed to avert their gaze from the concept of “pivotal acts” for years is, in my opinion, an example of how these two mindsets often fail.
“In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world” is a thing I think most LWers and many longtermist EAs would have considered obvious. As a community, however, we mostly managed to just-not-think the obvious next thought, “In order to prevent the world’s destruction in this scenario, one of the first AGI groups needs to find some fast way to prevent the proliferation of AGI.”
Fun Thought Experiment Mindset, I think, encouraged this mental avoidance because it thought of AGI alignment (to some extent) as a fun game in the genre of “math puzzle” or “science fiction scenario”, not as a pragmatic, real-world dilemma we actually have to solve, taking into account all of our real-world knowledge and specific facts on the ground. The ‘rules of the game’, many people apparently felt, were to think about certain specific parts of the action chain leading up to an awesome future lightcone, rather than taking ownership of the entire problem and trying to figure out what humanity should in-real-life do, start to finish.
(What primarily makes this weird is that many alignment questions crucially hinge on ‘what task are we aligning the AGI on?’. These are not remotely orthogonal topics.)
Serious Respectable Person Mindset, I think, encouraged this mental avoidance more actively, because pivotal acts are a weird and scary-sounding idea once you leave ‘these are just fun thought experiments’ land.
What I’d like to see instead is something like Weirdness-Tolerant Project Leader Mindset, or Thought Experiments Plus Actual Rigor And Pragmatism And Drive Mindset, or something.
I think a lot of the confusion around EY’s post comes from the difference between thinking of these posts (on some level) as fun debate fodder or persuasion/outreach tools, versus attending to the fact that humanity has to actually align AGI systems if we’re going to make it out of this problem, and this is an attempt by humanity to distill where we’re currently at, so we can actually proceed to go solve alignment right now and save the world.
Imagine that this is v0 of a series of documents that need to evolve into humanity’s (/ some specific group’s) actual business plan for saving the world. The details really, really matter. Understanding the shape of the problem really matters, because we need to engineer a solution, not just ‘persuade people to care about AI risk’.
If you disagree with the OP… that’s pretty important! Share your thoughts. If you agree, that’s important to know too, so we can prioritize some disagreements over others and zero in on critical next actions. There’s a mindset here that I think is important, that isn’t about “agree with Eliezer on arbitrary topics” or “stop thinking laterally”; it’s about approaching the problem seriously, neither falling into despair nor wishful thinking, neither far-mode nor forced normality, neither impracticality nor propriety.

Rob Bensinger 2 Apr 2022 18:33 UTC
74 points
in reply to: Alex_Altair’s comment on: MIRI announces new “Death With Dignity” strategy
I primarily upvoted it because I like the push to ‘just candidly talk about your models of stuff’:
I think we die with slightly more dignity—come closer to surviving, as we die—if we are allowed to talk about these matters plainly. Even given that people may then do unhelpful things, after being driven mad by overhearing sane conversations. I think we die with more dignity that way, than if we go down silent and frozen and never talking about our impending death for fear of being overheard by people less sane than ourselves.
I think that in the last surviving possible worlds with any significant shred of subjective probability, people survived in part because they talked about it; even if that meant other people, the story’s antagonists, might possibly hypothetically panic.
Also because I think Eliezer’s framing will be helpful for a bunch of people working on x-risk. Possibly a minority of people, but not a tiny minority. Per my reply to AI_WAIFU, I think there are lots of people who make the two specific mistakes Eliezer is warning about in this post (‘making a habit of strategically saying falsehoods’ and/or ‘making a habit of adopting optimistic assumptions on the premise that the pessimistic view says we’re screwed anyway’).
The latter, especially, is something I’ve seen in EA a lot, and I think the arguments against it here are correct (and haven’t been talked about much).

Rob Bensinger 2 Apr 2022 18:23 UTC
73 points
in reply to: Delete account’s comment on: MIRI announces new “Death With Dignity” strategy
+1 for asking the 101-level questions! Superintelligence, “AI Alignment: Why It’s Hard, and Where to Start”, “There’s No Fire Alarm for Artificial General Intelligence”, and the “Security Mindset” dialogues (part one, part two) do a good job of explaining why people are super worried about AGI.
“There’s no hope for survival” is an overstatement; the OP is arguing “successfully navigating AGI looks very hard, enough that we should reconcile ourselves to the reality that we’re probably not going to make it”, not “successfully navigating AGI looks impossible / negligibly likely, such that we should give up”.
If you want specific probabilities, here’s a survey I ran last year: https://www.lesswrong.com/posts/QvwSr5LsxyDeaPK5s/existential-risk-from-ai-survey-results. Eliezer works at MIRI (as do I), and MIRI views tended to be the most pessimistic.

Rob Bensinger 30 May 2022 21:30 UTC
LW: 70 AF: 16
AF
in reply to: Yitz’s comment on: Six Dimensions of Operational Adequacy in AGI Projects
It’s been high on some MIRI staff’s “list of things we want to release” over the years, but we repeatedly failed to make a revised/rewritten version of the draft we were happy with. So I proposed that we release a relatively unedited version of Eliezer’s original draft, and Eliezer said he was okay with that (provided we sprinkle the “Reminder: This is a 2017 document” notes throughout).
We’re generally making a push to share a lot of our models (expect more posts soon-ish), because we’re less confident about what the best object-level path is to ensuring the long-term future is awesome, so (as I described in April) we’ve “updated a lot toward existential wins being likelier if the larger community moves toward having much more candid and honest conversations, and generally produces more people who are thinking exceptionally clearly about the problem”.
I think this was always plausible to some degree, but it’s grown in probability; and model-sharing is competing against fewer high-value uses of Eliezer and Nate’s time now that they aren’t focusing their own current efforts on alignment research.

Rob Bensinger 2 Apr 2022 19:16 UTC
70 points
0
in reply to: Rob Bensinger’s comment on: MIRI announces new “Death With Dignity” strategy
A better summary of my attitude:

Rob Bensinger 11 Nov 2021 19:35 UTC
65 points
in reply to: Logan Riggs’s comment on: Discussion with Eliezer Yudkowsky on AGI interventions
Some things that seem important to distinguish here:
- ‘Prosaic alignment is doomed’. I parse this as: ‘Aligning AGI, without coming up with any fundamentally new ideas about AGI/intelligence or discovering any big “unknown unknowns” about AGI/intelligence, is doomed.’
  - I (and my Eliezer-model) endorse this, in large part because ML (as practiced today) produces such opaque and uninterpretable models. My sense is that Eliezer’s hopes largely route through understanding AGI systems’ internals better, rather than coming up with cleverer ways to apply external pressures to a black box.
- ‘All alignment work that involves running experiments on deep nets is doomed’.
  - My Eliezer-model doesn’t endorse this at all.
Also important to distinguish, IMO (making up the names here):
- A strong ‘prosaic AGI’ thesis, like ‘AGI will just be GPT-n or some other scaled-up version of current systems’. Eliezer is extremely skeptical of this.
- A weak ‘prosaic AGI’ thesis, like ‘AGI will involve coming up with new techniques, but the path between here and AGI won’t involve any fundamental paradigm shifts and won’t involve us learning any new deep things about intelligence’. I’m not sure what Eliezer’s unconditional view on this is, but I’d guess that he thinks this falls a lot in probability if we condition on something like ‘good outcomes are possible’—it’s very bad news.
- An ‘unprosaic but not radically different AGI’ thesis, like ‘AGI might involve new paradigm shifts and/or new deep insights into intelligence, but it will still be similar enough to the current deep learning paradigm that we can potentially learn important stuff about alignable AGI by working with deep nets today’. I don’t think Eliezer has a strong view on this, though I observe that he thinks some of the most useful stuff humanity can do today is ‘run various alignment experiments on deep nets’.
- An ‘AGI won’t be GOFAI’ thesis. Eliezer strongly endorses this.
There’s also an ‘inevitability thesis’ that I think is a crux here: my Eliezer-model thinks there are a wide variety of ways to build AGI that are very different, such that it matters a lot which option we steer toward (and various kinds of ‘prosaicness’ might be one parameter we can intervene on, rather than being a constant). My Paul-model has the opposite view, and endorses some version of inevitability.
What links here?
- adamShimi's comment on Discussion with Eliezer Yudkowsky on AGI interventions by Rob Bensinger (15 Nov 2021 14:52 UTC; 121 points)

Rob Bensinger 8 Jun 2021 0:49 UTC
65 points
on: Rob B’s Shortform Feed
Shared with permission, a Google Doc exchange confirming Eliezer still finds the arguments for alignment optimism, slower takeoffs, etc. unconvincing:
Daniel Filan: I feel like a bunch of people have shifted a bunch in the type of AI x-risk that worries them (representative phrase is “from Yudkowsky/Bostrom to What Failure Looks Like ~~part 2~~ part 1”) and I still don’t totally get why.
Eliezer Yudkowsky: My bitter take: I tried cutting back on talking to do research; and so people talked a bunch about a different scenario that was nicer to think about, and ended up with their thoughts staying there, because that’s what happens if nobody else is arguing them out of it.

That is: this social-space’s thought processes are not robust enough against mildly adversarial noise, that trying a bunch of different arguments for something relatively nicer to believe, won’t Goodhart up a plausible-to-the-social-space argument for the thing that’s nicer to believe. If you talk people out of one error, somebody else searches around in the space of plausible arguments and finds a new error. I wasn’t fighting a mistaken argument for why AI niceness isn’t too intractable and takeoffs won’t be too fast; I was fighting an endless generator of those arguments. If I could have taught people to find the counterarguments themselves, that would have been progress. I did try that. It didn’t work because the counterargument-generator is one level of abstraction higher, and has to be operated and circumstantially adapted too precisely for the social-space to be argued into it using words.

You can sometimes argue people into beliefs. It is much harder to argue them into skills. The negation of Robin Hanson’s rosier AI scenario was a belief. Negating an endless stream of rosy scenarios is a skill.
Caveat: this was a private reply I saw and wanted to share (so people know EY’s basic epistemic state, and therefore probably the state of other MIRI leadership). This wasn’t an attempt to write an adequate public response to any of the public arguments put forward for alignment optimism or non-fast takeoff, etc., and isn’t meant to be a replacement for public, detailed, object-level discussion. (Though I don’t know when/if MIRI folks plan to produce a proper response, and if I expected such a response soonish I’d probably have just waited and posted that instead.)

Rob Bensinger 16 Dec 2023 22:28 UTC
63 points
12
on: “AI Alignment” is a Dangerously Overloaded Term
From briefly talking to Eliezer about this the other day, I think the story from MIRI’s perspective is more like:
- Back in 2001, we defined “Friendly AI” as “The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.”
We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren’t going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?
And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI’s intelligence to solve the problem of extrapolating everyone’s preferences in a reasonable way, and of aggregating those preferences fairly.
- Come 2014, Stuart Russell and MIRI were both looking for a new term to replace “the Friendly AI problem”, now that the field was starting to become a Real Thing. Both parties disliked Bostrom’s “the control problem”. In conversation, Russell proposed “the alignment problem”, and MIRI liked it, so Russell and MIRI both started using the term in public.
Unfortunately, it gradually came to light that Russell and MIRI had understood “Friendly AI” to mean two moderately different things, and this disconnect now turned into a split between how MIRI used “(AI) alignment” and how Russell used “(value) alignment”. (Which I think also influenced the split between Paul Christiano’s “(intent) alignment” and MIRI’s “(outcome) alignment”.)
Russell’s version of “friendliness/alignment” was about making the AI have good, human-deferential goals. But Creating Friendly AI 1.0 had been very explicit that “friendliness” was about good behavior, regardless of how that’s achieved. MIRI’s conception of “the alignment problem” (like Bostrom’s “control problem”) included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires, not some proxy goal that might turn out to be surprisingly irrelevant.
Again, we wanted a field of people keeping their eye on the ball and looking for clever technical ways to get the job done, rather than a field that neglects some actually-useful technique because it doesn’t fit their narrow definition of “alignment”.
- Meanwhile, developments like the rise of deep learning had updated MIRI that CEV was not going to be a realistic thing to shoot for with your first AI. We were still thinking of some version of CEV as the ultimate goal, but it now seemed clear that capabilities were progressing too quickly for humanity to have time to nail down all the details of CEV, and it was also clear that the approaches to AI that were winning out would be far harder to analyze, predict, and “aim” than 2001-Eliezer had expected. It seemed clear that if AI was going to help make the future go well, the first order of business would be to do the minimal thing to prevent other AIs from destroying the world six months later, with other parts of alignment/friendliness deferred to later.
I think considerations like this eventually trickled in to how MIRI used the term “alignment”. Our first public writing reflecting the switch from “Friendly AI” to “alignment”, our Dec. 2014 agent foundations research agenda, said:
We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”
Whereas by July 2016, when we released a new research agenda that was more ML-focused, “aligned” was shorthand for “aligned with the interests of the operators”.
In practice, we started using “aligned” to mean something more like “aimable” (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just “getting the AI to predictably tile the universe with smiley faces rather than paperclips”). Focusing on CEV-ish systems mostly seemed like a distraction, and an invitation to get caught up in moral philosophy and pie-in-the-sky abstractions, when “do a pivotal act” is legitimately a hugely more philosophically shallow topic than “implement CEV”. Instead, we went out of our way to frame the challenge of alignment in a way that seemed almost comically simple and “un-philosophical”, but that successfully captured all of the key obstacles: ‘explain how to use an AI to cause there to exist two strawberries that are identical at the cellular level, without causing anything weird or disruptive to happen in the process’.
Since realistic pivotal acts still seemed pretty outside the Overton window (and since we were mostly focused on our own research at the time), we wrote up our basic thoughts about the topic on Arbital but didn’t try to super-popularize the topic among rationalists or EAs at the time. (Which unfortunately, I think, exacerbated a situation where the larger communities had very fuzzy models of the strategic situation, and fuzzy models of what the point even was of this “alignment research” thing; alignment research just became a thing-that-was-good-because-it-was-a-good, not a concrete part of a plan backchained from concrete real-world goals.)
I don’t think MIRI wants to stop using “aligned” in the context of pivotal acts, and I also don’t think MIRI wants to totally divorce the term from the original long-term goal of friendliness/alignment.
Turning “alignment” purely into a matter of “get the AI to do what a particular stakeholder wants” is good in some ways—e.g., it clarifies that the level of alignment needed for pivotal acts could also be used to do bad things.
But from Eliezer’s perspective, this move would also be sending a message to all the young Eliezers: “Alignment Research is what you do if you’re a serious sober person who thinks it’s naive to care about Doing The Right Thing and is instead just trying to make AI Useful To Powerful People; if you want to aim for the obvious desideratum of making AI friendly and beneficial to the world, go join e/acc or something”. Which does not seem ideal.
So I think my proposed solution would be to just acknowledge that ‘the alignment problem’ is ambiguous between three different (overlapping) efforts to figure out how to get good and/or intended outcomes from powerful AI systems:
- intent alignment, which is about getting AIs to try to do what the AI thinks the user wants, and in practice seems to be most interested in ‘how do we get AIs to be generically trying-to-be-helpful’.
- “strawberry problem” alignment, which is about getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
- CEV-style alignment, which is about getting AIs to fully figure out how to make the future good.
Plausibly it would help to have better names for the latter two things. The distinction is similar to “narrow value learning vs. ambitious value learning”, but both problems (as MIRI thinks about them) are a lot more general than just “value learning”, and there’s a lot more content to the strawberry problem than to “narrow alignment”, and more content to CEV than to “ambitious value learning” (e.g., CEV cares about aggregation across people, not just about extrapolation).
(Note: Take the above summary of MIRI’s history with a grain of salt; I had Nate Soares look at this comment and he said “on a skim, it doesn’t seem to quite line up with my recollections nor cut things along the joints I would currently cut them along, but maybe it’s better than nothing”.)
What links here?
- Vladimir_Nesov's comment on The Gemini Incident by Zvi (23 Feb 2024 22:27 UTC; 2 points)

Rob Bensinger 15 Nov 2021 20:46 UTC
LW: 60 AF: 24
AF
on: Ngo and Yudkowsky on alignment difficulty
This is the first post in a sequence, consisting of the logs of a Discord server MIRI made for hashing out AGI-related disagreements with Richard Ngo, Open Phil, etc.
I did most of the work of turning the chat logs into posts, with lots of formatting help from Matt Graves and additional help from Oliver Habryka, Ray Arnold, and others. I also hit the ‘post’ button for Richard and Eliezer. (I don’t plan to repeat this note on future posts in this sequence, unless folks request it.)

Rob Bensinger 12 Nov 2021 18:01 UTC
LW: 60 AF: 24
AF
in reply to: adamShimi’s comment on: Discussion with Eliezer Yudkowsky on AGI interventions
From testimonials by a bunch of more ML people and how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt with experimental work and not doing only decision theory and logic
If you were in the situation described by The Rocket Alignment Problem, you could think “working with rockets right now isn’t useful, we need to focus on our conceptual confusions about more basic things” without feeling inherently contemptuous of experimentalism—it’s a tool in the toolbox (which may or may not be appropriate to the task at hand), not a low- or high-status activity on a status hierarchy.
Separately, I think MIRI has always been pretty eager to run experiments in software when they saw an opportunity to test important questions that way. It’s also been 4.5 years now since we announced that we were shifting a lot of resources away from Agent Foundations and into new stuff, and 3 years since we wrote a very long (though still oblique) post about that research, talking about its heavy focus on running software experiments. Though we also made sure to say:
In a sense, you can think of our new research as tackling the same sort of problem that we’ve always been attacking, but from new angles. In other words, if you aren’t excited about logical inductors or functional decision theory, you probably wouldn’t be excited by our new work either.
I don’t think you can say MIRI has “contempt with experimental work” after four years of us mainly focusing on experimental work. There are other disagreements here, but this ties in to a long-standing objection I have to false dichotomies like:
- ‘we can either do prosaic alignment, or run no experiments’
- ‘we can either do prosaic alignment, or ignore deep learning’
- ‘we can either think it’s useful to improve our theoretical understanding of formal agents in toy settings, or think it’s useful to run experiments’
- ‘we can either think the formal agents work is useful, or think it’s useful to work with state-of-the-art ML systems’
I don’t think Eliezer’s criticism of the field is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’, because (a) he doesn’t think those external-pressures approaches are viable (absent a strong understanding of what’s going on inside the box), and (b) he sees the ‘open the black box’ type work as the critical blocker. (Hence his relative enthusiasm for Chris Olah’s work, which, you’ll notice, is about deep learning and not about decision theory.)