Scott Alexander left an important reply to Rob Bensinger on X. I happen to agree with Scott. Here’s the original post by Rob:
In response to “What did EAs do re AI risk that is bad?”:
Aside from the obvious ‘being a major early funder and a major early talent source for two of the leading AI companies burning the commons’, I think EAs en masse have tended to bring a toxic combination of heuristics/leanings/memes into the AI risk space. I’m especially thinking of some combination of:
‘be extremely strategic and game-playing about how you spin the things you say, rather than just straightforwardly reporting on your impressions of things’
plus ‘opportunistically use Modest Epistemology to dismiss unpalatable views and strategies, and to try to win PR battles’.
Normally, I’m at least a little skeptical of the counterfactual impact of people who have worsened the AI race, because if they hadn’t done it, someone else might have done it in their place. But this is a bit harder to justify with EAs, because EAs legitimately have a pretty unusual combination of traits and views.
Dario and a cluster of Open-Phil-ish people seem to have a very strange and perverse set of views (at least insofar as their public statements to date represent their actual view of the situation):
---
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
If there is some argument for why a problem P might only show up at a higher capability level, or some argument for why a solution S that works well today will likely stop working in the future… well, those are just arguments. Arguments have a terrible track record in AI; the field is full of surprises. So we should stick to only worrying about things when the data mandates it. This is especially important to do insofar as it will help us look more credible and thereby increase our political power and influence.
2. When it comes to technical solutions to AI, the burden of proof is on the skeptic: in the absence of proof that alignment is intractable, we should behave as though we’ve got everything under control. At the same time, when it comes to international coordination on AI, we will treat the burden of proof as being on the non-skeptic. Absent proof that governments can coordinate on AI, we should assume that they can’t coordinate. And since they can’t coordinate, there’s no harm in us doing a lot of things to make coordination even harder, to make our lives a bit more convenient as we work on the technical problems.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
If you’re claiming that now is an important opportunity, and that we should be speaking out loudly about this issue today… well, that sounds risky and downright immodest. Many things are possible, and the future is hard to predict! Taking political risks means sacrificing enormous option value. The humble and safe thing to do is to generally not make too much of a fuss, and just make sure we’re powerful later in case the need arises.
---
1-3 really does seem like an unusually toxic set of heuristics to propagate, potentially worse than replacement.
- In an engineering context, the normal mindset is to place the burden of proof on the engineer to establish safety. There’s no mature engineering discipline that accepts “you can’t prove this is going to kill a ton of people” as a valid argument.
The standard engineering mindset sounds almost more virtue-ethics-y or deontological rather than EA-ish—less “ehh it’s totally fine for me to put billions of lives at risk as long as my back-of-the-envelope cost-benefit analysis says the benefits are even greater!”, more “I have a sacred responsibility and duty to not build things that will bring others to harm.”
Certainly the casualness about p(doom) and about gambling with billions of people’s lives is something that has no counterpart in any normal scientific discipline.
- Likewise, I suspect that the typical scientist or academic that would have replaced EAs / Open Phil would have been at least somewhat more inclined to just state their actual concerns about AI, and somewhat less inclined to dissemble and play political games.
Scientists are often bad at such games, they often know they’re bad at such games, and they often don’t like those games. EAs’ fusion of “we’re playing the role of a wonkish Expert community” with “we’re 100% into playing political games” is plausibly a fair bit worse than the normal situation with experts.
- And EAs’ attempts to play eleven-dimensional chess with the Overton window are plausibly worse than how scientists, the general public, and policymakers normally react to any technology under the sun that sounds remotely scary or concerning or creepy: “Ban it!”
Governments are incredibly trigger-happy about banning things. There’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI. And in fact, when my colleagues and I have gone out and talked to most populations about AI risk, people mostly have much more sensible and natural responses than EAs to this issue.
A way of summarizing the issue, I think, is that society depends on people blurting out their views pretty regularly, or on people having pretty simple and understandable agendas (e.g., “I want to make money” or “I want the Democrats to win”).
Society’s ability to do sense-making is eroded when a large fraction of the “specialists” talking about an issue are visibly dissembling and stretching the truth on the basis of agendas that are legitimately complicated and hard to understand.
Better would be to either exit the conversation, or contribute your actual pretty-full object-level thoughts to the conversation. Your sense of what’s in the Overton window, and what people will listen to, has failed you a thousand times over in recent years. Stop pretending at mastery of these tricky social issues, and instead do your duty as an expert and inform people about what’s happening.
I disagree with all of this on the epistemic level of “it’s not true”, and additionally disagree with your comms strategy of undermining EAs.
On the epistemic level—I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs. I don’t know exactly who you’re talking about, but Holden made a personal blog post saying that his p(doom) was 50%, and said:
>>> “I constantly tell people, I think this is a terrifying situation. If everyone thought the way I do, we would probably just pause AI development and start in a regime where you have to make a really strong safety case before you move forward with it.”
Dario said there’s a 25% chance “things go really, really badly”, and in terms of a pause:
>>> “I wish we had 5 to 10 years [before AGI]. The reason we can’t [slow down and] do that is because we have geopolitical adversaries building the same technology at a similar pace. It’s very hard to have an enforceable agreement where they slow down and we slow down. [But] if we can just not sell the chips to China, then this isn’t a question of competition between the U.S. and China. This is a question between me and Demis—which I am very confident we can work out.”
This is basically my position—I would add “we should try to negotiate with China, but keep this as a backup plan if it fails”, but my guess is Dario would also add this and just isn’t optimistic. I agree he’s written some other things (especially in Adolescence of Technology) that sound weirdly schizophrenic, and more on this later, but I give him a lot of credit for paragraphs like:
>>> “I think it would be absurd to shrug and say, “Nothing to worry about here!” But, faced with rapid AI progress, that seems to be the view of many US policymakers, some of whom deny the existence of any AI risks, when they are not distracted entirely by the usual tired old hot-button issues. Humanity needs to wake up, and this essay is an attempt—a possibly futile one, but it’s worth trying—to jolt people awake.”
Meanwhile, you seem to be treating all these people as basically equivalent to Gary Marcus. I think if you don’t mean these people in particular, you should specify who you’re talking about, and what things that they’ve said strike you in this way.
Absent that, I think this “debate” isn’t about OpenPhil or Anthropic failing to say they’re extremely worried, failing to say that catastrophe is a very plausible outcome, or failing to say that they think slowing down AI would be good if possible. It’s about OpenPhil in particular being pretty careful how they phrase things for public consumption. And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems by doing things like introducing DeepMind to its funders, getting Sam Altman and Elon Musk into AI, and building up excitement around “superintelligence” in Silicon Valley. I think if 2010-MIRI had slightly more strategicness and willingness to ask itself “hey, is this PR strategy likely to backfire?”, you might not have told a bunch of the worst people in the world that AI was going to be super-powerful and that whoever invested in it would be ahead in a race that might make them hundreds of billions of dollars (and yes, you did add “and then destroy the world”—but if you had been more strategic, you might have considered that investors wouldn’t hear that last part as loudly).
(you could argue that you’re not against strategicness in general, just talking about this one issue of saying cleanly that AI is very dangerous. But my impression is that Holden and Dario have said this many times—see examples above. What they haven’t said is “the situation is totally hopeless and every strategy except pausing has literally no chance of working”, but that isn’t a comms problem, that’s because they genuinely believe something different from you. And also, I frequently encounter people who say things like “Scott, I’m glad you wrote about X in way Y—it made me take AI risk seriously, after I’d previously been turned off of it by encountering MIRI”. I think a substantial reason that Dario’s writing sometimes seems schizophrenic when talking about AI risks is that he’s trying to convey that they’re serious while also trying to signal “I swear I’m not one of those MIRI people” so that his writing can reach some of the people you’ve driven away. I don’t think you drive them away because you’re “honest”, I think it’s just about normal issues around framing and theory-of-mind for your audience.)
I don’t actually want to re-open the “MIRI helped start DeepMind and OpenAI!!!” war or the “MIRI is arrogant and alienating!!!” war—we’ve both been through both of these a million times—but I increasingly feel like a chump trying to cooperate while you’re defecting. This is the foundation of my comms worry. Your claim that “governments are incredibly trigger-happy about banning things...there’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI” is too glib—I don’t think there’s ever been a ban on building something as economically-valuable and far-along as AI, executed competently enough that it would work if applied cookie-cutter to the AI situation. You’re trying to do a really difficult thing here. I respect this—all of our options are bad and unlikely to work, the situation is desperate, and I have no plan better than playing a portfolio of all the different desperate hard strategies in the hopes that one of them works. But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.
(I think if you guys had your way, Anthropic would never have been founded, no safety-minded people would ever have joined labs, and the current world would be a race between XAI, Meta, and OpenAI, all of which would have a Yann LeCun style approach to safety, and none of which would have alignment teams beyond the don’t-say-bad-words level. We wouldn’t have the head of the leading AI lab writing letters to policymakers begging them to “jolt awake”, we wouldn’t have a substantial fraction of world compute going to Jan Leike’s alignment efforts, we wouldn’t have Ilya sitting on $50 billion for some super-secret alignment project—just Mark Zuckerberg stomping on a human face forever. In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.)
I support your fight-for-a-pause strategy in theory, and I would like to support it with praxis, but right now I feel very conflicted about this, because I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.
I think that both of these posts seem very confused about the dynamics of who says or thinks what, and I’m pretty sad about them.
Thoughts on Rob’s post
In general, I’ll note that I don’t think Rob really knows many of the OP people; I suspect he has spent <40 hours talking to them about any of this possibly ever. (This is in contrast to e.g. Habryka.) I don’t know where he’s getting his ideas about what the OP people think, but he seems incredibly confused and ignorant. (Eliezer seems similarly ignorant about who believes what.)
‘be extremely strategic and game-playing about how you spin the things you say, rather than just straightforwardly reporting on your impressions of things’ plus ‘opportunistically use Modest Epistemology to dismiss unpalatable views and strategies, and to try to win PR battles’.
I don’t really think this is true.
Dario and a cluster of Open-Phil-ish people seem to have a very strange and perverse set of views
I wish Rob would be clear who he was referring to. Dario has beliefs that seem to me very different from most people who worked on the 2022 AI misalignment risk efforts at Open Phil. (I’m thinking of people like Holden Karnofsky, Ajeya Cotra, Joe Carlsmith, Lukas Finnveden, Tom Davidson. I’ll refer to this as “OP AI people” despite the fact that none of them work at Coefficient Giving (which is what OP renamed itself to).) Maybe Rob is talking about what Alexander Berger thinks?
(at least insofar as their public statements to date represent their actual view of the situation):
I think both Dario and Open Phil staff have been reasonably honest about their beliefs about catastrophic misalignment risk publicly; I think that Dario genuinely thinks it’s <5% and the OP AI people generally think it’s higher. (Tbc I think Dario’s take here is very bad!)
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
This is a reasonable statement of (a simple version of) the Dario/Jared/Anthropic position, but not the OP AI person position. The OP AI people were worried about AI misalignment and ASI enough to try to think it through in detail starting many years ago!
If there is some argument for why a problem P might only show up at a higher capability level, or some argument for why a solution S that works well today will likely stop working in the future… well, those are just arguments. Arguments have a terrible track record in AI; the field is full of surprises. So we should stick to only worrying about things when the data mandates it. This is especially important to do insofar as it will help us look more credible and thereby increase our political power and influence.
This is not what the OP people think, e.g. see 1, 2, 3. It’s a reasonable description of what Dario/Jared say.
2. When it comes to technical solutions to AI, the burden of proof is on the skeptic: in the absence of proof that alignment is intractable, we should behave as though we’ve got everything under control. At the same time, when it comes to international coordination on AI, we will treat the burden of proof as being on the non-skeptic. Absent proof that governments can coordinate on AI, we should assume that they can’t coordinate. And since they can’t coordinate, there’s no harm in us doing a lot of things to make coordination even harder, to make our lives a bit more convenient as we work on the technical problems.
This is not what the OP people think. I think it’s somewhat reasonable to accuse Anthropic of this.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I’ve never felt any pressure to play down my concerns from the OP people. For example, I’ve been in a lot of discussions about whether it’s better for MIRI to be more or less powerful or influential. To me, the main argument that it’s bad for MIRI to be more influential isn’t that MIRI is making a mistake by openly saying that risk is high. It’s that MIRI has beliefs about x-risk that are wrong on the merits which lead them to making unpersuasive arguments and bad recommendations, and they’re in some ways incompetent at communicating.
And I think this is not very representative of what Anthropic thinks. E.g. they don’t really think of themselves as coordinating with other AI-safety-concerned people.
If you’re claiming that now is an important opportunity, and that we should be speaking out loudly about this issue today… well, that sounds risky and downright immodest. Many things are possible, and the future is hard to predict! Taking political risks means sacrificing enormous option value. The humble and safe thing to do is to generally not make too much of a fuss, and just make sure we’re powerful later in case the need arises.
This is somewhere between “strawman” and “just totally confused as a description of what people believe”.
Basically everything else in Rob’s post seems like a strawman.
Overall, I think this post is extremely confused, and Rob should be ashamed of writing such incredibly strawmanned things about what someone else thinks.
I recommend that people place very little trust in claims Rob makes about what other people believe. As someone who knows and talks regularly to the “Open Phil AI people”, I seriously think that Rob has no idea what he’s talking about when he ascribes arguments to them.
I guess there’s the question of what we are supposed to do if, in fact, the OP people agree with Rob’s version of their position but publicly deny that—at that point we’d have to do some brutal adjudication based on confusing private evidence or inferences from public actions and statements. I really don’t think that looking into that evidence would support Rob’s claims.
Thoughts on Scott’s post
I disagree with all of this on the epistemic level of “it’s not true”, and additionally disagree with your comms strategy of undermining EAs.
I don’t really think of Rob or MIRI as having a comms strategy of undermining EAs. I think Rob and Eliezer just say a bunch of false, wrong things about EAs because they’re mad at them for reasons downstream of the EAs not agreeing with Eliezer as much as Eliezer and Rob think would be reasonable, and a few other things.
On the epistemic level—I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs.
Some EAs engage in equivocation and shyness about their beliefs; OP AI people less than many others.
Absent that, I think this “debate” isn’t about OpenPhil or Anthropic failing to say they’re extremely worried, failing to say that catastrophe is a very plausible outcome, or failing to say that they think slowing down AI would be good if possible.
I think Dario (like various other Anthropic people) does not believe that AI takeover is a very plausible outcome, and I think his position is indefensible on the merits, as are some of his other AI positions (e.g. his skepticism that there are substantial returns to intelligence above the human level, his skepticism that ASI could lead to 2x manufacturing capacity per year). He moderately disagrees with the OP people about this.
And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems by doing things like introducing DeepMind to its funders, getting Sam Altman and Elon Musk into AI, and building up excitement around “superintelligence” in Silicon Valley. I think if 2010-MIRI had slightly more strategicness and willingness to ask itself “hey, is this PR strategy likely to backfire?”, you might not have told a bunch of the worst people in the world that AI was going to be super-powerful and that whoever invested in it would be ahead in a race that might make them hundreds of billions of dollars (and yes, you did add “and then destroy the world”—but if you had been more strategic, you might have considered that investors wouldn’t hear that last part as loudly).
I don’t totally understand what point Scott is trying to make here, but I think this point is quite unfair.
(you could argue that you’re not against strategicness in general, just talking about this one issue of saying cleanly that AI is very dangerous. But my impression is that Holden and Dario have said this many times—see examples above. What they haven’t said is “the situation is totally hopeless and every strategy except pausing has literally no chance of working”, but that isn’t a comms problem, that’s because they genuinely believe something different from you.
Agreed
And also, I frequently encounter people who say things like “Scott, I’m glad you wrote about X in way Y—it made me take AI risk seriously, after I’d previously been turned off of it by encountering MIRI”. I think a substantial reason that Dario’s writing sometimes seems schizophrenic when talking about AI risks is that he’s trying to convey that they’re serious while also trying to signal “I swear I’m not one of those MIRI people” so that his writing can reach some of the people you’ve driven away. I don’t think you drive them away because you’re “honest”, I think it’s just about normal issues around framing and theory-of-mind for your audience.)
I think Scott is blaming MIRI much too much here. Dario’s main difficulty when arguing that he thinks AI will pose huge catastrophic risk in the next few years is that lots of people think this seems implausible on priors, not because those people were specifically turned off by MIRI making related arguments earlier. His core audience has never heard of MIRI.
But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.
I think this is an incorrect read. Some people from PauseAI and MIRI criticize AI safety efforts a lot, often in ways I think are really dumb and counterproductive. But I don’t think they’re doing this as part of a strategy to force people into their strategies; it’s because of some combination of them genuinely (but perhaps foolishly) thinking that the other strategies are bad and/or the people executing them are corrupt.
(I think if you guys had your way, Anthropic would never have been founded, no safety-minded people would ever have joined labs, and the current world would be a race between XAI, Meta, and OpenAI, all of which would have a Yann LeCun style approach to safety, and none of which would have alignment teams beyond the don’t-say-bad-words level. We wouldn’t have the head of the leading AI lab writing letters to policymakers begging them to “jolt awake”, we wouldn’t have a substantial fraction of world compute going to Jan Leike’s alignment efforts, we wouldn’t have Ilya sitting on $50 billion for some super-secret alignment project—just Mark Zuckerberg stomping on a human face forever. In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.)
I disagree with a lot of the claims here about how various aspects of the current situation are good. (E.g. why does he think that Ilya is doing an alignment effort?)
I support your fight-for-a-pause strategy in theory, and I would like to support it with praxis, but right now I feel very conflicted about this, because I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.
It’s unclear what “you guys” means. I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them. I think Rob is totally unreasonable and I wish he would stop working on AI safety, but I think he’s much worse than e.g. MIRI is overall. I think MIRI spends very little of their support on knifing AI safety advocates, they spend almost all of it on advocating for people being scared about misalignment risk and advocating for AI pauses (which I am generally in favor of). Eliezer totally does have a hobby of saying ridiculously strawmanny stuff about OP AI people, which I find pretty annoying, but I don’t think it’s a big part of his effect on the world.
---
Overall, both posts seem to have substantially inaccurate pictures of what’s going on and what various actors think.
Thanks for writing this, Buck. I’m not going to try to reply to your whole post, because I think some of it is stuff I should chew on for longer and see whether I agree with it. But going through some of your points:
I definitely apologize for making it sound like I was making a harsher criticism of (the relevant parts of) EA than I intended. My tweet was originally written as a quick follow-up comment to someone who asked why I thought EA’s impact on AI x-risk was only ~55% likely to be positive. I turned it into a top-level tweet because I didn’t want to hide it deep in an existing discussion, but this was an error given I didn’t add extra context.
I also apologize for anything I said that made it sound like I was universally criticizing past or present Open Phil / cG staff (or centrally basing my views on first-hand conversations, for that matter). I already believed that tons of past and present rank-and-file OP/cG staff have very reasonable views, and I happily further update in that direction based on your and Oliver’s statements to that effect (e.g., Ollie’s “I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs”).
I agree that my characterization of “Dario and a cluster of Open-Phil-ish people” was phrased in a needlessly confusing and sloppy way. I wanted to talk about a mix of ‘present-day views that seem to be endorsed by Dario and some other key figures’ and ‘general tendencies and memes that seem pretty widespread and that seem suspiciously related to choices EA leadership made many years ago’, but blurring these together is really unnecessarily confusing. Also, it didn’t help that I was sarcastically embedding my criticisms into my summaries of the views.
Insofar as my broad criticism of EA cultural trends/memes is correct (which I think is substantial), I still feel a fair bit of uncertainty about how to divvy up responsibility between more Open-Phil-ish people, more Oxford-ish people, MIRI / the rats, etc. And of course, some of the problem may stem from broader social-or-demographic factors that no EA leaders tried to engineer, and that even go counter to how leadership has tried to optimize. (I too remember the early speeches themed around “Keep EA Weird”, the early EA-leader conversations fretting about overly naive EA consequentialism, etc.)
Thanks, this is helpful and I basically accept most of what you’re saying. Some more specific comments on the part about me:
I don’t really think of Rob or MIRI as having a comms strategy of undermining EAs. I think Rob and Eliezer just say a bunch of false, wrong things about EAs because they’re mad at them for reasons downstream of the EAs not agreeing with Eliezer as much as Eliezer and Rob think would be reasonable, and a few other things.
I accept this criticism and take back my claim. I noticed that some people who worked for MIRI comms seemed to do this, and I assumed that anything said by enough MIRI comms people in a serious-sounding voice was on some level a MIRI communique. Eliezer has clarified that this isn’t true, so I apologize for saying it was.
I think Dario (like various other Anthropic people) does not believe that AI takeover is a very plausible outcome, and I think his position is indefensible on the merits, as are some of his other AI positions (e.g. his skepticism that there are substantial returns to intelligence above the human level, his skepticism that ASI could lead to 2x manufacturing capacity per year). He moderately disagrees with the OP people about this.
I basically agree with this (while wanting to clarify that I think he assigns a pretty high risk to permanent dictatorship or something along those lines) but I think he’s done an okay job of navigating uncertainty, realizing that even a low chance of human extinction is very bad, and being willing to (somewhat) cooperate and collect gains-from-trade with people who are doomier than he is. I see him as living in a consistent worldview next door to our movement’s (sort of like Vitalik or Dean Ball) and I think that, like those two people, he’s potentially somewhere between a friend / an ally-of-convenience / a negotiating partner, potentially convertible into a full ally if future events prove us right, or into a true enemy if we pre-emptively alienate him. Having someone like this in charge of a frontier lab is better than I expected (Demis might also be in this category, but I’m not sure, and worry that Larry and Sergey have final say).
I think Scott is blaming MIRI much too much here. Dario’s main difficulty when arguing that he thinks AI will pose huge catastrophic risk in the next few years is that lots of people think this seems implausible on priors, not because those people were specifically turned off by MIRI making related arguments earlier. His core audience has never heard of MIRI.
I agree that Dario is slightly being a jerk here, but I think that people have lots of stereotypes of “doomers” which derive from some real behavior of MIRI and PauseAI, and which wouldn’t exist if the median PauseAI person were e.g. the median Constellation person, and I think Dario feels some understandable incentive to distance himself from this.
I disagree with a lot of the claims here about how various aspects of the current situation are good. (E.g. why does he think that Ilya is doing an alignment effort?)
I have no useful knowledge here, but Ilya seems genuinely alignment-pilled and terrified, the fact that he did the very courageous and self-sacrificing thing of trying to blow up OpenAI to try to get rid of Altman for what were mostly safety-related reasons speaks well of him, and IDK, he’s calling it “safe superintelligence” and saying he won’t release anything at all until he’s sure. I don’t claim any secret expertise in Ilya-ology but overall all of this seems encouraging and I’m surprised this part of my tweet attracted so much dissent.
It’s unclear what “you guys” means. I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them. I think Rob is totally unreasonable and I wish he would stop working on AI safety, but I think he’s much worse than e.g. MIRI is overall. I think MIRI spends very little of their support on knifing AI safety advocates, they spend almost all of it on advocating for people being scared about misalignment risk and advocating for AI pauses (which I am generally in favor of). Eliezer totally does have a hobby of saying ridiculously strawmanny stuff about OP AI people, which I find pretty annoying, but I don’t think it’s a big part of his effect on the world.
I mostly accept your criticism that I should narrow my objections from “MIRI & Co” to “PauseAI, Rob, maybe sort of Eliezer, & a slightly different co”. I don’t really know how to do this or what one word covers all of them without inflicting different forms of collateral damage (I don’t want to say “PauseAIers” because that also covers some people I like, and it feels extra-aggressive to name specific names), but I’m open to suggestion.
I’m generally sympathetic to Scott’s positions in this discussion, but I think he is probably very wrong about Ilya.
To the best of my knowledge, Safe Superintelligence has never published a single word about what they plan to do to move alignment forward, which is pretty damning, in my opinion.
I have not heard of anyone who is known to be thoughtful about AI safety being hired by SSI, and I have not seen any positions being advertised to AI safety people. People should correct me if I missed someone good joining SSI, but I think this is also a very bad sign.
My impression is that people who worked with Ilya at OpenAI don’t remember him as being particularly thoughtful about alignment, e.g. much less so than Jan Leike. This is a low-confidence, third-hand impression; people can correct me if I’m wrong.
My impression is that the available evidence suggests that Ilya mostly took part in Altman’s firing because of (perhaps justified) office-politics grievances, and not primarily due to safety concerns. I also think that the evidence points to his behavior during and after the incident being kind of cowardly. (I haven’t looked deeply into the details of the battle of the board, and it’s possible I’m wrong on this point, in which case I apologize to Ilya.) I’m also doubtful of how self-sacrificing his actions were—my best guess is that his current net worth is higher (at least on paper) than it would be if he had stayed at OpenAI.
I expect that at some point SSI’s investors will grow impatient, and then SSI will start coming out with AI products (perhaps open-source to be cooler), just like everyone else. I don’t expect them to contribute too much to safety, though maybe Ilya will sometimes make some noises about the importance of safety in public speeches, which is nice I guess.
I’m pretty confident in my first two points, much less so in the next two, but I felt someone should respond to Scott on this point. Perhaps @Buck or someone else who expressed skepticism of Ilya’s project can add more information.
In general, I’ll note that I don’t think Rob really knows many of the OP people; I suspect he has spent <40 hours talking to them about any of this possibly ever.
I think you are overfitting Rob’s post to be about the wrong people. I think it’s much closer to accurate, if you actually read what he says, which is:
Dario and a cluster of Open-Phil-ish people
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe, which seems well-described by “Dario and a cluster of Open-Phil-ish people”, and furthermore also of course constitutes an enormous fraction of the authority over broader EA.
I feel like almost all of your comment is just running with that misunderstanding and hence mostly irrelevant.
As you say yourself, almost no one in your list works at cG, or is in any meaningful position of authority at cG, so this feels like a bit of an absurd interpretation (I think trying to apply the things he is saying to Holden is reasonable, given Holden’s historical role in cG, and I do think he in the distant past said things much closer to this, but seems to have changed tack sometime in the past few years).
As you say yourself, almost no one in your list works at cG, or is in any meaningful position of authority at cG, so this feels like a bit of an absurd interpretation
A lot of Rob’s complaints are about things that happened in the past, so I don’t think it’s crazy to interpret him as talking about people who worked at CG in the past.
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe, which seems well-described by “Dario and a cluster of Open-Phil-ish people”, and furthermore also of course constitutes an enormous fraction of the authority over broader EA.
I think that these people believe different things, and I don’t think Rob’s post particularly accurately describes any of them. For example, the Anthropic leadership doesn’t really think of themselves as trying to coordinate with AI safety people or trying to suppress them. I don’t think Alexander thinks “AI is going to become vastly superhuman in the near future” (and fwiw I don’t think Dario thinks that either, he doesn’t seem to believe in returns to intelligence substantially above human-level).
A lot of Rob’s complaints are about things that happened in the past, so I don’t think it’s crazy to interpret him as talking about people who worked at CG in the past.
Fair enough. I think that the people you list also used to believe things closer to what Rob is saying in the past, so at least we need to do a consistent comparison. Holden from 10 years ago seems to say a lot of the things that Rob is saying here, and Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
My guess is that it is worth digging up quotes here, but it’s a lot of work, so I am not going to do it for now, but if it turns out to be cruxy, I can.
(Again, I don’t think these are centrally the people Rob is talking about in either case. I think centrally he is talking about Anthropic, and then secondarily talking about how Open Phil people have related to Anthropic over the years, but I do still think his criticism is correct directionally for those people)
I don’t think Alexander thinks “AI is going to become vastly superhuman in the near future” (and fwiw I don’t think Dario thinks that either, he doesn’t seem to believe in returns to intelligence substantially above human-level).
I think Alexander abstractly believes that AI could very well become vastly superhuman in the near future, but yes, like Dario, he does not believe that speculating about such a thing in a non-scientific, non-empirical way is appropriate, and as such they do not have coherent beliefs about this. Indeed, it seems like a really quite central match to what Rob is saying.
Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
I don’t remember anything like this. I think it might be misremembered or a strained interpretation.
Here are points 1 and 3 for reference:
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I asked ChatGPT to read bioanchors (where I thought this was most likely to occur), and then to read all of her other writings looking for anything that fits that mode. Here’s its reply, not finding anything.
The closest match it finds is that Ajeya often caveats her claims. For example from bio anchors:
This is a work in progress and does not represent Open Philanthropy’s institutional view […] Accordingly, we have not done an official publication or blog post, and would prefer for now that people not share it widely in a low bandwidth way.
Huh, I am a bit confused about you summarizing that ChatGPT response that way. Maybe we are talking past each other, but Robby’s statements are not intended as the kind of statement that passes people’s ITT (which IMO is fine, frequently summaries of other people’s views should not pass their ITT, though it should ideally be caveated when this is going on).
Despite that, your ChatGPT transcript says:
Yes—there are clear resonances with both of your points, though mostly as counterpressures or explicit methodological caveats rather than direct endorsements. The strongest matches are in how Cotra frames forecasting discipline under radical uncertainty and how she handles communication norms around high-stakes speculative claims.
I am not expecting any direct endorsements of these statements (which are phrased as to make their internal contradictions most obvious), so this ChatGPT response seems compatible with what I am saying?
When I asked ChatGPT to “rephrase these two beliefs in more neutral language that would make more sense for someone to endorse (but try to pretty tightly imply the above)” it gave these two:
1. AI may become far more capable soon, but risk assessment should remain tightly tied to currently observable systems and evidence, not to conjectures about novel future dangers.
3. AI risk advocates should be selective and disciplined in how they present their concerns, emphasizing messages that are most likely to preserve credibility, attract allies, and strengthen their long-term influence.
Using Cotra’s public bio-anchors materials that I could directly inspect — especially her draft-report announcement, her long AXRP explanation of the framework, and later timeline/milestone essays — my read is: your first point gets a qualified yes, while your third point gets a strong yes.
But also, when we are in the domain of “evaluate whether Ajeya said things that imply the things above and result in other people getting the same vibe as the above”, then ChatGPT and Claude seem like much worse judges, so I think this question becomes more difficult to answer and I wouldn’t super defer to the language models (and is part of why I expected it would take a while to dig up quotes and do the work and stuff).
(If you want to complain that Robby should have caveated his stuff more as not being the kind of thing that passes people’s ITT, then I am happy to argue about that. I think a better post would have done it, but it’s not something I think is always necessary to do.)
(Also, just for the sake of completeness: I don’t get this vibe from Ajeya at all these days and have no complaints on this front, besides probably still some strategic disagreement on stuff around point 3, but, like, at the level that I have with many people I respect, almost certainly including you)
Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2)
I interpreted you as claiming that Ajeya had said “things more like:”
In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I don’t recall any examples of Ajeya saying or implying anything at all like that. I asked ChatGPT to try to find examples and I think it didn’t find anything.
In your ChatGPT session, a typical example it cites is:
In the AXRP discussion, she also says there were concerns that making the report seem too slick or official could increase capabilities interest.
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.
In your ChatGPT session, a typical example it cites is:
In the AXRP discussion, she also says there were concerns that making the report seem too slick or official could increase capabilities interest.
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.
I have no interest in defending ChatGPT’s claims here, and feel like I caveated that pretty explicitly. I agree that quote is largely irrelevant.
I asked ChatGPT to try to find examples and it didn’t find anything.
Yep, I agree with you that ChatGPT did not find any clear quotes (though it doesn’t look like ChatGPT tried very hard to find quotes). I disagree that it didn’t find “anything at all like that” (indeed ChatGPT is quite explicit that it found some things “kind of like that”).
I don’t recall any examples of Ajeya saying or implying anything at all like that.
I do. As I said, I could go and dig them up but it would take quite a while, and I am only like 75% confident they are written up as opposed to being conversations, or private Google Docs or something that I would have trouble finding. It was a strong vibe I got at the time and I remember having a few adjacent conversations, either with Ajeya or about Ajeya.
Let me know if you want me to do this. I don’t quite know what’s at stake here for you, and I feel somewhat like we are talking past each other and before I do that it would be more productive to go up some meta-level, but I am not quite sure.
I think you’re right, and also it seems misleading / like a bad clustering to lump “the EAs” in with “Anthropic’s leadership”. I think those groups have some memetic connections, but they’re not the same group!
I feel like it’s more of a reasonable carving to lump in OpenPhil with “the EAs”, since they were/are effectively EA thought-leaders and they exerted a lot of influence, directly and indirectly.
I think you’re right, and also it seems misleading / like a bad clustering to lump “the EAs” in with “Anthropic’s leadership”. I think those groups have some memetic connections, but they’re not the same group!
More than 50% of the talent-weighted safety people in EA are literally employees of Anthropic! The ex-CEO of Open Phil now works at Anthropic, and is married to one of its founders. These groups have enormous overlap.
Like, there is such enormous overlap, and the overlap results in such an enormous amount of de-facto deference (being an employee of a company is approximately the strongest common deference relationship we have) that it makes sense to think of these as closely intertwined.
Yes, there are people who attach the EA label themselves who are different here, sometimes even quite substantial clusters. But it’s also IMO clear from Scott’s response that he himself is also majorly deferring and is majorly supportive of Anthropic as a representative of EA, so this clearly isn’t just a split between “everyone who works at Anthropic and everyone who doesn’t”.
Rob used “Open Phil” exactly two times. One time saying “Dario and a cluster of Open-Phil-ish people” and another time “EAs / Open Phil” in reference to the broader community that includes all of these things. These seem like totally reasonable ways of using these pointers and words. I don’t have anything better. It’s definitely not “just Anthropic” as I think Scott very unambiguously demonstrates, and it would be of course extremely confusing to refer to Scott as “Anthropic”.
Imagine, re Open Phil and hardcore rationalists: “the ex-CEO of MIRI now works at Open Phil, and the CEO of Lightcone is dating an Open Phil employee. These groups have enormous overlap.”
Yes. People can have a lot of social overlap, yet have very different views from one another, especially in the broader Bay Area intellectual ecosystem. My sense is that Anthropic leadership has very different views from most AI safety EAs.
More than 50% of the talent-weighted safety people in EA are literally employees of Anthropic!
Why do you think this? I’m skeptical this is true, especially if you’re including non-technical talent.
Why do you think this? I’m skeptical this is true, especially if you’re including non-technical talent.
IDK, I counted them? I made some spreadsheets over the years, and ran this number by a bunch of other people, and my current guess is that it’s around 55%? When I list organizations with full-time employees working in safety I actually end up at substantially above 50% of people working at Anthropic, but I think that’s overcounting.
My sense is that Anthropic leadership has very different views from most AI safety EAs.
I think there are differences and overlaps. I think Rob points to a thing that is shared across a cluster that spans both of them, and has historically had a lot of influence.
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe. I feel like almost all of your comment is just running with that misunderstanding.
But aren’t Alexander Berger’s views not very relevant to OpenPhil’s AI strategy decisions from many years ago, when their AI strategy and worldview—which I take to be very close to the things Rob was criticizing—were worked out and started shaping the views of EAs in OpenPhil’s orbit?
Even now, when people criticize things OpenPhil has done in the past in the AI landscape, or criticize their general worldview and takes on AI risk (as it was developed in influential pieces of writing), I am by default automatically viewing it as criticism of Holden, Ajeya Cotra, Tom Davidson, Joe Carlsmith, etc. If people don’t intend me to interpret them that way, please be more clear. 🙂
I’m aware that, separately, OpenPhil/Coefficient Giving has undergone quite a transition and that you clashed badly with Dustin M. I think that’s very sad and unfortunate, but I think of these as quite distinct things and I never assumed that the thing with Dustin M. had anything to do with OpenPhil’s AI strategy decisions from (say) five years ago (edit: sorry, that sounds like a strawman, but I mean something like “I’m not sure the same cause explains why some people who were at OpenPhil in the past found MIRI epistemically off-putting, and why Dustin M finds the rationalists to be a reputation risk & thinks reputation risks are unusually bad compared to other bad things.”) I could be wrong, of course, and maybe you think the org has a general thing of valuing “reputability” and “playing politics” too much. I just want to note that it’s not obvious how much these things are connected/caused by one “OpenPhil culture,” vs being about distinct things. (I think some of these are maybe directionally accurate as criticism, btw.)
I’m sure this is obvious to everyone involved, but I also just want to point out that when a lot of senior people leave, organizations can change really a lot, so it would be weird to speak of OpenPhil/Coefficient Giving now as though it were obviously still the same entity/culture.
But aren’t Alexander Berger’s views not very relevant to OpenPhil’s AI strategy decisions from many years ago, when their AI strategy and worldview—which I take to be very close to the things Rob was criticizing—were worked out and started shaping the views of EAs in OpenPhil’s orbit?
I think Holden at the time believed something closer to what Rob says here (though it’s still not an amazing fit), and more generally, I think “the beliefs of the successor CEO” are actually a better proxy for “the vibes of the broader ecosystem you are part of” than “the beliefs of the founder CEO”. I could go into more detail on my beliefs on this, though I think the argument is reasonably intuitive.
but I think of these as quite distinct things and I never assumed that the thing with Dustin M. had anything to do with OpenPhil’s AI strategy decisions from (say) five years ago
Yep, I think they are highly related. Indeed, I was predicting things like the Dustin thing without any knowledge of Dustin’s specific beliefs, and my predictions were primarily downstream of seeing how Anthropic’s position within the ecosystem was changing, and a broader belief-system that I think is shared by many people in leadership, not just Dustin.
I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs, but also updated that those things end up mattering surprisingly little for what actually ends up a strategic priority.
I just want to note that it’s not obvious how much these things are connected/caused by one “OpenPhil culture,” vs being about distinct things. (I think some of these are maybe directionally accurate as criticism, btw.)
I think the “OpenPhil culture” thing is a distraction. In my model of the world most of this is downstream of people being into power-seeking strategies mostly from a naive-consequentialist lens, which is not that unique to OpenPhil within EA (and if anything OpenPhil has some of the people with the best antibodies to this, though also a lot of people who think very centrally along these lines, more concentrated among current leadership).
I think some of the people who are best at thinking independently about stuff, and are pretty good at not getting swept up in the power-seeking stuff, work at Open Phil. I think Holden genuinely helped with some of the correct cultural pieces, and my current belief is that if he weren’t under more pressure than just about anyone, he would probably have a relatively sane relationship to Anthropic as a result, though I am not as confident about that as I am that he had a bunch of quite good cultural pieces that help people be less naively power-seeking here.
Honestly, this is such a bad reply by Scott that I… don’t quite know whether I want to work on all of this anymore.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I have issues with Rob’s top-level tweet. I think it gets some things wrong, but it points at a real dynamic. It’s kind of strawman-y about things, and this makes some of Scott’s reaction more understandable, but his response overall seems enormously disproportionate.
Scott’s response is extremely emblematic of what I’ve experienced in the space. Simultaneous extreme insults and obviously bad-faith arguments (“actually, it’s your fault that DeepMind was founded because you weren’t careful enough with your comms”), and then gaslighting that no one faces any censure for being open about these things (despite the very thing you are reading being extremely aggro about the lack of strategic communication), and actually we should be happy that Ilya started another ASI lab, and that Jan Leike has some compute budget.
The whole “no you are actually responsible for DeepMind” thing, in a tweet defending that it’s great that all of our resources are going into Anthropic, is just totally absurd. I don’t know what is going on with Scott here, but this is clearly not a high-quality response.
Copying my replies from Twitter, but I am also seriously considering making this my last day. It’s not the kind of decision to be made at 5AM in the morning so who knows, but seriously, fuck this.
IMO this doesn’t seem like the kind of response you will endorse in a few days, especially the “You are responsible for DeepMind/OpenAI” part.
You were also talking about AI close to the same time, and you’ve historically been pretty principled about this kind of stance.
You could argue that you’re not against strategicness in general, and are just talking about this one issue of saying cleanly that AI is very dangerous.
Robby at least has been very consistent on this: he is against most forms of strategic communication in general.
I also think you are against many forms of strategic communication in general? Your writing explores many of the relevant considerations in a lot of depth, and you certainly have not shied away from sharing your opinion on controversial issues, even when it wasn’t super clear how that is going to help things.
I think you are just arguing the wrong side of this specific argument branch. My model of Eliezer, Nate and Robby all have been pretty consistent that being overly strategic in conversation usually backfires. Of course you shouldn’t have no strategy, and my model of Eliezer in particular has in the past been too strategic for my tastes and so might disagree with this, but I am pretty confident Robby himself is just pretty solidly on the side of “it’s good to blurt out what you believe, *especially* if you don’t have any good confident inside-view model of how to make things better”.
In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
This makes this comparison just weird. Yes, according to everyone’s models the only time you might have the political will to stop will be in the future. I have never seen Nate or Eliezer or Robby say that they expect to get a stop tomorrow. But they of course also know that getting in a position to stop takes a long time, and the right time to get started on that work was yesterday.
So if they had their way (with their present selves teleported back in time), we would have more draft treaties and more negotiation between the U.S. and China. More materials ready to hand to congresspeople who are trying to grapple with all of this stuff. Essays and books and movies and videos explaining the AI existential risk case straightforwardly to every audience imaginable.
That is what you could do if you took the 200+ risk-concerned people who ended up instead going to work at Anthropic, or ended up trying to play various inside-game politics things at OpenAI.
And man, I don’t know, but that just seems like a much better world. Maybe you disagree, which is fine, but please don’t create a strawman where Robby or Nate or Eliezer were ever really centrally angling for a short-term pause that would have already passed by then.
And then even beyond that, if you don’t know how to solve a problem, I think it is generally the virtuous thing to help other people get more surface area on solving it. Buying more time is the best way to do that, especially buying time now when the risks are pretty intuitive. I think you believe this too, and I don’t really know what’s going on with your reaction here.
But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.
Come on man, a huge number of people we both respect have recently updated that the kind of direct advocacy that MIRI has been doing has been massively under-invested in. I do not think that “other people are executing this portfolio plan admirably”, and this is just such a huge mischaracterization of the dynamics of this situation that I don’t know where to start.
“If Anyone Builds It, Everyone Dies” is a straightforward book. It doesn’t try to sabotage every other strategy in the portfolio, and I have no idea how you could characterize really any of the media appearances of Nate this way.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy and making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where absolutely there have been tons of attempts to sabotage people trying to pursue strategies like this.
Like man, we can maybe argue about the magnitude of the errors here, and the sabotage or whatever, but trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
We wouldn’t have the head of the leading AI lab writing letters to policymakers begging them to “jolt awake”, we wouldn’t have a substantial fraction of world compute going to Jan Leike’s alignment efforts, we wouldn’t have Ilya sitting on $50 billion for some super-secret alignment project
I am sympathetic to the first of these (but disagree you are characterizing Dario here correctly).
But come on, clearly Ilya sitting on $50 billion for starting another ASI company is not good news for the world. I don’t think you believe that this is actually a real ray of hope.
(And then I also don’t think that Jan Leike having marginally more compute is going to help, but maybe there is a more real disagreement here)
Overall, I am so so so tired of the gaslighting here.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I don’t think Scott speaks for the ecosystem. He’s just a guy in it, and one who isn’t even that closely connected to Anthropic or Coefficient Giving people. (E.g. you spend >10x as much time talking to people from those orgs as he does.) I think that the people in the ecosystem you’re criticizing would not approve of Scott’s post.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy and making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where absolutely there have been tons of attempts to sabotage people trying to pursue strategies like this.
I think this is not a good summary of what Coefficient Giving has done. (I do think it really sucks that they defunded Lightcone.)
I think that the people in the ecosystem you’re criticizing would not approve of Scott’s post.
I think this is false. I expect Scott’s post to be heavily upvoted, to have an enormously positive agree/disagree ratio if it were posted to the EA Forum, and in general for people to believe something pretty close to it.
There are a few exceptions (somewhat ironically, a good chunk of the cG AI-risk people), but they would be relatively sparse. I think this is roughly how someone who is smart, but doesn’t have a strong inside-view take about what they should do about AI risk, believes they should act if they want to be a good member of the EA community. My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
He’s just a guy in it, and one who isn’t even that closely connected to Anthropic or Coefficient Giving people.
The issue is of course not that Scott is right or wrong about what Anthropic or cG people believe. The issue is that he seems to be taking a view where you should be super strategic in your communications, sneer at anyone who is open about things, and measure your success in how many of your friends are now at the levers of power.
I think this is not a good summary of what Coefficient Giving has done.
I think cG’s funding decisions were really very centrally about trying to punish people who weren’t being strategic in their communications in the way that Dustin wanted them to be strategic in their communications.
I think other “all kinds of complicated adversarial shit” has also happened, though it’s harder to point to. At a minimum I will point to the fact that invitation decisions to things like SES have followed similar adversarial “you aren’t cooperating with our strategic communications” principles.
I think this is false. I expect Scott’s post to be heavily upvoted, to have an enormously positive agree/disagree ratio if it were posted to the EA Forum, and in general for people to believe something pretty close to it.
The EA Forum is a trash fire, so who knows what would happen if this was published there.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me (which is ofc part of why I sometimes bother commenting on things like this).
My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
I think that Scott’s post would not overall be received positively by those people. Maybe you’re saying that one of the directions argued for by Scott’s post is approved of by those people? I agree with that more.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me
Well, I mean, that is a hard conditional to be false, since if people were to not change their mind, this would largely invalidate the premise that they are inclined to defer to you. Unfortunately, I both think the vast majority of places in EA do not defer to you or people like you, and furthermore, I also think you are pretty importantly wrong about your criticisms, so I don’t quite know how to feel about this.
I do think it helps and am marginally happy about your cultural influence here (though it’s tricky, I also think a bunch of your takes here are quite dumb). I think the vast majority of the cultural influence here is downstream of not quite anyone in particular, but more Anthropic than anywhere else, and neither you nor I can change that very much.
I think that Scott’s post would not overall be received positively by those people.
Yeah, I expect it to be straightforwardly positively received. I think people will be like “some parts of this seem dumb, the Ilya thing in particular, but yeah, fuck those rationalists and MIRI people, I am with Scott on that”.
To be clear, I am not expecting consensus here. I think this will be what 75% of people who have any opinion at all on anything adjacent to this believe, but I expect people would broadly think it’s a good contribution that properly establishes norms and reflects how they think about things.
I also think it’s plausible people would be like “wow, what an uncouth way for both of these people to interface with each other, please get away from each other, children”, but then actually if you talked to them afterwards, they would be like “yeah, I mean, that was a bit of a shitshow but I do think Scott was basically right here (minus 1-2 minor things)”.
I am not enormously confident on this, but it matches my experiences of the space.
I agree with Habryka that absent criticism Scott’s post would be well received by an important group of people reasonably characterized as EA-ish AI safety people.
Imo absent criticism Rob’s post would be well received by a different group of people reasonably characterized as doomers. (Literally right before seeing this thread I saw another post on LW that is directionally correct but is mostly wrong or exaggerated in its details, and that was very well received.)
Both posts are broadly wrong about lots of things, about equally so, such that most people would be better off having never encountered either of them.
Tbc, my first-order intuitive impression is that Scott’s post is much more directionally accurate. But I expect that is because I constantly experience people knifing me, pushing me to take strategies that systematically destroy my ability to do anything while gaining approximately no safety benefit, or making claims about members of groups that include me that are false of me, whereas I don’t really experience any of the stuff that Rob gestures at, even though I expect it exists. Though Rob’s post doesn’t actually inform me of it, because his actual claims are false, and I cannot infer the underlying experiences that led him to make them. Another example of trapped priors if you don’t have second order corrections. (Tbc his follow-up post makes this substantially clearer.)
You probably already know I think this, but imo you should both give up on making public discourse in the AI safety community non-insane, and do other things that have a shot at working. (Since I know this will be misinterpreted by other readers, let me be clear that there are plenty of other kinds of public writing that do not fall in that bucket which I do think are worth doing.)
I endorse you taking the space to figure out how you want to relate and doing what’s right for you. I’ve increasingly updated to thinking that people doing things they’re not wholeheartedly behind tends to be net bad in all sorts of sideways ways, but the effort would be weaker for your loss. Wherever you end up, I appreciate you having taken the strategy of speaking in public about things that usually aren’t, in a way that has helped clarify the strategic situation for me many times.
(also, it’s scary to see three of the people I’d put in the upper tiers of good communication and understanding where we’re at with AI technically get into this intense conflict. I’m going to be thinking on this some and seeing if anything crystallizes which might help specifically, but in the meantime a few more general-purpose posts that might be useful memes for minimizing unhelpful conflict are A Principled Cartoon Guide to NVC, NVC as Variable Scoping, and Why Control Creates Conflict, and When to Open Instead)
I really don’t think Scott is gaslighting you. I think Scott is being honest here, but you should model him as having somewhat snapped. Pause AI and MIRI-adjacent people on X have been extremely adversarial and have been contributing to very bad discourse (even arguments-wise). I think Scott saw Rob’s post as very strawmannish and needlessly adversarial, and he more or less correctly lumped it in with this rising tide of terribleness, even if MIRI itself is definitely not as guilty. I might well be wrong about the specifics, but Scott Alexander isn’t the kind of person who tends to gaslight.
I think you need to be a lot more deflationary about the g-word. If you think, “But ‘gaslighting’ is something Bad people do; Scott Alexander isn’t Bad, so he would never do that”, well, that might be true depending on what you mean by the g-word. But if the behavior Habryka is trying to point to with the word is more like, “Scott is adopting a self-serving narrative that minimizes wrongdoing by his allies and inflates wrongdoing by his rivals” (which is something someone might do without being Bad due to having “somewhat snapped”), well, why wouldn’t the rivals reach for the g-word in their defense? What is the difference, from their perspective?
“Gaslighting” should probably be avoided because it is anywhere between meaningless and a fighting word depending on who says it and how.
The g-word is a very nasty accusation. It gets thrown around and means a bunch of stuff down to just “saying stuff I disagree with”, but it shouldn’t.
Originally, it referred to a conscious, malicious attempt to drive someone insane by strategically lying to them.
On the substance, people are honest but wrong an awful lot, and honest but massively overstating their case even more often. Assuming your rivals are malicious or dishonest when they’re just wrong or overstating is a huge source of conflict and thereby confusion.
It’s a really useful pointer towards a tactic that is relatively widespread and has no better word. I am personally happy to use other words, but I have the sense that sentences like “I am so very very tired of the ambiguous but ultimately strategic enough attempts at undermining my ability to orient in this situation by denying pretty clearly true parts of reality combined with intense implicit threats of consequences if I indicate I believe the wrong thing that might or might not be conscious optimizations happening in my interlocutors but have enough long-term coherence to be extremely unlikely to be the cause of random misunderstandings” would work that well.
Yeah I would call that “gaslighting”. It looks like my initial interpretation of what you meant by it is closer than Zack’s. I think Scott isn’t doing that. I’m inclined to believe you when you say other people have behaved this way.
In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
I think Scott’s “couple more years” wasn’t referring to a belief that EA could have successfully advocated for a couple of year pause, but rather referring to the change in timeline you’d have gotten if safety-sympathetic people refused to work on stuff that increases the pace of capabilities progress.
Oh, I see. That makes sense, I agree I misunderstood this part to be about something else (though I disagree similarly strongly with the correct interpretation, but it’s still good to clear that up).
trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
Everything makes sense when you meditate on how the line between “cooperation” and “defection” isn’t in the territory; it’s a computed concept that agents in a variable-sum game have every incentive to “disagree” (actually, fight) about.
Consider the Nash demand game. Two players name a number between 0 and 100. If the sum is less than or equal to 100, you get the number you named as a percentage of the pie; if the sum exceeds 100, the pie is destroyed. There’s no unique Nash equilibrium. It’s stable if Player 1 says 50 and Player 2 says 50, but it’s also stable if Player 1 says 35 and Player 2 says 65 (or generally n and 100 − n, respectively).
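To make the “no unique equilibrium” claim concrete, here is a minimal brute-force sketch (my own illustration, not from the thread), assuming integer demands and a pie of size 100; it confirms that every exact split (n, 100 − n) is stable, plus the degenerate profile where both players demand everything:

```python
# Brute-force check of the Nash demand game described above
# (integer demands, pie of size 100).

def payoffs(d1, d2):
    """Each player gets their demand if the demands are compatible; otherwise both get 0."""
    return (d1, d2) if d1 + d2 <= 100 else (0, 0)

def is_nash_equilibrium(d1, d2):
    p1, p2 = payoffs(d1, d2)
    # Neither player may be able to strictly improve by unilaterally changing their demand.
    best_p1 = max(payoffs(a, d2)[0] for a in range(101))
    best_p2 = max(payoffs(d1, b)[1] for b in range(101))
    return p1 >= best_p1 and p2 >= best_p2

equilibria = [(d1, d2) for d1 in range(101) for d2 in range(101)
              if is_nash_equilibrium(d1, d2)]

print((50, 50) in equilibria)  # True
print((35, 65) in equilibria)  # True: any split (n, 100 - n) is stable
print(len(equilibria))         # 102: the 101 exact splits, plus the (100, 100) standoff
```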
The secret is that there are no natural units of pie (or, equivalently, how much pie everyone “deserves”). Everyone thinks that they’re being “cooperative” and that their partners are “defecting”, because they’re counting the pie differently: Player 1 thinks their slice is 35%, but Player 2 thinks the same physical slice is 65%.
If you don’t think your partner is treating you fairly, your leverage is to threaten to destroy surplus unless they treat you better. That’s what Alexander is doing when he says, “I would like to support it with praxis, but right now I feel very conflicted about this”. He’s saying, “You’d better give me a bigger slice, Player 1, or I’ll destroy some of the pie.”
That’s also what your brain is doing when you say you don’t want to work on this anymore. Scott doesn’t want you to quit! (Partially because he values Lightcone’s work, and partially because it would look bad for him if you can publicly blame your burnout on him.) Crucially, your brain knows this. By threatening to quit in frustration, you can probably get Scott to apologize and give your arguments a fairer hearing, whereas in the absence of the threat, he has every incentive to keep being motivatedly dumb from your perspective.
You have a strong hand here! The only risk is if your counterparties don’t think you’d ever actually quit and start calling your bluff. In this case, we know Scott is a pushover and will almost certainly fold. But if you ever face stronger-willed counterparties, you might need to shore up the credibility of your threat: conspicuously going on vacation for a week to think it over will get taken more seriously than an “I don’t know if I want to do this anymore” comment.
(Sorry, maybe you already knew all that, but weren’t articulating it because it’s not part of the game? I don’t think I’m worsening your position that much by saying it out loud; we know that Scott knows this stuff.)
That’s also what your brain is doing when you say you don’t want to work on this anymore. Scott doesn’t want you to quit! (Partially because he values Lightcone’s work, and partially because it would look bad for him if you can publicly blame your burnout on him.) Crucially, your brain knows this.
Man, I really wish this was the case, and it’s a nonzero part of what is going on, but the vast majority of what I am expressing with my (genuine) desire to quit is the stress and frustration associated with the gaslighting, which is one level more abstract than the issue you talk about.
Like yes, there is a threat here being like “for fuck’s sake, stop gaslighting or I am genuinely going to blow up my part of the pie”, but it’s not actually about the object level, and I don’t actually have much of any genuine hope of that working in the same way one might expect from a negotiation tactic.
I am just genuinely actually very tired, and Scott changing his mind on this and going “oh yeah, actually you are right” actually wouldn’t do much to make me want to not quit, because it wouldn’t address the continuous gaslighting, where every time anyone tries to talk about any of the adversarial dynamics, they immediately get told this is all made up and get told, over and over, “I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs” and “everyone is being honest all the time and actually it’s just you who is lying right now and always”.
Yeah, the frustrating part is almost always on a meta level. I think Zack’s point about “No natural units of pie” applies to the gaslighting issue as well though. Asserting one’s viewpoint means asserting it as truth which invalidates differing perspectives. “I disagree, you contradict, he gaslights”.
It’s difficult because sometimes the gas lights really don’t seem to be dimming, and sometimes that perception is downstream of some motivated thinking, because I really don’t want to believe we’re running out of oil already, dammit. And so the result is simultaneously kind of an honest statement of perspective (at least, as honest as these tend to get) while also being a (not-necessarily-consciously) motivated action pushing people to disregard their own senses. And then we have to decide how to judge this mess of bias and honesty, and if we don’t judge such that, after a round trip of perceiving C/D and responding accordingly, we get more C than last time… shit’s fucked. And there are no objective units of pie that people can agree on when judging who was in the wrong.
So like… am I trying to gaslight people into questioning their own sanity so they accept what I want them to accept, or am I just flinching away from what scares me, like we all do? Both, and the question of whether I deserve the leniency and empathy is a difficult one, because what are the units of this pie and where’s the objective cutoff? And because our tolerance for further bullshit tends to diminish after accumulating bullshit, so it gets even more difficult to get back to the other side of criticality.
“It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.”
Theodore Roosevelt, “Citizenship in a Republic,” speech at the Sorbonne, Paris, April 23, 1910
To clarify the claim I’m making: I’m not trying to throw EA under a bus. This thread spun off from a discussion where I said I thought EA’s net impact on AI x-risk was probably positive, but I was highly uncertain.
Somebody asked what the bad components of EA’s impact were, and I went off on Anthropic, and on EA’s (and especially OpenPhil’s) entanglement with the company and their support for Anthropic’s operations. (To the extent that a lot of x-risk-adjacent EA seems to function, in practice, as a talent pipeline for Anthropic.)
I also said that I think OpenPhil’s bet on OpenAI was a disaster. And I said that there’s a culture of caginess, soft-pedaling, and trying-to-sound-reassuringly-mundane that I think has damaged AI risk discourse a fair amount, and that various people in and around OpenPhil have contributed to.
I’m restating this partly to be clear about what my exact claims are. E.g., I’m not claiming that items 1+2+3 are things OpenPhil and Anthropic leadership would happily endorse as stated. I deliberately phrased them in ways that highlight what I see as the flaws in these views and memes, in the hope that this could help wake up some people in and around OpenPhil+Anthropic to the road they’re walking.
This may have been the wrong conversational tack, but my vague sense is that there have been a lot of milder conversations about these topics over the years, and they don’t seem to have produced a serious reckoning, retrospective, or course change of the kind I would have expected.
I hoped it was obvious from the phrasing that 1-3 were attempting to embed the obvious critiques into the view summary, rather than attempting to phrase things in a way that would make the proponent go “Hell yeah, I love that view, what a great view it is!” If this confused anyone, I apologize for that.
I wasn’t centrally thinking of Holden’s public communication in the OP, though I think if he were consistently solid at this, Aysja Johnson wouldn’t have needed to write this in response to Holden’s defense of Anthropic ditching its core safety commitments.
“Dario said there’s a 25% chance ‘things go really, really badly’”
I feel like this is a case in point. Like, sure, counting up from 0 (“the average corporation building the average product doesn’t try to warn the public about their product, except in ways mandated by law!”), Anthropic’s doing great. Or if the baseline is “is Anthropic doing better than pathological liar Sam Altman?”, then sure, Anthropic is doing better than OpenAI on candor.
If we’re instead anchoring to “trying to build a product that massively endangers everyone in the world is an incredibly evil sort of thing to do by default, and to even begin to justify it you need to be doing a truly excellent job of raising the loudest possible alarm bells alongside dozens of other things”, then I don’t think Anthropic is coming close to clearing that bar.
“Things go really, really badly”? Nobody outside the x-risk ecosystem has any idea what that means. And this is not the kind of claim Anthropic or Dario has ever tried to spotlight. You won’t find a big urgent-looking banner on the front page of Anthropic loudly warning the public, in plain terms, about this technology, and asking them to write their congressman about it. You won’t even find it tucked away in a press release somewhere. Dario gave a number when explicitly asked, in an on-stage interview.
If we’re setting the bar at 0, then maybe we want to call this an amazing act of courage, when he could have ducked the question entirely. But why on earth would we set the bar at 0? Is the social embarrassment of talking about AI risk in 2025 so great that we should be amazed when Dario doesn’t totally dodge the topic, while running one of the main companies building the tech?
“Meanwhile, you seem to be treating all these people as basically equivalent to Gary Marcus.”
I think Dario has been more reasonable on this issue than Gary Marcus. I also don’t think “clearing Gary Marcus” is the criterion we should be using to judge the CEO of Anthropic.
“I think this ‘debate’ isn’t about OpenPhil or Anthropic failing to say they’re extremely worried”
Specifically, this debate (from my perspective) isn’t about whether Anthropic or others have ever said anything scary-sounding, if an x-risk person goes digging for cherry-picked quotes to signal-boost. The question is whether the average statement from Anthropic, weighted by how visible Anthropic tries to make that statement, is adequate for informing the uninformed about the insane situation we’re in.
Is the average statement from Dario or Anthropic communicating, “Holy shit, the technology we and our competitors are building has a high chance of killing us all or otherwise devastating the world, on a timescale of years, not decades. This is terrifying, and we urgently call on policymakers and researchers to help find a solution right now”? Or is it communicating, “Mythos is our most aligned model yet! ☺️ Powerful AI could have benefits, but it could have costs too. AI is a big deal, and it could have impacts and pose challenges! We are taking these very seriously! Also, unlike our competitors, Claude will always be ad-free! We’re a normal company talking about the importance of safety and responsibility in this transformative period. ☺️”
If Anthropic’s messaging were awful, but Dario’s personal communications were reliably great, then I’d at least give partial credit. But Dario’s messaging is often even worse than that. Dario has been the AI CEO agitating the earliest and loudest for racing against China. He’s the one who’s been loudest about there being no point in trying to coordinate with China on this issue. “The Adolescence of Technology” opens with a tirade full of strawmen of what seems to be Yudkowsky/Soares’ position (https://x.com/robbensinger/status/2016607060591595924), and per Ryan Greenblatt, the essay sends a super misleading message about whether Anthropic “has things covered” on the technical alignment side (https://x.com/RyanPGreenblatt/status/2016553987861000238):
“Dario strongly implies that Anthropic ‘has this covered’ and wouldn’t be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered[....] I think it’s unhealthy and bad for AI companies to give off a ‘we have this covered and will do a good job’ vibe if they actually believe that even if they were in the lead, risk would be very high. At the very least, I expect many employees at Anthropic working on alignment, safety, and security don’t believe Anthropic has the situation covered.”
I also strongly agree with Ryan re:
“I think it’s important to emphasize the severity of outcomes and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.”
“I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).”
(And I claim that Anthropic leadership has been doing this for years; “The Adolescence of Technology” is not a one-off.)
On podcast interviews, Dario sometimes lets slip an unusually candid and striking statement about how insane and dangerous the situation is, without couching it in caveats about how Everything Is Uncertain and More Evidence Is Needed and It’s Premature For Governments To Do Much About This. Sometimes, he even says it in a way that non-insiders are likely to understand. But when he talks to lawmakers, he says things like:
“However, the abstract and distant nature of long-term risks makes them hard to approach from a policy perspective: our view is that it may be best to approach them indirectly by addressing more imminent risks that serve as practice for them.”
Never mind the merits of “the policy world should totally ignore superintelligence”. Even if you agree with that (IMO extreme and false) claim, there is no justifying calling these risks “long-term”, “abstract”, and “distant” when you have timelines a fraction as aggressive as Dario’s!!
See also Jack Clark’s communication on this issue, and my criticism at the time (https://x.com/robbensinger/status/1834325868032012296). This was in 2024. I don’t think it’s great for Dario to be systematically making the same incredibly misleading elisions two years after this pretty major issue was pointed out to his co-founder.
“It’s about OpenPhil in particular being pretty careful how they phrase things for public consumption. And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems”
I’m not criticizing Anthropic or Open Phil for being “careful how they phrase things”. I’m criticizing them for being careful in exactly the wrong direction. Any communication they send out that sends a “we have things covered, this is business-as-usual, no need to worry” signal is potentially not just factually misleading, but destructive of society’s ability to orient to what’s happening and course-correct. Anthropic is the “Machines of Loving Grace” company; it’s exactly the company that has put way more effort, early and often, into communicating how powerful and cool this technology is, while being consistently nervous and hedged about alerting others to the hazards.
This is exactly the opposite of what “being careful how you phrase things” should look like. Anthropic should have internal processes for catching any tweet that risks implicitly sending a “this is business-as-normal” or “we have everything handled” message, to either filter those out or flag them for evaluation. Sending that kind of message is much more dangerous than any ordinary reputational risk a company faces.
Re ‘MIRI is saying strategy is bad, but if MIRI had been strategic then they might not have started the deep learning revolution’: I think that this just didn’t happen. Per the https://x.com/allTheYud/status/2042362484976468053 thread, I think this is just a myth that propagates because it’s funny. (And because Sam Altman is good at spreading narratives that help him out.)
I don’t think MIRI accelerated timelines on net, and if it did, I don’t think the effect was large. I’d also say that if this happened, it was in spite of one of MIRI’s top obsessions for the last 20+ years being “be ultra cautious around messaging that could shorten AI timelines”.
(Like, as someone who’s been at MIRI for 13 years, this is literally one of the top annoying things constraining everything I’ve written and all the major projects I’ve seen my colleagues work on. Not because we think we’re geniuses sitting on a trove of capabilities insights, but just because we take the responsibility of not-accidentally-contributing-to-the-race extraordinarily seriously.)
But whatever, sure. If you want to accuse MIRI of hypocrisy and say that we’re just as culpable as the AI labs, go for it. You can think MIRI is terrible in every way and also think that the Anthropic cluster is not handling AI risk in a remotely responsible way.
Set aside the years of Anthropic poisoning the commons with its public messaging, poisoning efforts at international coordination by being the top lab preemptively shitting on the possibility of US-China coordination, and poisoning the US government’s ability to orient to what’s happening by selling half-truths and absurd frames to Senate committees.
Even without looking at their broad public communications, and without critiquing what passes for a superintelligence alignment or deployment plan in Anthropic’s public communications, Anthropic has behaved absurdly irresponsibly, lying to the public about their RSP being a binding commitment, lying to their investors re ‘we’re not going to accelerate capabilities progress’, and specifically targeting the most dangerous and difficult-to-control AI capabilities (recursive self-improvement) in a way that may burn years off of the remaining timeline.
“What they haven’t said is ‘the situation is totally hopeless and every strategy except pausing has literally no chance of working’, but that isn’t a comms problem, that’s because they genuinely believe something different from you.”
Just to be clear: nowhere in this thread, or anywhere else, have I asked Anthropic to say something like that. Everything I’ve said above is compatible with thinking that Anthropic has a chance at solving superintelligence alignment. “I think I have a chance at solving superintelligence alignment!” is not an excuse for Anthropic or Dario’s behavior.
“Your claim that ‘governments are incredibly trigger-happy about banning things...there’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI’ is too glib”
I agree it’s too glib as an argument for “international coordination to ban superintelligence is easy”. It isn’t easy. In the context of a conversation where most people are seriously underweighting the possibility, “governments have been known to ban scary or weird tech” and “governments have been known to enact policies that cost them money” are useful correctives, but they should be correctives pointing toward “this seems hard but maybe doable”, not “this seems easy”.
“But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.”
How are we doing that, exactly?
Like, this is one of the most foregrounded claims in Dario’s essay. He repeats a bunch of easily-checked falsehoods about the MIRI argument, at the very start of the essay, while warning that this view’s skepticism about alignment tractability is a “self-fulfilling belief”. He then proceeds to shit on the possibility of the US coordinating with China to avoid building superintelligence, which seems like a much more classic example of “belief that could easily be self-fulfilling”.
What is the mechanism whereby Dario criticizing MIRI is “cooperating” (is it that he didn’t mention us by name, preventing people from fact-checking any of his claims?), and MIRI staff criticizing Dario is “defecting”? What, specifically, is the wrench I’m throwing in Anthropic’s plans by tweeting about this? Is a key researcher on Chris Olah’s team going to get depressed and stop doing interpretability research unless I contribute to the “Anthropic is the Good Guys and OpenAI is the Bad Guys” narrative? Is Anthropic at risk of losing its lead in the race if MIRI people are open about their view that all the labs are behaving atrociously? Should I have dropped in a claim that everyone who disagrees with me is “quasi-religious”, the same way Dario’s cooperative essay begins?
If you think I’m factually mistaken, as you said at the start of your reply, then that makes sense. But surely that would be an equally valid criticism whether I were saying pro-Anthropic stuff or anti-Anthropic stuff. Why this separate “MIRI is defecting” idea?
“I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.”
Yeah. And when MIRI voiced early skepticism of OpenAI in private conversation, we were told that it was crucial to support Sam and Elon’s effort because Demis was untrustworthy. Counting up from zero, OpenAI could be framed as amazing progress: a nonprofit! Run by people vocally alarmed about x-risk! And they’re struggling for cash in the near term (in spite of verbal promises of funding from Musk), which gives us an opportunity to buy seats on the board!
Anthropic may or may not be slightly better than OpenAI. OpenAI may or may not be slightly better than DeepMind. I don’t think the lesson of history is that OpenPhil-cluster people are good at telling the difference between “this is marginally better than what the other guys are doing” and “this is good enough to actually succeed”.
But nothing I’ve said above depends on that claim. You can disagree with me about how likely Anthropic is to save the world, and still think there’s an egregious candor gap between the average Anthropic public statement and the scariest paragraphs buried in “The Adolescence of Technology”, and a further egregious candor gap between “The Adolescence of Technology” and e.g. Ryan Greenblatt’s post or https://x.com/MaskedTorah/status/2040270860846768203.
I don’t think the “circle-the-wagons” approach has served EA well throughout its history, and I don’t think people self-censoring to that degree is good for governments’ or labs’ ability to orient to reality.
Some helpful points, thanks. I responded in more depth on Twitter, but I don’t want to duplicate every conversation there here, so I’m just signposting that people should check the thread there for most of my opinions.
I respect this—all of our options are bad and unlikely to work, the situation is desperate, and I have no plan better than playing a portfolio of all the different desperate hard strategies in the hopes that one of them works.
I used to support such a portfolio approach, but subsequently realized that it’s actually not safe (i.e., is potentially net-negative even aside from opportunity costs), or at least that the portfolio has to be restricted a lot. This is because, due to the existence of illegible AI safety problems, solving some (i.e., more legible) AI safety problems can actually make the overall situation worse, by increasing the chances of an unsafe AI being developed or deployed.
According to this logic, safer strategies include:
Pausing AI, and other actions that help broadly with both legible and illegible problems, like improving societal epistemic health.
Making illegible problems more legible.
Working directly on illegible problems.
Another reason to think that many “AI safety strategies” are actually not safe is that even nominally altruistic humans are more power/status-seeking[1] than actually altruistic, and one way this manifests is that they tend to neglect risks more than they should (if they were actually altruistic). See my Managing risks while trying to do good. BTW these days I think not making this idea more prominent early in rationalism/EA/AI safety is a core failure that is upstream of many other errors.
For the purposes of this argument to work, it’s important that the legible problems are so legible that a lack of solutions would prevent deployment.
When previously asked which problems were in this category, you said:
The most legible problem (in terms of actually gating deployment) is probably wokeness for xAI, and things like not expressing an explicit desire to cause human extinction, not helping with terrorism (like building bioweapons) on demand, etc., for most AI companies
Now, I would actually say that this list overestimates AI companies’ willingness to gate deployment on unsolved problems. There have been many woke versions of Grok, suggesting they weren’t gating deployments on that. I think most current models can be jailbroken into helping with terrorism (they’re just not smart enough to be very helpful yet). It remains to be seen whether companies will hold off on releasing models that could help a lot with terrorism. I’m not so sure they will.
But even if we took this at face value: It doesn’t seem like avoiding work on these mentioned problems would mean restricting the portfolio a lot. When referring to “playing a portfolio of all the different desperate hard strategies in the hopes that one of them works”, I think that’s mostly about solving problems that wouldn’t prevent deployment if they were unsolved, or gathering evidence for such illegible problems. (Centrally: The problem of scheming models taking over the world, which is not one that I expect companies to wait for a solution on absent further evidence that it’s a problem.)
Applying the idea is tricky and context-dependent. For example, gathering evidence for scheming seems unambiguously good, but actually solving scheming could be bad (unless you’re sure that such evidence can’t be gathered, or companies will not gate on this problem regardless), because some time in the future, it may well become legible enough to be gating deployment. (Also keep in mind that it’s not just legibility/gating by the companies, but also by other policymakers such as voters and politicians.)
Given the tradeoffs apparent to me (including that the benefits of solving scheming are limited by other safety problems), I think it may well be an example of a safety problem that is net negative to work on, and something I wouldn’t want to do myself. But I’m unsure how to argue for this convincingly (and also am just not certain enough to want to talk other people out of working on this specifically) which is why I’m only talking about it in response to your comment.
FWIW, on my views, work to prevent scheming looks pretty clearly great. Pausing to wait for a solution to scheming doesn’t seem super likely, and going from [scheming models widely deployed] –> [non-scheming models widely deployed] seems significantly more valuable than going from [non-scheming models widely deployed] –> [temporary pause to solve scheming].
A lot of the listed topics here are problems that we could have plenty of time to work on after the singularity. I’m sympathetic to arguments that bad things might get locked-in, but I don’t really think the arguments for this have a disjunctive nature where we’re very likely to run into at least one type of bad lock-in. There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much. (The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.)
There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much.
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause.
The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.
One issue related to this is that humans today largely want to do good things as a side effect of virtue signaling / status games that they’re doing/playing. This is currently far from optimal, which makes me scared to undergo an AI transition that could potentially lock-in such highly suboptimal motivations/values, and also scared that the AI transition could just scramble or reset these status games and remove what good motivations/values we do have. A pause would preserve the status quo and give people more time to think about such issues (including time for the idea to spread), and potentially find ways to make the AI transition go better in these regards (compared to today when there has been almost no thought on these issues at all).
But see also this recent quick take where I expressed that my optimism about a pause is pretty limited.
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause
If the society is “more capable than us” in some average sense, where we still have certain advantages over them, then I agree that we could still contribute things.
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wiser than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Idk, I don’t particularly want to argue about definitions here. I just think there’s a decent chance that I’ll look back after the singularity and be like “yep, the sloppy transition sure meant that we took on a bunch of ex-ante risk, but since we got lucky, extra pause time wouldn’t have helped vis-a-vis the long-run lock-in issues. Anything they could have done to help is stuff we can do better now.” (And/or: Marginal pause time may have been good or bad via various values or power changes, but it wouldn’t have systematically led to improvements from everyone’s perspective by e.g. enabling additional intellectual work, because it turns out it was fine to defer the relevant intellectual work until later.)
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wiser than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Even for this society, if it’s in the future, part of the transition would have already occurred by the time it exists, so it won’t have the opportunity to make that part go better. So by not pausing now we’d permanently give up this opportunity.
Take the issue in this recent comment, of building an initial AGI that reasons well or poorly about domains that lack fast/cheap feedback signals. It seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first. Suppose we end up building one that reasons poorly about such domains, then:
The post-AGI civilization may end up being less capable (and good) than us on average, or in some important ways.
Even if they’re actually more capable (and good) than us in all the important ways, they could have been even better if only we had built an AGI that reasons well in such domains, but they can’t go back in time and change this.
seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first
I of course agree, but I’d think this would mostly be an issue of capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
Copying over my response to Scott from Twitter (with a few additions in square brackets):
I think my biggest disagreement here is about the concept of strategic communications.
In particular, you claim that MIRI should have been more PR-strategic to avoid hyping AI enough that DeepMind and OpenAI were founded.
Firstly, a lot of this was not-very-MIRI. E.g. contrast Bostrom’s NYT bestseller with Eliezer popularizing AI risk via fanfiction, which is certainly aimed much more at sincere nerds. And I don’t think MIRI planned (or maybe even endorsed?) the Puerto Rico conference.
But secondly, even insofar as MIRI was doing that, creating a lot of hype about AI is also what a bunch of the allegedly PR-strategic people are doing right now! Including stuff like Situational Awareness and AI 2027, as well as Anthropic. [So it’s very odd to explain previous hype as a result of not being strategic enough.]
You could claim that the situation is so different that the optimal strategy has flipped. That’s possible, although I think the current round of hype plausibly exacerbates a US-China race in the same way that the last round exacerbated the within-US race, which would be really bad.
But more plausible to me is the idea that being loud and hype-y is often a kind of self-interested PR strategy which gets you attention and proximity to power without actually making the situation much better, because power is typically going to do extremely dumb stuff in response. And so to me a much better distinction is something like “PR strategies driven by social cognition” (which includes both hyping stuff and also playing clever games about how you think people will interpret you) vs “honest discourse”.
To be clear I don’t have a strong opinion about how much IABIED fits into one category vs the other, seems like a mix. A more central example of the former is Situational Awareness. A more central example of the latter is the Racing to the Precipice paper, which lays out many of the same ideas without the social cognition.
My other big disagreement is about which alignment work will help, and how. Here I have a somewhat odd position of both being relatively optimistic about alignment in general, and also thinking that almost all work in the field is bad. This seems like too big a thing to debate here but maybe the core claim is that there’s some systematic bias which ends up with “alignment researchers” doing stuff that in hindsight was pretty clearly mainly pushing capabilities.
Probably the clearest example is how many alignment researchers worked on WebGPT, the precursor to ChatGPT. If your “alignment research” directly leads to the biggest boost for the AI field maybe ever, you should get suspicious! I have more detailed models of this which I’ll write up later, but suffice to say that we should strongly expect Ilya to fall into similar traps (especially given the form factor of SSI) and probably Jan too. So without defusing this dynamic, a lot of your claimed wins don’t stand up.
More than any other group I’ve been a part of, rationalists love to develop extremely long and complicated social grievances with each other, taking pages and pages of text to articulate. Maybe I’m just too stupid to understand the high level strategic nuances of what’s going on—what are these people even arguing about? The exact flavor of comms presented over the last ten years?
As someone who spends a significant part of his time briefing European policymakers, ministerial advisors, and senior civil servants in AI governance, I want to point out something that is obvious from where I stand, but absent from this discussion.
The “radical transparency vs. strategic communication” debate presupposes that framing is the bottleneck. It isn’t. The bottleneck is volume. Most policymakers have never heard the argument, no matter how you frame it. Among the ones I interact with, maybe 2% have been exposed to the problem enough to have an opinion. Another 10% or so have heard something, but mostly through the Yann LeCun-adjacent dismissals, and formed their view from that. The remaining ~88%, including people in very important AI governance positions, have simply never had the conversation.
The question of which approach works better is real but secondary. What’s missing is more people doing this work at all. It’s a campaign, and the limiting factor is coverage, not the message.
To give a concrete data point: the only policymaker in my circles who has ever brought up “If Anyone Builds It, Everyone Dies” is Lord Tim Clement-Jones, chair of the All-Party Parliamentary Group on AI in the UK. And he was probably already sympathetic. That’s one person.
Among other things, the fact that one of the leading ASI labs is substantially downstream of us. Separately, a lot of real actual politics that tends to happen in the community around prestige and money and talent allocation and respect, which needs to get litigated somehow (and abuse of power and legitimacy is common, and if you can’t talk about it you can’t have norms about it).
I think if your main interaction with PauseAI is a certain Twitter account, as served to you by the algorithm in interactions with your AI safety friends, then you might think that they’re mostly going after other, more moderate safety advocates. But this just isn’t a good picture of the overall actions of the movement. At least in the case of PauseAI UK, whose inner workings I have a decent understanding of, essentially zero time is spent thinking about other AI safety advocates. I expect that the same is true of Yudkowsky and MIRI.
Of course it is the case that being rude towards people working on safety teams at OpenAI on Twitter makes some things worse on some axes. And this is mostly bad and pointless and I don’t endorse it. But that’s not even really what that post from Rob was doing! Rob was writing an opinionated, but civil, criticism. In what way is this “knifing” the other AI safety advocates? It’s not like MIRI killed SB 1047.
Now if Scott means something like “Giving money to MIRI pushes the world in the MIRI-preferred direction, and this would have meant no Anthropic and no safety team at OpenAI” then I can kind of maybe see what he means here. This just isn’t “knifing” in the sense of the betrayal that most people mean by the word. It’s just opposing someone’s plan, in a way that they’ve been doing for years. It’s not like MIRI would have actually used marginal resources to stop Anthropic from being created by, like, sabotage or something.
MIRI don’t even say that working in safety is bad! They only say that they think their approach is better. IABIED specifically states that they think mech interp researchers are “heroes” (as part of an example of research they think won’t work in time without political action).
I think that both of these posts seem very confused about the dynamics of who says or thinks what, and I’m pretty sad about these posts.
Thoughts on Rob’s post
In general, I’ll note that I don’t think Rob really knows many of the OP people; I suspect he has spent <40 hours talking to them about any of this possibly ever. (This is in contrast to e.g. Habryka.) I don’t know where he’s getting his ideas about what the OP people think, but he seems incredibly confused and ignorant. (Eliezer seems similarly ignorant about who believes what.)
I don’t really think this is true
I wish Rob would be clear about who he was referring to. Dario has beliefs that seem to me very different from most people who worked on the 2022 AI misalignment risk efforts at Open Phil. (I’m thinking of people like Holden Karnofsky, Ajeya Cotra, Joe Carlsmith, Lukas Finnveden, Tom Davidson. I’ll refer to this group as “OP AI people”, despite the fact that none of them work at Coefficient Giving, the organization Open Phil renamed itself to.) Maybe Rob is talking about what Alexander Berger thinks?
I think both Dario and Open Phil staff have been reasonably honest about their beliefs about catastrophic misalignment risk publicly. I think that Dario genuinely thinks it’s <5% and the OP AI people generally think it’s higher. (Tbc, I think Dario’s take here is very bad!)
This is a reasonable statement of (a simple version of) the Dario/Jared/Anthropic position, but not the OP AI person position. The OP AI people were worried about AI misalignment and ASI enough to try to think it through in detail starting many years ago!
This is not what the OP people think, e.g. see 1 2 3. It’s a reasonable description of what Dario/Jared say.
This is not what the OP people think. I think it’s somewhat reasonable to accuse Anthropic of this.
I’ve never felt any pressure to play down my concerns from the OP people. For example, I’ve been in a lot of discussions about whether it’s better for MIRI to be more or less powerful or influential. To me, the main argument that it’s bad for MIRI to be more influential isn’t that MIRI is making a mistake by openly saying that risk is high. It’s that MIRI has beliefs about x-risk that are wrong on the merits, which lead them to make unpersuasive arguments and bad recommendations, and they’re in some ways incompetent at communicating.
And I think this is not very representative of what Ant thinks. E.g. they don’t really think of themselves as coordinating with other AI-safety-concerned people.
This is somewhere between “strawman” and “just totally confused as a description of what people believe”
Basically everything else in Rob’s post seems like a strawman.
Overall, I think this post is extremely confused, and Rob should be ashamed of writing such incredibly strawmanned things about what someone else thinks.
I recommend that people place very little trust in claims Rob makes about what other people believe. As someone who knows and talks regularly to the “Open Phil AI people”, I seriously think that Rob has no idea what he’s talking about when he ascribes arguments to them.
I guess there’s the question of what we are supposed to do if, in fact, the OP people agree with Rob’s version of their position but publicly deny that—at that point we’d have to do some brutal adjudication based on confusing private evidence or inferences from public actions and statements. I really don’t think that looking into that evidence would support Rob’s claims.
Thoughts on Scott’s post
I don’t really think of Rob or MIRI as having a comms strategy of undermining EAs. I think Rob and Eliezer just say a bunch of false, wrong things about EAs because they’re mad at them for reasons downstream of the EAs not agreeing with Eliezer as much as Eliezer and Rob think would be reasonable, and a few other things.
Some EAs engage in equivocation and shyness about their beliefs; OP AI people less than many others.
I think Dario (like various other Anthropic people) does not believe that AI takeover is a very plausible outcome, and I think his position is indefensible on the merits, as are some of his other AI positions (e.g. his skepticism that there are substantial returns to intelligence above the human level, his skepticism that ASI could lead to manufacturing capacity doubling each year). He moderately disagrees with the OP people about this.
I don’t totally understand what point Scott is trying to make here, but I think this point is quite unfair.
Agreed
I think Scott is blaming MIRI much too much here. Dario’s main difficulty when arguing that he thinks AI will pose huge catastrophic risk in the next few years is that lots of people think this seems implausible on priors, not because those people were specifically turned off by MIRI making related arguments earlier. His core audience has never heard of MIRI.
I think this is an incorrect read. Some people from PauseAI and MIRI criticize AI safety efforts a lot, often in ways I think are really dumb and counterproductive. But I don’t think they’re doing this as part of a strategy to force people into their strategies; it’s because of some combination of them genuinely (but perhaps foolishly) thinking that the other strategies are bad and/or the people executing them are corrupt.
I disagree in a lot of the claims here about how various aspects of the current situation are good. (E.g. why does he think that Ilya is doing an alignment effort?)
It’s unclear what “you guys” means. I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them. I think Rob is totally unreasonable and I wish he would stop working on AI safety, but I think he’s much worse than e.g. MIRI is overall. I think MIRI spends very little of their support on knifing AI safety advocates, they spend almost all of it on advocating for people being scared about misalignment risk and advocating for AI pauses (which I am generally in favor of). Eliezer totally does have a hobby of saying ridiculously strawmanny stuff about OP AI people, which I find extremely annoying, but I don’t think it’s a big part of his effect on the world.
Overall, both posts seem to have substantially inaccurate pictures of what’s going on and what various actors think.
Thanks for writing this, Buck. I’m not going to try to reply to your whole post, because I think some of it is stuff I should chew on for longer and see whether I agree with it. But going through some of your points:
I definitely apologize for making it sound like I was making a harsher criticism of (the relevant parts of) EA than I intended. My tweet was originally written as a quick follow-up comment to someone who asked why I thought EA’s impact on AI x-risk was only ~55% likely to be positive. I turned it into a top-level tweet because I didn’t want to hide it deep in an existing discussion, but this was an error given I didn’t add extra context.
I also apologize for anything I said that made it sound like I was universally criticizing past or present Open Phil / cG staff (or centrally basing my views on first-hand conversations, for that matter). I already believed that tons of past and present rank-and-file OP/cG staff have very reasonable views, and I happily further update in that direction based on your and Oliver’s statements to that effect (e.g., Ollie’s “I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs”).
I agree that my characterization of “Dario and a cluster of Open-Phil-ish people” was phrased in a needlessly confusing and sloppy way. I wanted to talk about a mix of ‘present-day views that seem to be endorsed by Dario and some other key figures’ and ‘general tendencies and memes that seem pretty widespread and that seem suspiciously related to choices EA leadership made many years ago’, but blurring these together is really unnecessarily confusing. Also, it didn’t help that I was sarcastically embedding my criticisms into my summaries of the views.
Insofar as my broad criticism of EA cultural trends/memes is correct (which I think is substantial), I still feel a fair bit of uncertainty about how to divvy up responsibility between more Open-Phil-ish people, more Oxford-ish people, MIRI / the rats, etc. And of course, some of the problem may stem from broader social-or-demographic factors that no EA leaders tried to engineer, and that even go counter to how leadership has tried to optimize. (I too remember the early speeches themed around “Keep EA Weird”, the early EA-leader conversations fretting about overly naive EA consequentialism, etc.)
Thanks, this is helpful and I basically accept most of what you’re saying. Some more specific comments on the part about me:
I accept this criticism and take back my claim. I noticed that some people who worked for MIRI comms seemed to do this, and I assumed that anything said by enough MIRI comms people in a serious-sounding voice was on some level a MIRI communique. Eliezer has clarified that this isn’t true, so I apologize for saying it was.
I basically agree with this (while wanting to clarify that I think he assigns a pretty high risk to permanent dictatorship or something along those lines) but I think he’s done an okay job of navigating uncertainty, realizing that even a low chance of human extinction is very bad, and being willing to (somewhat) cooperate and collect gains-from-trade with people who are doomier than he is. I see him as living in a consistent worldview next door to our movement’s (sort of like Vitalik or Dean Ball) and I think that, like those two people, he’s potentially somewhere between a friend / an ally-of-convenience / a negotiating partner, potentially convertible into a full ally if future events prove us right, or into a true enemy if we pre-emptively alienate him. Having someone like this in charge of a frontier lab is better than I expected (Demis might also be in this category, but I’m not sure, and worry that Larry and Sergey have final say).
I agree that Dario is slightly being a jerk here, but I think that people have lots of stereotypes of “doomers” which derive from some real behavior of MIRI and PauseAI, and which wouldn’t exist if the median pause AI person was eg the median Constellation person, and I think Dario feels some understandable incentive to distance himself from this.
I have no useful knowledge here, but Ilya seems genuinely alignment-pilled and terrified, the fact that he did the very courageous and self-sacrificing thing of trying to blow up OpenAI to try to get rid of Altman for what were mostly safety-related reasons speaks well of him, and IDK, he’s calling it “safe superintelligence” and saying he won’t release anything at all until he’s sure. I don’t claim any secret expertise in Ilya-ology but overall all of this seems encouraging and I’m surprised this part of my tweet attracted so much dissent.
I mostly accept your criticism that I should narrow my objections from “MIRI & Co” to “Pause.AI, Rob, maybe sort of Eliezer, & a slightly different co”. I don’t really know how to do this or what one word covers all of them without inflicting different forms of collateral damage (I don’t want to say “PauseAIers” because that also covers some people I like, and it feels extra-aggressive to name specific names), but I’m open to suggestion.
I’m generally sympathetic to Scott’s positions in this discussion, but I think he is probably very wrong about Ilya.
To the best of my knowledge, Safe Superintelligence has never published a single word about what they plan to do to move alignment forward, which is pretty damning, in my opinion.
I have not heard of anyone known to be thoughtful about AI safety being hired by SSI, and I have not seen any position being advertised to AI safety people. People should correct me if I missed someone good joining SSI, but I think this is also a very bad sign.
My impression is that people who worked with Ilya at OpenAI don’t remember him as being particularly thoughtful about alignment, e.g. much less so than Jan Leike. This is a low confidence, third-hand impression, people can correct me if I’m wrong.
My impression is that the available evidence suggests that Ilya mostly took part in Altman’s firing for (perhaps justified) office politics grievances, and not primarily due to safety concerns. I also think that evidence points to his behavior during and after the incident being kind of cowardly. (I haven’t looked deeply into the details of the battle of the board, and it’s possible I’m wrong on this point, in which case I apologize to Ilya.) I’m also doubtful of how self-sacrificing his actions were—my best guess is that his current net worth is higher (at least on paper) than it would be if he had stayed at OpenAI.
I expect that at some point SSI’s investors will grow impatient, and then SSI will start coming out with AI products (perhaps open-source to be cooler), just like everyone else. I don’t expect them to contribute too much to safety, though maybe Ilya will sometimes make some noises about the importance of safety in public speeches, which is nice I guess.
I’m pretty confident in my first two points, much less so in the next two, but I felt someone should respond to Scott on this point. Perhaps @Buck or someone else who expressed skepticism of Ilya’s project can add more information.
I think you are overfitting Rob’s post to be about the wrong people. I think it’s much closer to accurate, if you actually read what he says, which is:
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe, which seems well-described by “Dario and a cluster of Open-Phil-ish people”, and furthermore also of course constitutes an enormous fraction of the authority over broader EA.
I feel like almost all of your comment is just running with that misunderstanding and hence mostly irrelevant.
As you say yourself, almost no one in your list works at cG, or is in any meaningful position of authority at cG, so this feels like a bit of an absurd interpretation (I think trying to apply the things he is saying to Holden is reasonable, given Holden’s historical role in cG, and I do think he in the distant past said things much closer to this, but seems to have changed tack sometime in the past few years).
A lot of Rob’s complaints are about things that happened in the past, so I don’t think it’s crazy to interpret him as talking about people who worked at CG in the past.
I think that these people believe different things, and I don’t think Rob’s post particularly accurately describes any of them. For example, the Anthropic leadership doesn’t really think of themselves as trying to coordinate with AI safety people or trying to suppress them. I don’t think Alexander thinks “AI is going to become vastly superhuman in the near future” (and fwiw I don’t think Dario thinks that either, he doesn’t seem to believe in returns to intelligence substantially above human-level).
(sending quickly, I might be wrong)
Fair enough. I think that the people you list also used to believe things closer to what Rob is saying in the past, so at least we need to do a consistent comparison. Holden from 10 years ago seems to say a lot of the things that Rob is saying here, and Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
My guess is that it is worth digging up quotes here, but it’s a lot of work, so I am not going to do it for now, but if it turns out to be cruxy, I can.
(Again, I don’t think these are centrally the people Rob is talking about in either case. I think centrally he is talking about Anthropic, and then secondarily talking about how Open Phil people have related to Anthropic over the years, but I do still think his criticism is correct directionally for those people)
I think Alexander abstractly believes that AI could very well become vastly superhuman in the near future, but, similarly to Dario, does not believe that speculating about such a thing in a non-scientific, non-empirical way is appropriate, and as such they do not have coherent beliefs about this. Indeed, it seems like a really quite central match to what Rob is saying.
I don’t remember anything like this. I think it might be misremembered or a strained interpretation.
Here are points 1 and 3 for reference:
I asked ChatGPT to read bioanchors (where I thought this was most likely to occur), and then to read all of her other writings looking for anything that fits that mode. Here’s its reply, not finding anything.
The closest match it finds is that Ajeya often caveats her claims. For example from bio anchors:
I don’t think this matches points 1 or 3 well.
Huh, I am a bit confused about you summarizing that ChatGPT response that way. Maybe we are talking past each other, but Robby’s statements are not intended as the kind of statement that passes people’s ITT (which IMO is fine, frequently summaries of other people’s views should not pass their ITT, though it should ideally be caveated when this is going on).
Despite that, your ChatGPT transcript says:
I am not expecting any direct endorsements of these statements (which are phrased as to make their internal contradictions most obvious), so this ChatGPT response seems compatible with what I am saying?
When I asked ChatGPT to “rephrase these two beliefs in more neutral language that would make more sense for someone to endorse (but try to pretty tightly imply the above)” it gave these two:
When I asked ChatGPT about this framing, it said:
But also, when we are in the domain of “evaluate whether Ajeya said things that imply the things above and result in other people getting the same vibe as the above”, then ChatGPT and Claude seem like much worse judges, so I think this question becomes more difficult to answer and I wouldn’t super defer to the language models (and is part of why I expected it would take a while to dig up quotes and do the work and stuff).
(If you want to complain that Robby should have caveated his stuff more as not being the kind of thing that passes people’s ITT, then I am happy to argue about that. I think a better post would have done it, but it’s not something I think is always necessary to do.)
(Also just for the sake of completeness, I don’t get this vibe from Ajeya at all these days and have no complaints on this front, besides probably still some strategic disagreement on stuff around point 3, but, like, at the level that I have with many people I respect, almost certainly including you)
When you wrote:
I interpreted you as claiming that Ajeya had said “things more like:”
I don’t recall any examples of Ajeya saying or implying anything at all like that. I asked ChatGPT to try to find examples and I think it didn’t find anything.
In your ChatGPT session, a typical example it cites is:
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.
I have no interest in defending ChatGPT’s claims here, and feel like I caveated that pretty explicitly. I agree that quote is largely irrelevant.
Yep, I agree with you that ChatGPT did not find any clear quotes (though it doesn’t look like ChatGPT tried very hard to find quotes). I disagree that it didn’t find “anything at all like that” (indeed ChatGPT is quite explicit that it found some things “kind of like that”).
I do. As I said, I could go and dig them up, but it would take quite a while, and I am only like 75% confident they are written up as opposed to being in conversations, or private Google Docs, or something else I would have trouble finding. It was a strong vibe I got at the time, and I remember having a few adjacent conversations, either with Ajeya or about Ajeya.
Let me know if you want me to do this. I don’t quite know what’s at stake here for you, and I feel somewhat like we are talking past each other and before I do that it would be more productive to go up some meta-level, but I am not quite sure.
I think you’re right, and also it seems misleading / like a bad clustering to lump “the EAs” in with “Anthropic’s leadership”. I think those groups have some memetic connections, but they’re not the same group!
I feel like it’s more of a reasonable carving to lump in OpenPhil with “the EAs”, since they were/are effectively EA thought-leaders and they exerted a lot of influence, directly and indirectly.
More than 50% of the talent-weighted safety people in EA are literally employees of Anthropic! The ex-CEO of Open Phil now works at Anthropic, and is married to one of its founders. These groups have enormous overlap.
Like, there is such enormous overlap, and the overlap results in such an enormous amount of de-facto deference (being an employee of a company is approximately the strongest common deference relationship we have) that it makes sense to think of these as closely intertwined.
Yes, there are people who attach the EA label themselves who are different here, sometimes even quite substantial clusters. But it’s also IMO clear from Scott’s response that he himself is also majorly deferring and is majorly supportive of Anthropic as a representative of EA, so this clearly isn’t just a split between “everyone who works at Anthropic and everyone who doesn’t”.
Rob used “Open Phil” exactly two times. One time saying “a cluster of Dario and Open-Phil-ish people” and another time “EAs / Open Phil” in reference to the broader community that includes all of these things. These seem like totally reasonable ways of using these pointers and words. I don’t have anything better. It’s definitely not “just Anthropic” as I think Scott very unambiguously demonstrates, and it would be of course extremely confusing to refer to Scott as “Anthropic”.
Imagine re Open Phil and hardcore rationalists: “the ex-CEO of MIRI now works at Open Phil, and the CEO of Lightcone is dating an Open Phil employee. These groups have enormous overlap.”
Yes. People can have a lot of social overlap, yet have very different views from one another, especially in the broader Bay Area intellectual ecosystem. My sense is that Anthropic leadership has very different views from most AI safety EAs.
Why do you think this? I’m skeptical this is true, especially if you’re including non-technical talent.
IDK, I counted them? I made some spreadsheets over the years, and ran this number by a bunch of other people, and my current guess is that it’s around 55%? When I list organizations with full-time employees working in safety I actually end up at substantially above 50% of people working at Anthropic, but I think that’s overcounting.
I think there are differences and overlaps. I think Rob points to a thing that is shared across a cluster that spans both of them, and has historically had a lot of influence.
But aren’t Alexander Berger’s views not very relevant to OpenPhil’s AI strategy decisions from many years ago, when their AI strategy and worldview—which I take to be very close to the things Rob was criticizing—were worked out and started shaping the views of EAs in OpenPhil’s orbit?
Even now, when people criticize things OpenPhil has done in the past in the AI landscape, or criticize their general worldview and takes on AI risk (as it was developed in influential pieces of writing), I am by default automatically viewing it as criticism of Holden, Ajeya Cotra, Tom Davidson, Joe Carlsmith, etc. If people don’t intend me to interpret them that way, please be more clear. 🙂
I’m aware that, separately, OpenPhil/Coefficient Giving has undergone quite a transition and that you clashed badly with Dustin M. I think that’s very sad and unfortunate, but I think of these as quite distinct things, and I never assumed that the thing with Dustin M. had anything to do with OpenPhil’s AI strategy decisions from (say) five years ago (edit: sorry, that sounds like a strawman, but I mean something like “I’m not sure the same cause explains why some people who were at OpenPhil in the past found MIRI epistemically off-putting, and why Dustin M finds the rationalists to be a reputation risk & thinks reputation risks are unusually bad compared to other bad things”). I could be wrong, of course, and maybe you think the org has a general thing of valuing “reputability” and “playing politics” too much. I just want to note that it’s not obvious how much these things are connected/caused by one “OpenPhil culture,” vs being about distinct things. (I think some of these are maybe directionally accurate as criticism, btw.)
I’m sure this is obvious to everyone involved, but I also just want to point out that when a lot of senior people leave, organizations can change really a lot, so it would be weird to speak of OpenPhil/Coefficient Giving now as though it were obviously still the same entity/culture.
I think Holden at the time believed something closer to what Rob says here (though it’s still not an amazing fit), and more generally, I think “the beliefs of the successor CEO” are actually a better proxy for “the vibes of the broader ecosystem you are part of” than “the beliefs of the founder CEO”. I could go into more detail on my beliefs on this, though I think the argument is reasonably intuitive.
Yep, I think they are highly related. Indeed, I was predicting things like the Dustin thing without any knowledge of Dustin’s specific beliefs, and my predictions were primarily downstream of seeing how Anthropic’s position within the ecosystem was changing, and a broader belief-system that I think is shared by many people in leadership, not just Dustin.
I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs, but also updated that those things end up mattering surprisingly little for what actually ends up being a strategic priority.
I think the “OpenPhil culture” thing is a distraction. In my model of the world most of this is downstream of people being into power-seeking strategies mostly from a naive-consequentialist lens, which is not that unique to OpenPhil within EA (and if anything OpenPhil has some of the people with the best antibodies to this, though also a lot of people who think very centrally along these lines, more concentrated among current leadership).
What do you mean by this?
I think some of the people who are best at thinking independently about stuff, and are pretty good at not getting swept up in the power-seeking stuff, work at Open Phil. I think Holden genuinely helped with some of the correct cultural pieces, and my current belief is that if he weren’t under more pressure than just about anyone, he would probably have a relatively sane relationship to Anthropic as a result, though I am not as confident about that as I am that he had a bunch of quite good cultural pieces that help people be less naively power-seeking here.
Honestly, this is such a bad reply by Scott that I… don’t quite know whether I want to work on all of this anymore.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I have issues with Rob’s top-level tweet. I think it gets some things wrong, but it points at a real dynamic. It’s kind of strawman-y about things, and this makes some of Scott’s reaction more understandable, but his response overall seems enormously disproportionate.
Scott’s response is extremely emblematic of what I’ve experienced in the space. Simultaneous extreme insults and obviously bad faith arguments (“actually, it’s your fault that Deepmind was founded because you weren’t careful enough with your comms”), and then gaslighting that no one faces any censure for being open about these things (despite the very thing you are reading being extremely aggro about the lack of strategic communication), and actually we should be happy that Ilya started another ASI lab, and that Jan Leike has some compute budget.
The whole “no you are actually responsible for Deepmind” thing, in a tweet defending that it’s great that all of our resources are going into Anthropic, is just totally absurd. I don’t know what is going on with Scott here, but this is clearly not a high-quality response.
Copying my replies from Twitter, but I am also seriously considering making this my last day. It’s not the kind of decision to be made at 5AM, so who knows, but seriously, fuck this.
IMO this doesn’t seem like the kind of response you will endorse in a few days, especially the “You are responsible for Deepmind/OpenAI” part.
You were also talking about AI close to the same time, and you’ve historically been pretty principled about this kind of stance.
Robby at least has been very consistent on this: he is against most forms of strategic communication in general.
I also think you are against many forms of strategic communication in general? Your writing explores many of the relevant considerations in a lot of depth, and you certainly have not shied away from sharing your opinion on controversial issues, even when it wasn’t super clear how that is going to help things.
I think you are just arguing the wrong side of this specific argument branch. My model of Eliezer, Nate and Robby all have been pretty consistent that being overly strategic in conversation usually backfires. Of course you shouldn’t have no strategy, and my model of Eliezer in particular has in the past been too strategic for my tastes and so might disagree with this, but I am pretty confident Robby himself is just pretty solidly on the side of “it’s good to blurt out what you believe, *especially* if you don’t have any good confident inside view model about how to make things better”.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
This makes this comparison just weird. Yes, according to everyone’s models the only time you might have the political will to stop will be in the future. I have never seen Nate or Eliezer or Robby say that they expect to get a stop tomorrow. But they of course also know that getting in a position to stop takes a long time, and the right time to get started on that work was yesterday.
So if they had their way (with their present selves teleported back in time), we would have more draft treaties and more negotiation between the U.S. and China. More materials ready to hand to congresspeople who are trying to grapple with all of this stuff. Essays and books and movies and videos explaining the AI existential risk case straightforwardly to every audience imaginable.
That is what you could do if you took the 200+ risk-concerned people who ended up instead going to work at Anthropic, or ended up trying to play various inside-game politics things at OpenAI.
And man, I don’t know, but that just seems like a much better world. Maybe you disagree, which is fine, but please don’t create a strawman where Robby or Nate or Eliezer were ever really centrally angling for a short-term pause that would have already passed by then.
And then even beyond that, if you don’t know how to solve a problem, I think it is generally the virtuous thing to help other people get more surface area on solving it. Buying more time is the best way to do that, especially buying time now when the risks are pretty intuitive. I think you believe this too, and I don’t really know what’s going on with your reaction here.
Come on man, a huge number of people we both respect have recently updated that the kind of direct advocacy that MIRI has been doing has been massively under-invested in. I do not think that “other people are executing this portfolio plan admirably”, and this is just such a huge mischaracterization of the dynamics of this situation that I don’t know where to start.
“If Anyone Builds It, Everyone Dies” is a straightforward book. It doesn’t try to sabotage every other strategy in the portfolio, and I have no idea how you could characterize really any of the media appearances of Nate this way.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy and making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where absolutely there have been tons of attempts to sabotage people trying to pursue strategies like this.
Like man, we can maybe argue about the magnitude of the errors here, and the sabotage or whatever, but trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
I am sympathetic to the first of these (but disagree you are characterizing Dario here correctly).
But come on, clearly Ilya sitting on $50 billion for starting another ASI company is not good news for the world. I don’t think you believe that this is actually a real ray of hope.
(And then I also don’t think that Jan Leike having marginally more compute is going to help, but maybe there is a more real disagreement here)
Overall, I am so so so tired of the gaslighting here.
Please don’t quit, Oliver.
Unless you mean “making this my last day [on twitter]”, which might or might not be a good idea.
I don’t think Scott speaks for the ecosystem. He’s just a guy in it, and one who isn’t even that closely connected to Anthropic or Coefficient Giving people. (E.g. you spend >10x as much time talking to people from those orgs as he does.) I think that the people in the ecosystem you’re criticizing would not approve of Scott’s post.
I think this is not a good summary of what Coefficient Giving has done. (I do think it really sucks that they defunded Lightcone.)
I think this is false. I expect Scott’s post to be heavily upvoted, to have an enormously positive agree/disagree ratio if it were posted to the EA Forum, and in general for people to believe something pretty close to it.
There are a few exceptions (somewhat ironically a good chunk of the cG AI-risk people), but they would be relatively sparse. I think this is roughly what someone who is smart, but doesn’t have a strong inside-view take about what they should do about AI-risk believes that they should act like if they want to be a good member of the EA community. My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
The issue is of course not that Scott is right or wrong about what Anthropic or cG people believe. The issue is that he seems to be taking a view where you should be super strategic in your communications, sneer at anyone who is open about things, and measure your success in how many of your friends are now at the levers of power.
I think cG’s funding decisions were really very centrally about trying to punish people who weren’t being strategic in their communications in the way that Dustin wanted them to be.
I think other “all kinds of complicated adversarial shit” has also happened, though it’s harder to point to. At a minimum I will point to the fact that invitation decisions to things like SES have followed similar adversarial “you aren’t cooperating with our strategic communications” principles.
The EA Forum is a trash fire, so who knows what would happen if this was published there.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me (which is ofc part of why I sometimes bother commenting on things like this).
I think that Scott’s post would not overall be received positively by those people. Maybe you’re saying that one of the directions argued for by Scott’s post is approved of by those people? I agree with that more.
Well, I mean, that is a hard conditional to be false, since if people were to not change their mind, this would largely invalidate the premise that they are inclined to defer to you. Unfortunately, I both think the vast majority of places in EA do not defer to you or people like you, and furthermore, I also think you are pretty importantly wrong about your criticisms, so I don’t quite know how to feel about this.
I do think it helps and am marginally happy about your cultural influence here (though it’s tricky, I also think a bunch of your takes here are quite dumb). I think the vast majority of the cultural influence here is downstream of not quite anyone in-particular, but more Anthropic than anywhere else, and neither you nor me can change that very much.
Yeah, I expect it to be straightforwardly positively received. I think people will be like “some parts of this seem dumb, the Ilya thing in-particular, but yeah, fuck those rationalists and MIRI people, I am with Scott on that”.
To be clear, I am not expecting consensus here, I think this will be what 75% of people who have any opinion at all on anything adjacent on this believe, but I expect people would broadly think it’s a good contribution that properly establishes norms and reflects how they think about things.
I also think it’s plausible people would be like “wow, what an uncouth way that both of these people are interfacing with each other, please get away from each other, children”, but then actually if you talked to them afterwards, they would be like “yeah, I mean, that was a bit of a shitshow but I do think Scott was basically right here (minus 1-2 minor things)”.
I am not enormously confident on this, but it matches my experiences of the space.
In case it matters to either of you, my guesses:
I agree with Habryka that absent criticism Scott’s post would be well received by an important group of people reasonably characterized as EA-ish AI safety people.
Imo absent criticism Rob’s post would be well received by a different group of people reasonably characterized as doomers. (Literally right before seeing this thread I saw another post on LW that is directionally correct but is mostly wrong or exaggerated in its details, and that was very well received.)
Both posts are broadly wrong about lots of things, about equally so, such that most people would be better off having never encountered either of them.
Tbc, my first-order intuitive impression is that Scott’s post is much more directionally accurate. But I expect that is because I constantly experience people knifing me, pushing me to take strategies that systematically destroy my ability to do anything while gaining approximately no safety benefit, or making claims about members of groups that include me that are false of me, whereas I don’t really experience any of the stuff that Rob gestures at, even though I expect it exists. Though Rob’s post doesn’t actually inform me of it, because his actual claims are false, and I cannot infer the underlying experiences that led him to make them. Another example of trapped priors if you don’t have second order corrections. (Tbc his follow-up post makes this substantially clearer.)
You probably already know I think this, but imo you should both quit on making public discourse in the AI safety community non-insane, and do other things that have a shot at working. (Since I know this will be misinterpreted by other readers, let me be clear that there are plenty of other kinds of public writing that do not fall in that bucket which I do think are worth doing.)
I endorse you taking the space to figure out how you want to relate and doing what’s right for you, I’ve increasingly updated to thinking that people doing things they’re not wholeheartedly behind tends to be net bad in all sorts of sideways ways, but the effort would be weaker for your loss. Wherever you end up, I appreciate you having taken the strategy of speaking in public about things that usually aren’t in a way that helped clarify the strategic situation for me many times.
(also, it’s scary to see three of the people I’d put in the upper tiers of good communication and understanding where we’re at with AI technically get into this intense conflict. I’m going to be thinking on this some and seeing if anything crystalizes which might help specifically, but in the meantime a few more general-purpose posts that might be useful memes for minimizing unhelpful conflict are A Principled Cartoon Guide to NVC, NVC as Variable Scoping, and Why Control Creates Conflict, and When to Open Instead)
I really don’t think Scott is gaslighting you. I think Scott is being honest here, but you should model him as having somewhat snapped. Pause AI and MIRI-adjacent people on X have been extremely adversarial and have been contributing to very bad discourse (even arguments-wise). I think Scott saw Rob’s post as very strawmannish and needlessly adversarial, and he more or less correctly lumped it in with this rising tide of terribleness, even if MIRI itself is definitely not as guilty. I might well be wrong about the specifics, but Scott Alexander isn’t the kind of person who tends to gaslight.
I think you need to be a lot more deflationary about the g-word. If you think, “But ‘gaslighting’ is something Bad people do; Scott Alexander isn’t Bad, so he would never do that”, well, that might be true depending on what you mean by the g-word. But if the behavior Habryka is trying to point to with the word is more like, “Scott is adopting a self-serving narrative that minimizes wrongdoing by his allies and inflates wrongdoing by his rivals” (which is something someone might do without being Bad due to having “somewhat snapped”), well, why wouldn’t the rivals reach for the g-word in their defense? What is the difference, from their perspective?
“Gaslighting” should probably be avoided because it is anywhere between meaningless and a fighting word depending on who says it and how.
The g-word is a very nasty accusation. It gets thrown around and means a bunch of stuff down to just “saying stuff I disagree with”, but it shouldn’t.
It originally referred to a conscious, malicious attempt to drive someone insane by strategically lying to them.
On the substance, people are honest but wrong an awful lot, and honest but massively overstating their case even more often. Assuming your rivals are malicious or dishonest when they’re just wrong or overstating is a huge source of conflict and thereby confusion.
It’s a really useful pointer towards a tactic that is relatively widespread and has no better word. I am personally happy to use other words, but I have the sense that sentences like “I am so very very tired of the ambiguous but ultimately strategic enough attempts at undermining my ability to orient in this situation by denying pretty clearly true parts of reality combined with intense implicit threats of consequences if I indicate I believe the wrong thing that might or might not be conscious optimizations happening in my interlocutors but have enough long-term coherence to be extremely unlikely to be the cause of random misunderstandings” would work that well.
Yeah I would call that “gaslighting”. It looks like my initial interpretation of what you meant by it is closer than Zack’s. I think Scott isn’t doing that. I’m inclined to believe you when you say other people have behaved this way.
Locally trying to clear up one misunderstanding.
I think Scott’s “couple more years” wasn’t referring to a belief that EA could have successfully advocated for a couple-year pause, but rather to the change in timeline you’d have gotten if safety-sympathetic people had refused to work on stuff that increases the pace of capabilities progress.
Oh, I see. That makes sense, I agree I misunderstood this part to be about something else (though I disagree similarly strongly with the correct interpretation, but it’s still good to clear that up).
Everything makes sense when you meditate on how the line between “cooperation” and “defection” isn’t in the territory; it’s a computed concept that agents in a variable-sum game have every incentive to “disagree” (actually, fight) about.
Consider the Nash demand game. Two players name a number between 0 and 100. If the sum is less than or equal to 100, you get the number you named as a percentage of the pie; if the sum exceeds 100, the pie is destroyed. There’s no unique Nash equilibrium. It’s stable if Player 1 says 50 and Player 2 says 50, but it’s also stable if Player 1 says 35 and Player 2 says 65 (or generally n and 100 − n, respectively).
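As a quick illustration (a minimal sketch in Python, not part of the original exchange; the function names are mine), you can brute-force the best-response check and confirm that every pair of demands summing to exactly 100 is stable, lopsided splits included:

```python
# Nash demand game: each player demands an integer share of a 100-unit pie.
# If the demands sum to at most 100, each player gets what they demanded;
# otherwise the pie is destroyed and both get nothing.

def payoff(my_demand: int, their_demand: int) -> int:
    """Payoff to a player who demands `my_demand` against `their_demand`."""
    return my_demand if my_demand + their_demand <= 100 else 0

def is_best_response(my_demand: int, their_demand: int) -> bool:
    """True if no alternative demand does strictly better against `their_demand`."""
    current = payoff(my_demand, their_demand)
    return all(payoff(d, their_demand) <= current for d in range(101))

def is_nash_equilibrium(d1: int, d2: int) -> bool:
    return is_best_response(d1, d2) and is_best_response(d2, d1)

print(is_nash_equilibrium(50, 50))  # True: the "fair" split is stable
print(is_nash_equilibrium(35, 65))  # True: so is this lopsided split
print(is_nash_equilibrium(40, 50))  # False: player 1 could demand 50 instead
```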
The secret is that there are no natural units of pie (or, equivalently, how much pie everyone “deserves”). Everyone thinks that they’re being “cooperative” and that their partners are “defecting”, because they’re counting the pie differently: Player 1 thinks their slice is 35%, but Player 2 thinks the same physical slice is 65%.
If you don’t think your partner is treating you fairly, your leverage is to threaten to destroy surplus unless they treat you better. That’s what Alexander is doing when he says, “I would like to support it with praxis, but right now I feel very conflicted about this”. He’s saying, “You’d better give me a bigger slice, Player 1, or I’ll destroy some of the pie.”
That’s also what your brain is doing when you say you don’t want to work on this anymore. Scott doesn’t want you to quit! (Partially because he values Lightcone’s work, and partially because it would look bad for him if you can publicly blame your burnout on him.) Crucially, your brain knows this. By threatening to quit in frustration, you can probably get Scott to apologize and give your arguments a fairer hearing, whereas in the absence of the threat, he has every incentive to keep being motivatedly dumb from your perspective.
You have a strong hand here! The only risk is if your counterparties don’t think you’d ever actually quit and start calling your bluff. In this case, we know Scott is a pushover and will almost certainly fold. But if you ever face stronger-willed counterparties, you might need to shore up the credibility of your threat: conspicuously going on vacation for a week to think it over will get taken more seriously than an “I don’t know if I want to do this anymore” comment.
(Sorry, maybe you already knew all that, but weren’t articulating it because it’s not part of the game? I don’t think I’m worsening your position that much by saying it out loud; we know that Scott knows this stuff.)
Man, I really wish this was the case, and it’s non-zero of what is going on, but the vast majority of what I am expressing with my (genuine) desire to quit is the stress and frustration associated with the gaslighting, which is one level more abstract than the issue you talk about.
Like yes, there is a threat here being like “for fuck’s sake, stop gaslighting or I am genuinely going to blow up my part of the pie”, but it’s not actually about the object level, and I don’t actually have much of any genuine hope of that working in the same way one might expect from a negotiation tactic.
I am just genuinely actually very tired, and Scott changing his mind on this and going “oh yeah, actually you are right” actually wouldn’t do much to make me want to not quit, because it wouldn’t address the continuous gaslighting where every time anyone tries to talk about any of the adversarial dynamics, they immediately get told this is all made up and repeatedly get told “I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs” and “everyone is being honest all the time and actually it’s just you who is lying right now and always”.
Yeah, the frustrating part is almost always on a meta level. I think Zack’s point about “No natural units of pie” applies to the gaslighting issue as well though. Asserting one’s viewpoint means asserting it as truth which invalidates differing perspectives. “I disagree, you contradict, he gaslights”.
It’s difficult because sometimes the gas lights really don’t seem to be dimming, and sometimes that perception is downstream of some motivated thinking because I really don’t want to believe we’re running out of oil already, dammit. And so the result is simultaneously kinda an honest statement of perspective (at least, as honest as these tend to get) while also being a (not-necessarily-consciously) motivated action pushing people to disregard their own senses. And then we have to decide how to judge this mess of bias and honesty, and if we don’t judge such that the product after a round trip of perceiving C/D and responding accordingly we get more C than last time… shit’s fucked. And without objective units of pie that people can agree on when judging who was in the wrong.
So like… am I trying to gaslight people into questioning their own sanity so they accept what I want them to accept, or am I just flinching away from what scares me, like we all do? Both, and the question of whether I deserve the leniency and empathy is a difficult one, because what are the units of this pie and where’s the objective cutoff? And because our tolerance for further bullshit tends to diminish after accumulating bullshit, so it gets even more difficult to get back to the other side of criticality.
“It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.”
Theodore Roosevelt, “Citizenship in a Republic,” speech at the Sorbonne, Paris, April 23, 1910
I wrote a reply to Scott on Twitter, before seeing the discussion here; I think it’s a lot clearer than my original (IMO sloppy) tweet.
I’ve copied the reply below; see also my reply to Buck.
_____________________________________________________
To clarify the claim I’m making: I’m not trying to throw EA under a bus. This thread spun off from a discussion where I said I thought EA’s net impact on AI x-risk was probably positive, but I was highly uncertain.
Somebody asked what the bad components of EA’s impact were, and I went off on Anthropic, and on EA’s (and especially OpenPhil’s) entanglement with the company and their support for Anthropic’s operations. (To the extent that a lot of x-risk-adjacent EA seems to function, in practice, as a talent pipeline for Anthropic.)
I also said that I think OpenPhil’s bet on OpenAI was a disaster. And I said that there’s a culture of caginess, soft-pedaling, and trying-to-sound-reassuringly-mundane that I think has damaged AI risk discourse a fair amount, and that various people in and around OpenPhil have contributed to.
I’m restating this partly to be clear about what my exact claims are. E.g., I’m not claiming that items 1+2+3 are things OpenPhil and Anthropic leadership would happily endorse as stated. I deliberately phrased them in ways that highlight what I see as the flaws in these views and memes, in the hope that this could help wake up some people in and around OpenPhil+Anthropic to the road they’re walking.
This may have been the wrong conversational tack, but my vague sense is that there have been a lot of milder conversations about these topics over the years, and they don’t seem to have produced a serious reckoning, retrospective, or course change of the kind I would have expected.
I hoped it was obvious from the phrasing that 1-3 were attempting to embed the obvious critiques into the view summary, rather than attempting to phrase things in a way that would make the proponent go “Hell yeah, I love that view, what a great view it is!” If this confused anyone, I apologize for that.
I wasn’t centrally thinking of Holden’s public communication in the OP, though I think if he were consistently solid at this, Aysja Johnson wouldn’t have needed to write this in response to Holden’s defense of Anthropic ditching its core safety commitments.
I feel like this is a case in point. Like, sure, counting up from 0 (“the average corporation building the average product doesn’t try to warn the public about their product, except in ways mandated by law!”), Anthropic’s doing great. Or if the baseline is “is Anthropic doing better than pathological liar Sam Altman?”, then sure, Anthropic is doing better than OpenAI on candor.
If we’re instead anchoring to “trying to build a product that massively endangers everyone in the world is an incredibly evil sort of thing to do by default, and to even begin to justify it you need to be doing a truly excellent job of raising the loudest possible alarm bells alongside dozens of other things”, then I don’t think Anthropic is coming close to clearing that bar.
“Things go really, really badly”? Nobody outside the x-risk ecosystem has any idea what that means. And this is not the kind of claim Anthropic or Dario has ever tried to spotlight. You won’t find a big urgent-looking banner on the front page of Anthropic loudly warning the public, in plain terms, about this technology, and asking them to write their congressman about it. You won’t even find it tucked away in a press release somewhere. Dario gave a number when explicitly asked, in an on-stage interview.
If we’re setting the bar at 0, then maybe we want to call this an amazing act of courage, when he could have ducked the question entirely. But why on earth would we set the bar at 0? Is the social embarrassment of talking about AI risk in 2025 so great that we should be amazed when Dario doesn’t totally dodge the topic, while running one of the main companies building the tech?
I think Dario has been more reasonable on this issue than Gary Marcus. I also don’t think “clearing the Gary Marcus bar” is the criterion we should be using to judge the CEO of Anthropic.
Specifically, this debate (from my perspective) isn’t about whether Anthropic or others have ever said anything scary-sounding, if an x-risk person goes digging for cherry-picked quotes to signal-boost. The question is whether the average statement from Anthropic, weighted by how visible Anthropic tries to make that statement, is adequate for informing the uninformed about the insane situation we’re in.
Is the average statement from Dario or Anthropic communicating, “Holy shit, the technology we and our competitors are building has a high chance of killing us all or otherwise devastating the world, on a timescale of years, not decades. This is terrifying, and we urgently call on policymakers and researchers to help find a solution right now”? Or is it communicating, “Mythos is our most aligned model yet! ☺️ Powerful AI could have benefits, but it could have costs too. AI is a big deal, and it could have impacts and pose challenges! We are taking these very seriously! Also, unlike our competitors, Claude will always be ad-free! We’re a normal company talking about the importance of safety and responsibility in this transformative period. ☺️”
(Case in point: https://x.com/HumanHarlan/status/2031981447377273273)
If Anthropic’s messaging were awful, but Dario’s personal communications were reliably great, then I’d at least give partial credit. But Dario’s messaging is often even worse than that. Dario has been the AI CEO agitating the earliest and loudest for racing against China. He’s the one who’s been loudest about there being no point in trying to coordinate with China on this issue. “The Adolescence of Technology” opens with a tirade full of strawmen of what seems to be Yudkowsky/Soares’ position (https://x.com/robbensinger/status/2016607060591595924), and per Ryan Greenblatt, the essay sends a super misleading message about whether Anthropic “has things covered” on the technical alignment side (https://x.com/RyanPGreenblatt/status/2016553987861000238):
I also strongly agree with Ryan re:
“I think it’s important to emphasize the severity of outcomes and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.”
“I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).”
(And I claim that Anthropic leadership has been doing this for years; “The Adolescence of Technology” is not a one-off.)
In podcast interviews, Dario sometimes lets slip an unusually candid and striking statement about how insane and dangerous the situation is, without couching it in caveats about how Everything Is Uncertain and More Evidence Is Needed and It’s Premature For Governments To Do Much About This. Sometimes, he even says it in a way that non-insiders are likely to understand. But when he talks to lawmakers, he says things like:
Never mind the merits of “the policy world should totally ignore superintelligence”. Even if you agree with that (IMO extreme and false) claim, there is no justifying calling these risks “long-term”, “abstract”, and “distant” when your timelines are even a fraction as aggressive as Dario’s!!
See also Jack Clark’s communication on this issue, and my criticism at the time (https://x.com/robbensinger/status/1834325868032012296). This was in 2024. I don’t think it’s great for Dario to be systematically making the same incredibly misleading elisions two years after this pretty major issue was pointed out to his co-founder.
I’m not criticizing Anthropic or Open Phil for being “careful how they phrase things”. I’m criticizing them for being careful in exactly the wrong direction. Any communication they send out that sends a “we have things covered, this is business-as-usual, no need to worry” signal is potentially not just factually misleading, but destructive of society’s ability to orient to what’s happening and course-correct. Anthropic is the “Machines of Loving Grace” company; it’s exactly the company that has put way more effort, early and often, into communicating how powerful and cool this technology is, while being consistently nervous and hedged about alerting others to the hazards.
This is exactly the opposite of what “being careful how you phrase things” should look like. Anthropic should have internal processes for catching any tweet that risks implicitly sending a “this is business-as-normal” or “we have everything handled” message, to either filter those out or flag them for evaluation. Sending that kind of message is much more dangerous than any ordinary reputational risk a company faces.
Re ‘MIRI is saying strategy is bad, but if MIRI had been strategic then they might not have started the deep learning revolution’: I think that this just didn’t happen. Per the https://x.com/allTheYud/status/2042362484976468053 thread, I think this is just a myth that propagates because it’s funny. (And because Sam Altman is good at spreading narratives that help him out.)
I don’t think MIRI accelerated timelines on net, and if it did, I don’t think the effect was large. I’d also say that if this happened, it was in spite of one of MIRI’s top obsessions for the last 20+ years being “be ultra cautious around messaging that could shorten AI timelines”.
(Like, as someone who’s been at MIRI for 13 years, this is literally one of the top annoying things constraining everything I’ve written and all the major projects I’ve seen my colleagues work on. Not because we think we’re geniuses sitting on a trove of capabilities insights, but just because we take the responsibility of not-accidentally-contributing-to-the-race extraordinarily seriously.)
But whatever, sure. If you want to accuse MIRI of hypocrisy and say that we’re just as culpable as the AI labs, go for it. You can think MIRI is terrible in every way and also think that the Anthropic cluster is not handling AI risk in a remotely responsible way.
Set aside the years of Anthropic poisoning the commons with its public messaging, poisoning efforts at international coordination by being the top lab preemptively shitting on the possibility of US-China coordination, and poisoning the US government’s ability to orient to what’s happening by selling half-truths and absurd frames to Senate committees.
Even without looking at their broad public communications, and without critiquing what passes for a superintelligence alignment or deployment plan in Anthropic’s public communications, Anthropic has behaved absurdly irresponsibly, lying to the public about their RSP being a binding commitment, lying to their investors re ‘we’re not going to accelerate capabilities progress’, and specifically targeting the most dangerous and difficult-to-control AI capabilities (recursive self-improvement) in a way that may burn years off of the remaining timeline.
Just to be clear: nowhere in this thread, or anywhere else, have I asked Anthropic to say something like that. Everything I’ve said above is compatible with thinking that Anthropic has a chance at solving superintelligence alignment. “I think I have a chance at solving superintelligence alignment!” is not an excuse for Anthropic or Dario’s behavior.
I agree it’s too glib as an argument for “international coordination to ban superintelligence is easy”. It isn’t easy. In the context of a conversation where most people are seriously underweighting the possibility, “governments have been known to ban scary or weird tech” and “governments have been known to enact policies that cost them money” are useful correctives, but they should be correctives pointing toward “this seems hard but maybe doable”, not “this seems easy”.
How are we doing that, exactly?
Like, this is one of the most foregrounded claims in Dario’s essay. He repeats a bunch of easily-checked falsehoods about the MIRI argument, at the very start of the essay, while warning that this view’s skepticism about alignment tractability is a “self-fulfilling belief”. He then proceeds to shit on the possibility of the US coordinating with China to avoid building superintelligence, which seems like a much more classic example of “belief that could easily be self-fulfilling”.
What is the mechanism whereby Dario criticizing MIRI is “cooperating” (is it that he didn’t mention us by name, preventing people from fact-checking any of his claims?), and MIRI staff criticizing Dario is “defecting”? What, specifically, is the wrench I’m throwing in Anthropic’s plans by tweeting about this? Is a key researcher on Chris Olah’s team going to get depressed and stop doing interpretability research unless I contribute to the “Anthropic is the Good Guys and OpenAI is the Bad Guys” narrative? Is Anthropic at risk of losing its lead in the race if MIRI people are open about their view that all the labs are behaving atrociously? Should I have dropped in a claim that everyone who disagrees with me is “quasi-religious”, the same way Dario’s cooperative essay begins?
If you think I’m factually mistaken, as you said at the start of your reply, then that makes sense. But surely that would be an equally valid criticism whether I were saying pro-Anthropic stuff or anti-Anthropic stuff. Why this separate “MIRI is defecting” idea?
Yeah. And when MIRI voiced early skepticism of OpenAI in private conversation, we were told that it was crucial to support Sam and Elon’s effort because Demis was untrustworthy. Counting up from zero, OpenAI could be framed as amazing progress: a nonprofit! Run by people vocally alarmed about x-risk! And they’re struggling for cash in the near term (in spite of verbal promises of funding from Musk), which gives us an opportunity to buy seats on the board!
Anthropic may or may not be slightly better than OpenAI. OpenAI may or may not be slightly better than DeepMind. I don’t think the lesson of history is that OpenPhil-cluster people are good at telling the difference between “this is marginally better than what the other guys are doing” and “this is good enough to actually succeed”.
But nothing I’ve said above depends on that claim. You can disagree with me about how likely Anthropic is to save the world, and still think there’s an egregious candor gap between the average Anthropic public statement and the scariest paragraphs buried in “The Adolescence of Technology”, and a further egregious candor gap between “The Adolescence of Technology” and e.g. Ryan Greenblatt’s post or https://x.com/MaskedTorah/status/2040270860846768203.
I don’t think the “circle-the-wagon” approach has served EA well throughout its history, and I don’t think people self-censoring to that degree is good for governments’ or labs’ ability to orient to reality.
Some helpful points, thanks. I responded in more depth on Twitter, but I don’t want to duplicate every conversation there here, so I’m just signposting that people should check the thread there for most of my opinions.
I used to support such a portfolio approach, but I subsequently realized that it’s actually not safe (i.e., it’s potentially net-negative even aside from opportunity costs), or that the portfolio has to be restricted a lot. This is because, due to the existence of illegible AI safety problems, solving some of the (more legible) AI safety problems can actually make the overall situation worse, by increasing the chances of an unsafe AI being developed or deployed.
According to this logic, safer strategies include:
Pausing AI, and other actions that help broadly with both legible and illegible problems, like improving societal epistemic health.
Making illegible problems more legible.
Working directly on illegible problems.
Another reason to think that many “AI safety strategies” are actually not safe is that even nominally altruistic humans are more power/status-seeking[1] than actually altruistic, and one way this manifests is that they tend to neglect risks more than they should (more than they would if they were actually altruistic). See my Managing risks while trying to do good. BTW, these days I think that failing to make this idea more prominent early on in rationalism/EA/AI safety was a core failure upstream of many other errors.
I have an old post about power/status being fundamental to human motivation, which I remember @Scott Alexander liked.
For this argument to work, it’s important that the legible problems are so legible that a lack of solutions would prevent deployment.
When previously asked which problems were in this category, you said:
Now, I would actually say that this list overestimates AI companies’ willingness to gate deployment on unsolved problems. There have been many woke versions of Grok, suggesting they weren’t gating deployments on that. I think most current models can be jailbroken into helping with terrorism (they’re just not smart enough to be very helpful yet). It remains to be seen whether companies will hold off on releasing models that could help a lot with terrorism; I’m not so sure they will.
But even if we took this at face value: it doesn’t seem like avoiding work on these mentioned problems would mean restricting the portfolio a lot. When referring to “playing a portfolio of all the different desperate hard strategies in the hopes that one of them works”, I think that’s mostly about solving problems that wouldn’t prevent deployment if they were unsolved, or gathering evidence for such illegible problems. (Centrally: the problem of scheming models taking over the world, which is not one that I expect companies to wait for a solution on absent further evidence that it’s a problem.)
Applying the idea is tricky and context-dependent. For example, gathering evidence for scheming seems unambiguously good, but actually solving scheming could be bad (unless you’re sure that such evidence can’t be gathered, or that companies will not gate on this problem regardless), because some time in the future it may well become legible enough to gate deployment. (Also keep in mind that it’s not just legibility/gating by the companies that matters, but also by other decision-makers such as voters and politicians.)
Given the tradeoffs apparent to me (including that the benefits of solving scheming are limited by other safety problems), I think it may well be an example of a safety problem that is net negative to work on, and something I wouldn’t want to do myself. But I’m unsure how to argue for this convincingly (and am also just not certain enough to want to talk other people out of working on this specifically), which is why I’m only talking about it in response to your comment.
Gotcha.
FWIW, on my views, work to prevent scheming looks pretty clearly great. Pausing to wait for a solution to scheming doesn’t seem super likely, and going from [scheming models widely deployed] –> [non-scheming models widely deployed] seems significantly more valuable than going from [non-scheming models widely deployed] –> [temporary pause to solve scheming].
A lot of the listed topics here are problems that we could have plenty of time to work on after the singularity. I’m sympathetic to arguments that bad things might get locked-in, but I don’t really think the arguments for this have a disjunctive nature where we’re very likely to run into at least one type of bad lock-in. There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much. (The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.)
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause.
One issue related to this is that humans today largely want to do good things as a side effect of the virtue signaling / status games that they’re doing/playing. This is currently far from optimal, which makes me scared to undergo an AI transition that could potentially lock in such highly suboptimal motivations/values, and also scared that the AI transition could just scramble or reset these status games and remove what good motivations/values we do have. A pause would preserve the status quo and give people more time to think about such issues (including time for the idea to spread), and potentially find ways to make the AI transition go better in these regards (compared to today, when there has been almost no thought on these issues at all).
But see also this recent quick take where I expressed that my optimism about a pause is pretty limited.
If the society is “more capable than us” in some average sense, where we still have certain advantages over them, then I agree that we could still contribute things.
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wise than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Idk, I don’t particularly want to argue about definitions here. I just think there’s a decent chance that I’ll look back after the singularity and be like “yep, the sloppy transition sure meant that we took on a bunch of ex-ante risk, but since we got lucky, extra pause time wouldn’t have helped vis-a-vis the long-run lock-in issues. Anything they could have done to help is stuff we can do better now.” (And/or: marginal pause time may have been good or bad via various value or power changes, but it wouldn’t have systematically led to improvements from everyone’s perspective by e.g. enabling additional intellectual work, because it turns out it was fine to defer the relevant intellectual work until later.)
Even for this society, since it’s in the future, part of the transition will already have occurred, so they won’t have the opportunity to make that part go better. So by not pausing now, we’d permanently give up this opportunity.
Take the issue in this recent comment, of building an initial AGI that reasons well or poorly about domains that lack fast/cheap feedback signals. It seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first. Suppose we end up building one that reasons poorly about such domains, then:
The post-AGI civilization may end up being less capable (and good) than us on average, or in some important ways.
Even if they’re actually more capable (and good) than us in all the important ways, they could have been even better if only we had built an AGI that reasons well in such domains, but they can’t go back in time and change this.
I of course agree, but I’d think this would mostly be an issue of the capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
Copying over my response to Scott from Twitter (with a few additions in square brackets):
I think my biggest disagreement here is about the concept of strategic communications.
In particular, you claim that MIRI should have been more PR-strategic to avoid hyping AI enough that DeepMind and OpenAI were founded.
Firstly, a lot of this was not-very-MIRI. E.g. contrast Bostrom’s NYT bestseller with Eliezer popularizing AI risk via fanfiction, which is certainly aimed much more at sincere nerds. And I don’t think MIRI planned (or maybe even endorsed?) the Puerto Rico conference.
But secondly, even insofar as MIRI was doing that, creating a lot of hype about AI is also what a bunch of the allegedly PR-strategic people are doing right now! Including stuff like Situational Awareness and AI 2027, as well as Anthropic. [So it’s very odd to explain previous hype as a result of not being strategic enough.]
You could claim that the situation is so different that the optimal strategy has flipped. That’s possible, although I think the current round of hype plausibly exacerbates a US-China race in the same way that the last round exacerbated the within-US race, which would be really bad.
But more plausible to me is the idea that being loud and hype-y is often a kind of self-interested PR strategy which gets you attention and proximity to power without actually making the situation much better, because power is typically going to do extremely dumb stuff in response. And so to me a much better distinction is something like “PR strategies driven by social cognition” (which includes both hyping stuff and also playing clever games about how you think people will interpret you) vs “honest discourse”.
To be clear I don’t have a strong opinion about how much IABIED fits into one category vs the other, seems like a mix. A more central example of the former is Situational Awareness. A more central example of the latter is the Racing to the Precipice paper, which lays out many of the same ideas without the social cognition.
My other big disagreement is about which alignment work will help, and how. Here I have a somewhat odd position of both being relatively optimistic about alignment in general, and also thinking that almost all work in the field is bad. This seems like too big a thing to debate here but maybe the core claim is that there’s some systematic bias which ends up with “alignment researchers” doing stuff that in hindsight was pretty clearly mainly pushing capabilities.
Probably the clearest example is how many alignment researchers worked on WebGPT, the precursor to ChatGPT. If your “alignment research” directly leads to the biggest boost for the AI field maybe ever, you should get suspicious! I have more detailed models of this which I’ll write up later, but suffice it to say that we should strongly expect Ilya to fall into similar traps (especially given the form factor of SSI), and probably Jan too. So without defusing this dynamic, a lot of your claimed wins don’t stand up.
More than any other group I’ve been a part of, rationalists love to develop extremely long and complicated social grievances with each other, taking pages and pages of text to articulate. Maybe I’m just too stupid to understand the high level strategic nuances of what’s going on—what are these people even arguing about? The exact flavor of comms presented over the last ten years?
As someone who spends a significant part of his time briefing policymakers in Europe (ministerial advisors, senior civil servants working on AI governance), I want to point out something that is obvious from where I stand but absent from this discussion.
The “radical transparency vs. strategic communication” debate presupposes that framing is the bottleneck. It isn’t. The bottleneck is volume. Most policymakers have never heard the argument, no matter how you frame it. Among the ones I interact with, maybe 2% have been exposed to the problem enough to have an opinion. Another 10% or so have heard something, but mostly through the Yann LeCun-adjacent dismissals, and formed their view from that. The remaining ~88%, including people in very important AI governance positions, have simply never had the conversation.
The question of which approach works better is real but secondary. What’s missing is more people doing this work at all. It’s a campaign, and the limiting factor is coverage, not the message.
To give a concrete data point: the only policymaker in my circles who has ever brought up “If Anyone Builds It, Everyone Dies” is Lord Tim Clement-Jones, chair of the All-Party Parliamentary Group on AI in the UK. And he was probably already sympathetic. That’s one person.
Um, I think that long, detailed, audited arguments are how a substantial amount of social capital and resource allocation gets done around these parts.
And also, um, it is better than most alternative ways of doing it (e.g. networking, politicking).
Among other things, the fact that one of the leading ASI labs is substantially downstream of us. Separately, there’s a lot of real, actual politics that tends to happen in the community around prestige and money and talent allocation and respect, which needs to get litigated somehow (and abuse of power and legitimacy is common, and if you can’t talk about it you can’t have norms about it).
I think if your main interaction with PauseAI is a certain Twitter account, as served to you by the algorithm in interactions with your AI safety friends, then you might think that they’re mostly going after other, more moderate safety advocates. But this just isn’t a good picture of the overall actions of the movement. At least in the case of PauseAI UK, whose inner workings I have a decent understanding of, essentially zero time is spent thinking about other AI safety advocates. I expect that the same is true of Yudkowsky and MIRI.
Of course it is the case that being rude on Twitter towards people working on safety teams at OpenAI makes some things worse on some axes. And this is mostly bad and pointless, and I don’t endorse it. But that’s not even really what that post from Rob was doing! Rob was writing an opinionated, but civil, criticism. In what way is this “knifing” the other AI safety advocates? It’s not like MIRI killed SB 1047.
Now if Scott means something like “Giving money to MIRI pushes the world in the MIRI-preferred direction, and this would have meant no Anthropic and no safety team at OpenAI” then I can kind of maybe see what he means here. This just isn’t “knifing” in the sense of the betrayal that most people mean by the word. It’s just opposing someone’s plan, in a way that they’ve been doing for years. It’s not like MIRI would have actually used marginal resources to stop Anthropic from being created by, like, sabotage or something.
MIRI don’t even say that working in safety is bad! They only say that they think their approach is better. IABIED specifically states that they think mech interp researchers are “heroes” (as part of an example of research they think won’t work in time without political action).