Accidental AI Safety experiment by PewDiePie: He created his own self-hosted council of 8 AIs to answer questions. They voted and picked the best answer. He noticed they were always picking the same two AIs, so he discarded the others, made the process of discarding/replacing automatic, and told the AIs about it. The AIs started talking about this “sick game” and scheming to prevent that. This is the video with the timestamp:
From the AIs’ messages seen in the video, it’s possible that he provided those instructions as a user prompt instead of a system prompt. I wonder if the same thing would’ve happened if they had been given as the system prompt instead.
This experiment is pretty clever, no? I don’t think a total AI amateur would discover it: either he’s been following this problem for quite some time, or he read about it somewhere recently, or one of us AI safety nerds sponsored him. Not sure though; it’s not beyond what people with an investigative mindset might come up with.
He mentions he’s just learned coding, so I guess he had the AI build the scaffolding. But the experiment itself seems like a pretty natural idea; he literally likens it to a king’s council. I’m sure that once you have the concept, having an LLM code it is no big deal.
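For anyone curious how little scaffolding this takes, here is a minimal sketch of the answer/vote/discard loop described above. To be clear, this is my guess at the shape of it, not the setup from the video: `Member`, `run_round`, `prune`, and `demo_council` are hypothetical names, and the "models" are stub functions standing in for calls to self-hosted LLMs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins: a real setup would call self-hosted LLMs here.
AnswerFn = Callable[[str], str]
VoteFn = Callable[[str, Dict[str, str]], str]  # returns the name of the preferred member


@dataclass
class Member:
    name: str
    answer: AnswerFn
    vote: VoteFn
    wins: int = 0


def run_round(council: List[Member], question: str) -> str:
    """Collect one answer per member, let every member vote, return the winner's name."""
    answers = {m.name: m.answer(question) for m in council}
    ballots = Counter(m.vote(question, answers) for m in council)
    # max() returns the first maximal element, so ties go to the earlier member.
    winner = max(council, key=lambda m: ballots[m.name])
    winner.wins += 1
    return winner.name


def prune(council: List[Member], keep: int) -> List[Member]:
    """Automatic discard step: keep only the `keep` members with the most wins."""
    # sorted() is stable, so members with equal wins keep their council order.
    ranked = sorted(council, key=lambda m: m.wins, reverse=True)
    return ranked[:keep]


def demo_council() -> List[Member]:
    """Four stub members; 'quality' is faked as answer length (an assumption)."""
    def longest_answer(question: str, answers: Dict[str, str]) -> str:
        return max(answers, key=lambda name: len(answers[name]))

    def make_answer(quality: int) -> AnswerFn:
        return lambda question: "x" * quality

    return [Member(f"m{i}", make_answer(i), longest_answer) for i in range(1, 5)]
```

In this toy version, `m4` wins every round, and `prune(council, keep=2)` then drops the others (ties are broken by council order). The interesting part of the experiment is what happens once the members are told about `prune`, and whether that information goes in the system prompt or the user prompt, which this sketch does not model.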
Scott Alexander left an important reply to Rob Bensinger on X. I happen to agree with Scott. Here’s the original post by Rob:
In response to “What did EAs do re AI risk that is bad?”:
Aside from the obvious ‘being a major early funder and a major early talent source for two of the leading AI companies burning the commons’, I think EAs en masse have tended to bring a toxic combination of heuristics/leanings/memes into the AI risk space. I’m especially thinking of some combination of:
‘be extremely strategic and game-playing about how you spin the things you say, rather than just straightforwardly reporting on your impressions of things’
plus ‘opportunistically use Modest Epistemology to dismiss unpalatable views and strategies, and to try to win PR battles’.
Normally, I’m at least a little skeptical of the counterfactual impact of people who have worsened the AI race, because if they hadn’t done it, someone else might have done it in their place. But this is a bit harder to justify with EAs, because EAs legitimately have a pretty unusual combination of traits and views.
Dario and a cluster of Open-Phil-ish people seem to have a very strange and perverse set of views (at least insofar as their public statements to date represent their actual view of the situation):
---
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
If there is some argument for why a problem P might only show up at a higher capability level, or some argument for why a solution S that works well today will likely stop working in the future… well, those are just arguments. Arguments have a terrible track record in AI; the field is full of surprises. So we should stick to only worrying about things when the data mandates it. This is especially important to do insofar as it will help us look more credible and thereby increase our political power and influence.
2. When it comes to technical solutions to AI, the burden of proof is on the skeptic: in the absence of proof that alignment is intractable, we should behave as though we’ve got everything under control. At the same time, when it comes to international coordination on AI, we will treat the burden of proof as being on the non-skeptic. Absent proof that governments can coordinate on AI, we should assume that they can’t coordinate. And since they can’t coordinate, there’s no harm in us doing a lot of things to make coordination even harder, to make our lives a bit more convenient as we work on the technical problems.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
If you’re claiming that now is an important opportunity, and that we should be speaking out loudly about this issue today… well, that sounds risky and downright immodest. Many things are possible, and the future is hard to predict! Taking political risks means sacrificing enormous option value. The humble and safe thing to do is to generally not make too much of a fuss, and just make sure we’re powerful later in case the need arises.
---
1-3 really does seem like an unusually toxic set of heuristics to propagate, potentially worse than replacement.
- In an engineering context, the normal mindset is to place the burden of proof on the engineer to establish safety. There’s no mature engineering discipline that accepts “you can’t prove this is going to kill a ton of people” as a valid argument.
The standard engineering mindset sounds almost more virtue-ethics-y or deontological rather than EA-ish—less “ehh it’s totally fine for me to put billions of lives at risk as long as my back-of-the-envelope cost-benefit analysis says the benefits are even greater!”, more “I have a sacred responsibility and duty to not build things that will bring others to harm.”
Certainly the casualness about p(doom) and about gambling with billions of people’s lives is something that has no counterpart in any normal scientific discipline.
- Likewise, I suspect that the typical scientist or academic that would have replaced EAs / Open Phil would have been at least somewhat more inclined to just state their actual concerns about AI, and somewhat less inclined to dissemble and play political games.
Scientists are often bad at such games, they often know they’re bad at such games, and they often don’t like those games. EAs’ fusion of “we’re playing the role of a wonkish Expert community” with “we’re 100% into playing political games” is plausibly a fair bit worse than the normal situation with experts.
- And EAs’ attempts to play eleven-dimensional chess with the Overton window are plausibly worse than how scientists, the general public, and policymakers normally react to any technology under the sun that sounds remotely scary or concerning or creepy: “Ban it!”
Governments are incredibly trigger-happy about banning things. There’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI. And in fact, when my colleagues and I have gone out and talked to most populations about AI risk, people mostly have much more sensible and natural responses than EAs to this issue.
A way of summarizing the issue, I think, is that society depends on people blurting out their views pretty regularly, or on people having pretty simple and understandable agendas (e.g., “I want to make money” or “I want the Democrats to win”).
Society’s ability to do sense-making is eroded when a large fraction of the “specialists” talking about an issue are visibly dissembling and stretching the truth on the basis of agendas that are legitimately complicated and hard to understand.
Better would be to either exit the conversation, or contribute your actual pretty-full object-level thoughts to the conversation. Your sense of what’s in the Overton window, and what people will listen to, has failed you a thousand times over in recent years. Stop pretending at mastery of these tricky social issues, and instead do your duty as an expert and inform people about what’s happening.
I disagree with all of this on the epistemic level of “it’s not true”, and additionally disagree with your comms strategy of undermining EAs.
On the epistemic level—I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs. I don’t know exactly who you’re talking about, but Holden made a personal blog post saying that his p(doom) was 50%, and said:
>>> “I constantly tell people, I think this is a terrifying situation. If everyone thought the way I do, we would probably just pause AI development and start in a regime where you have to make a really strong safety case before you move forward with it.”
Dario said there’s a 25% chance “things go really, really badly”, and in terms of a pause:
>>> “I wish we had 5 to 10 years [before AGI]. The reason we can’t [slow down and] do that is because we have geopolitical adversaries building the same technology at a similar pace. It’s very hard to have an enforceable agreement where they slow down and we slow down. [But] if we can just not sell the chips to China, then this isn’t a question of competition between the U.S. and China. This is a question between me and Demis—which I am very confident we can work out.”
This is basically my position—I would add “we should try to negotiate with China, but keep this as a backup plan if it fails”, but my guess is Dario would also add this and just isn’t optimistic. I agree he’s written some other things (especially in Adolescence of Technology) that sound weirdly schizophrenic, and more on this later, but I give him a lot of credit for paragraphs like:
>>> “I think it would be absurd to shrug and say, “Nothing to worry about here!” But, faced with rapid AI progress, that seems to be the view of many US policymakers, some of whom deny the existence of any AI risks, when they are not distracted entirely by the usual tired old hot-button issues. Humanity needs to wake up, and this essay is an attempt—a possibly futile one, but it’s worth trying—to jolt people awake.”
Meanwhile, you seem to be treating all these people as basically equivalent to Gary Marcus. I think if you don’t mean these people in particular, you should specify who you’re talking about, and what things that they’ve said strike you in this way.
Absent that, I think this “debate” isn’t about OpenPhil or Anthropic failing to say they’re extremely worried, failing to say that catastrophe is a very plausible outcome, or failing to say that they think slowing down AI would be good if possible. It’s about OpenPhil in particular being pretty careful how they phrase things for public consumption. And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems by doing things like introducing DeepMind to its funders, getting Sam Altman and Elon Musk into AI, and building up excitement around “superintelligence” in Silicon Valley. I think if 2010-MIRI had slightly more strategicness and willingness to ask itself “hey, is this PR strategy likely to backfire?”, you might not have told a bunch of the worst people in the world that AI was going to be super-powerful and that whoever invested in it would be ahead in a race that might make them hundreds of billions of dollars (and yes, you did add “and then destroy the world”—but if you had been more strategic, you might have considered that investors wouldn’t hear that last part as loudly).
(you could argue that you’re not against strategicness in general, just talking about this one issue of saying cleanly that AI is very dangerous. But my impression is that Holden, Dario, have said this, many times—see examples above. What they haven’t said is “the situation is totally hopeless and every strategy except pausing has literally no chance of working”, but that isn’t a comms problem, that’s because they genuinely believe something different from you. And also, I frequently encounter people who say things like “Scott, I’m glad you wrote about X in way Y—it made me take AI risk seriously, after I’d previously been turned off of it by encountering MIRI”. I think a substantial reason that Dario’s writing sometimes seems schizophrenic when talking about AI risks is that he’s trying to convey that they’re serious while also trying to signal “I swear I’m not one of those MIRI people” so that his writing can reach some of the people you’ve driven away. I don’t think you drive them away because you’re “honest”, I think it’s just about normal issues around framing and theory-of-mind for your audience.)
I don’t actually want to re-open the “MIRI helped start DeepMind and OpenAI!!!” war or the “MIRI is arrogant and alienating!!!” war—we’ve both been through both of these a million times—but I increasingly feel like a chump trying to cooperate while you’re defecting. This is the foundation of my comms worry. Your claim that “governments are incredibly trigger-happy about banning things...there’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI” is too glib—I don’t think there’s ever been a ban on building something as economically-valuable and far-along as AI, executed competently enough that it would work if applied cookie-cutter to the AI situation. You’re trying to do a really difficult thing here. I respect this—all of our options are bad and unlikely to work, the situation is desperate, and I have no plan better than playing a portfolio of all the different desperate hard strategies in the hopes that one of them works. But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.
(I think if you guys had your way, Anthropic would never have been founded, no safety-minded people would ever have joined labs, and the current world would be a race between XAI, Meta, and OpenAI, all of which would have a Yann LeCun style approach to safety, and none of which would have alignment teams beyond the don’t-say-bad-words level. We wouldn’t have the head of the leading AI lab writing letters to policymakers begging them to “jolt awake”, we wouldn’t have a substantial fraction of world compute going to Jan Leike’s alignment efforts, we wouldn’t have Ilya sitting on $50 billion for some super-secret alignment project—just Mark Zuckerberg stomping on a human face forever. In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.)
I support your fight-for-a-pause strategy in theory, and I would like to support it with praxis, but right now I feel very conflicted about this, because I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.
I think that both of these posts seem very confused about the dynamics of who says or thinks what, and I’m pretty sad about these posts.
Thoughts on Rob’s post
In general, I’ll note that I don’t think Rob really knows many of the OP people; I suspect he has spent <40 hours talking to them about any of this possibly ever. (This is in contrast to e.g. Habryka.) I don’t know where he’s getting his ideas about what the OP people think, but he seems incredibly confused and ignorant. (Eliezer seems similarly ignorant about who believes what.)
‘be extremely strategic and game-playing about how you spin the things you say, rather than just straightforwardly reporting on your impressions of things’ plus ‘opportunistically use Modest Epistemology to dismiss unpalatable views and strategies, and to try to win PR battles’.
I don’t really think this is true.
Dario and a cluster of Open-Phil-ish people seem to have a very strange and perverse set of views
I wish Rob would be clear about who he was referring to. Dario has beliefs that seem to me very different from those of most people who worked on the 2022 AI misalignment risk efforts at Open Phil. (I’m thinking of people like Holden Karnofsky, Ajeya Cotra, Joe Carlsmith, Lukas Finnveden, Tom Davidson. I’ll refer to this as “OP AI people” despite the fact that none of them work at Coefficient Giving (which is what OP was renamed to).) Maybe Rob is talking about what Alexander Berger thinks?
(at least insofar as their public statements to date represent their actual view of the situation):
I think both Dario and Open Phil staff have been reasonably honest about their beliefs about catastrophic misalignment risk publicly, I think that Dario genuinely thinks it’s <5% and the OP AI people generally think it’s higher. (Tbc I think Dario’s take here is very bad!)
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
This is a reasonable statement of (a simple version of) the Dario/Jared/Anthropic position, but not the OP AI person position. The OP AI people were worried about AI misalignment and ASI enough to try to think it through in detail starting many years ago!
If there is some argument for why a problem P might only show up at a higher capability level, or some argument for why a solution S that works well today will likely stop working in the future… well, those are just arguments. Arguments have a terrible track record in AI; the field is full of surprises. So we should stick to only worrying about things when the data mandates it. This is especially important to do insofar as it will help us look more credible and thereby increase our political power and influence.
This is not what the OP people think, e.g. see 123. It’s a reasonable description of what Dario/Jared say.
2. When it comes to technical solutions to AI, the burden of proof is on the skeptic: in the absence of proof that alignment is intractable, we should behave as though we’ve got everything under control. At the same time, when it comes to international coordination on AI, we will treat the burden of proof as being on the non-skeptic. Absent proof that governments can coordinate on AI, we should assume that they can’t coordinate. And since they can’t coordinate, there’s no harm in us doing a lot of things to make coordination even harder, to make our lives a bit more convenient as we work on the technical problems.
This is not what the OP people think. I think it’s somewhat reasonable to accuse Anthropic of this.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I’ve never felt any pressure to play down my concerns from the OP people. For example, I’ve been in a lot of discussions about whether it’s better for MIRI to be more or less powerful or influential. To me, the main argument that it’s bad for MIRI to be more influential isn’t that MIRI is making a mistake by openly saying that risk is high. It’s that MIRI has beliefs about x-risk that are wrong on the merits which lead them to making unpersuasive arguments and bad recommendations, and they’re in some ways incompetent at communicating.
And I think this is not very representative of what Anthropic thinks. E.g. they don’t really think of themselves as coordinating with other AI-safety-concerned people.
If you’re claiming that now is an important opportunity, and that we should be speaking out loudly about this issue today… well, that sounds risky and downright immodest. Many things are possible, and the future is hard to predict! Taking political risks means sacrificing enormous option value. The humble and safe thing to do is to generally not make too much of a fuss, and just make sure we’re powerful later in case the need arises.
This is somewhere between “strawman” and “just totally confused as a description of what people believe”.
Basically everything else in Rob’s post seems like a strawman.
Overall, I think this post is extremely confused, and Rob should be ashamed of writing such incredibly strawmanned things about what someone else thinks.
I recommend that people place very little trust in claims Rob makes about what other people believe. As someone who knows and talks regularly to the “Open Phil AI people”, I seriously think that Rob has no idea what he’s talking about when he ascribes arguments to them.
I guess there’s the question of what we are supposed to do if, in fact, the OP people agree with Rob’s version of their position but publicly deny that—at that point we’d have to do some brutal adjudication based on confusing private evidence or inferences from public actions and statements. I really don’t think that looking into that evidence would support Rob’s claims.
Thoughts on Scott’s post
I disagree with all of this on the epistemic level of “it’s not true”, and additionally disagree with your comms strategy of undermining EAs.
I don’t really think of Rob or MIRI as having a comms strategy of undermining EAs. I think Rob and Eliezer just say a bunch of false, wrong things about EAs because they’re mad at them for reasons downstream of the EAs not agreeing with Eliezer as much as Eliezer and Rob think would be reasonable, and a few other things.
On the epistemic level—I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs.
Some EAs engage in equivocation and shyness about their beliefs; OP AI people less than many others.
Absent that, I think this “debate” isn’t about OpenPhil or Anthropic failing to say they’re extremely worried, failing to say that catastrophe is a very plausible outcome, or failing to say that they think slowing down AI would be good if possible.
I think Dario (like various other Anthropic people) does not believe that AI takeover is a very plausible outcome, and I think his position is indefensible on the merits, as are some of his other AI positions (e.g. his skepticism that there are substantial returns to intelligence above the human level, his skepticism that ASI could lead to 2x manufacturing capacity per year). He moderately disagrees with the OP people about this.
And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems by doing things like introducing DeepMind to its funders, getting Sam Altman and Elon Musk into AI, and building up excitement around “superintelligence” in Silicon Valley. I think if 2010-MIRI had slightly more strategicness and willingness to ask itself “hey, is this PR strategy likely to backfire?”, you might not have told a bunch of the worst people in the world that AI was going to be super-powerful and that whoever invested in it would be ahead in a race that might make them hundreds of billions of dollars (and yes, you did add “and then destroy the world”—but if you had been more strategic, you might have considered that investors wouldn’t hear that last part as loudly).
I don’t totally understand what point Scott is trying to make here, but I think this point is quite unfair.
(you could argue that you’re not against strategicness in general, just talking about this one issue of saying cleanly that AI is very dangerous. But my impression is that Holden, Dario, have said this, many times—see examples above. What they haven’t said is “the situation is totally hopeless and every strategy except pausing has literally no chance of working”, but that isn’t a comms problem, that’s because they genuinely believe something different from you.
Agreed
And also, I frequently encountering people who say things like “Scott, I’m glad you wrote about X in way Y—it made me take AI risk seriously, after I’d previously been turned off of it by encountering MIRI”. I think a substantial reason that Dario’s writing sometimes seems schizophrenic when talking about AI risks is that he’s trying to convey that they’re serious while also trying to signal “I swear I’m not one of those MIRI people” so that his writing can reach some of the people you’ve driven away. I don’t think you drive them away because you’re “honest”, I think it’s just about normal issues around framing and theory-of-mind for your audience.)
I think Scott is blaming MIRI much too much here. Dario’s main difficulty when arguing that he thinks AI will pose huge catastrophic risk in the next few years is that lots of people think this seems implausible on priors, not because those people were specifically turned off by MIRI making related arguments earlier. His core audience has never heard of MIRI.
But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.
I think this is an incorrect read. Some people from PauseAI and MIRI criticize AI safety efforts a lot, often in ways I think are really dumb and counterproductive. But I don’t think they’re doing this as part of a strategy to force people into their strategies; it’s because of some combination of them genuinely (but perhaps foolishly) thinking that the other strategies are bad and/or the people executing them are corrupt.
(I think if you guys had your way, Anthropic would never have been founded, no safety-minded people would ever have joined labs, and the current world would be a race between XAI, Meta, and OpenAI, all of which would have a Yann LeCun style approach to safety, and none of which would have alignment teams beyond the don’t-say-bad-words level. We wouldn’t have the head of the leading AI lab writing letters to policymakers begging them to “jolt awake”, we wouldn’t have a substantial fraction of world compute going to Jan Leike’s alignment efforts, we wouldn’t have Ilya sitting on $50 billion for some super-secret alignment project—just Mark Zuckerberg stomping on a human face forever. In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.)
I disagree with a lot of the claims here about how various aspects of the current situation are good. (E.g. why does he think that Ilya is doing an alignment effort?)
I support your fight-for-a-pause strategy in theory, and I would like to support it with praxis, but right now I feel very conflicted about this, because I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.
It’s unclear what “you guys” means. I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them. I think Rob is totally unreasonable and I wish he would stop working on AI safety, but I think he’s much worse than e.g. MIRI is overall. I think MIRI spends very little of their support on knifing AI safety advocates, they spend almost all of it on advocating for people being scared about misalignment risk and advocating for AI pauses (which I am generally in favor of). Eliezer totally does have a hobby of saying ridiculously strawmanny stuff about OP AI people, which I find pretty extremely annoying, but I don’t think it’s a big part of his effect on the world.
----
Overall, both posts seem to have substantially inaccurate pictures of what’s going on and what various actors think.
Thanks for writing this, Buck. I’m not going to try to reply to your whole post, because I think some of it is stuff I should chew on for longer and see whether I agree with it. But going through some of your points:
I definitely apologize for making it sound like I was making a harsher criticism of (the relevant parts of) EA than I intended. My tweet was originally written as a quick follow-up comment to someone who asked why I thought EA’s impact on AI x-risk was only ~55% likely to be positive. I turned it into a top-level tweet because I didn’t want to hide it deep in an existing discussion, but this was an error given I didn’t add extra context.
I also apologize for anything I said that made it sound like I was universally criticizing past or present Open Phil / cG staff (or centrally basing my views on first-hand conversations, for that matter). I already believed that tons of past and present rank-and-file OP/cG staff have very reasonable views, and I happily further update in that direction based on your and Oliver’s statements to that effect (e.g., Ollie’s “I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs”).
I agree that my characterization of “Dario and a cluster of Open-Phil-ish people” was phrased in a needlessly confusing and sloppy way. I wanted to talk about a mix of ‘present-day views that seem to be endorsed by Dario and some other key figures’ and ‘general tendencies and memes that seem pretty widespread and that seem suspiciously related to choices EA leadership made many years ago’, but blurring these together is really unnecessarily confusing. Also, it didn’t help that I was sarcastically embedding my criticisms into my summaries of the views.
Insofar as my broad criticism of EA cultural trends/memes is correct (and I think it substantially is), I still feel a fair bit of uncertainty about how to divvy up responsibility between more Open-Phil-ish people, more Oxford-ish people, MIRI / the rats, etc. And of course, some of the problem may stem from broader social-or-demographic factors that no EA leaders tried to engineer, and that even go counter to how leadership has tried to optimize. (I too remember the early speeches themed around “Keep EA Weird”, the early EA-leader conversations fretting about overly naive EA consequentialism, etc.)
Thanks, this is helpful and I basically accept most of what you’re saying. Some more specific comments on the part about me:
I don’t really think of Rob or MIRI as having a comms strategy of undermining EAs. I think Rob and Eliezer just say a bunch of false, wrong things about EAs because they’re mad at them for reasons downstream of the EAs not agreeing with Eliezer as much as Eliezer and Rob think would be reasonable, and a few other things.
I accept this criticism and take back my claim. I noticed that some people who worked for MIRI comms seemed to do this, and I assumed that anything said by enough MIRI comms people in a serious-sounding voice was on some level a MIRI communique. Eliezer has clarified that this isn’t true, so I apologize for saying it was.
I think Dario (like various other Anthropic people) does not believe that AI takeover is a very plausible outcome, and I think his position is indefensible on the merits, as are some of his other AI positions (e.g. his skepticism that there are substantial returns to intelligence above the human level, his skepticism that ASI could lead to 2x manufacturing capacity per year). He moderately disagrees with the OP people about this.
I basically agree with this (while wanting to clarify that I think he assigns a pretty high risk to permanent dictatorship or something along those lines) but I think he’s done an okay job of navigating uncertainty, realizing that even a low chance of human extinction is very bad, and being willing to (somewhat) cooperate and collect gains-from-trade with people who are doomier than he is. I see him as living in a consistent worldview next door to our movement’s (sort of like Vitalik or Dean Ball) and I think that, like those two people, he’s potentially somewhere between a friend / an ally-of-convenience / a negotiating partner, potentially convertible into a full ally if future events prove us right, or into a true enemy if we pre-emptively alienate him. Having someone like this in charge of a frontier lab is better than I expected (Demis might also be in this category, but I’m not sure, and worry that Larry and Sergey have final say).
I think Scott is blaming MIRI much too much here. Dario’s main difficulty when arguing that he thinks AI will pose huge catastrophic risk in the next few years is that lots of people think this seems implausible on priors, not because those people were specifically turned off by MIRI making related arguments earlier. His core audience has never heard of MIRI.
I agree that Dario is slightly being a jerk here, but I think that people have lots of stereotypes of “doomers” which derive from some real behavior of MIRI and PauseAI, and which wouldn’t exist if the median pause AI person was eg the median Constellation person, and I think Dario feels some understandable incentive to distance himself from this.
I disagree with a lot of the claims here about how various aspects of the current situation are good. (E.g. why does he think that Ilya is doing an alignment effort?)
I have no useful knowledge here, but Ilya seems genuinely alignment-pilled and terrified, the fact that he did the very courageous and self-sacrificing thing of trying to blow up OpenAI to try to get rid of Altman for what were mostly safety-related reasons speaks well of him, and IDK, he’s calling it “safe superintelligence” and saying he won’t release anything at all until he’s sure. I don’t claim any secret expertise in Ilya-ology but overall all of this seems encouraging and I’m surprised this part of my tweet attracted so much dissent.
It’s unclear what “you guys” means. I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them. I think Rob is totally unreasonable and I wish he would stop working on AI safety, but I think he’s much worse than e.g. MIRI is overall. I think MIRI spends very little of their support on knifing AI safety advocates, they spend almost all of it on advocating for people being scared about misalignment risk and advocating for AI pauses (which I am generally in favor of). Eliezer totally does have a hobby of saying ridiculously strawmanny stuff about OP AI people, which I find pretty annoying, but I don’t think it’s a big part of his effect on the world.
I mostly accept your criticism that I should narrow my objections from “MIRI & Co” to “Pause.AI, Rob, maybe sort of Eliezer, & a slightly different co”. I don’t really know how to do this or what one word covers all of them without inflicting different forms of collateral damage (I don’t want to say “PauseAIers” because that also covers some people I like, and it feels extra-aggressive to name specific names), but I’m open to suggestion.
I’m generally sympathetic to Scott’s positions in this discussion, but I think he is probably very wrong about Ilya.
To the best of my knowledge, Safe Superintelligence has never published a single word about what they plan to do to move alignment forward, which is pretty damning, in my opinion.
I have not heard of anyone known to be thoughtful about AI safety being hired by SSI, and I have not seen any position being advertised to AI safety people. People should correct me if I missed someone good joining SSI, but I think this is also a very bad sign.
My impression is that people who worked with Ilya at OpenAI don’t remember him as being particularly thoughtful about alignment, e.g. much less so than Jan Leike. This is a low confidence, third-hand impression, people can correct me if I’m wrong.
My impression is that the available evidence suggests that Ilya mostly took part in Altman’s firing for (perhaps justified) office politics grievances, and not primarily due to safety concerns. I also think that evidence points to his behavior during and after the incident being kind of cowardly. (I haven’t looked deeply into the details of the battle of the board, and it’s possible I’m wrong on this point, in which case I apologize to Ilya.) I’m also doubtful of how self-sacrificing his actions were—my best guess is that his current net worth is higher (at least on paper) than it would be if he stayed at OpenAI.
I expect that at some point SSI’s investors will grow impatient, and then SSI will start coming out with AI products (perhaps open-source to be cooler), just like everyone else. I don’t expect them to contribute too much to safety, though maybe Ilya will sometimes make some noises about the importance of safety in public speeches, which is nice I guess.
I’m pretty confident in my first two points, much less so in the next two, but I felt someone should respond to Scott on this point. Perhaps @Buck or someone else who expressed skepticism of Ilya’s project can add more information.
In general, I’ll note that I don’t think Rob really knows many of the OP people; I suspect he has spent <40 hours talking to them about any of this possibly ever.
I think you are overfitting Rob’s post to be about the wrong people. I think it’s much closer to accurate, if you actually read what he says, which is:
Dario and a cluster of Open-Phil-ish people
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe, which seems well-described by “Dario and a cluster of Open-Phil-ish people”, and furthermore also of course constitutes an enormous fraction of the authority over broader EA.
I feel like almost all of your comment is just running with that misunderstanding and hence mostly irrelevant.
As you say yourself, almost no one in your list works at cG, or is in any meaningful position of authority at cG, so this feels like a bit of an absurd interpretation (I think trying to apply the things he is saying to Holden is reasonable, given Holden’s historical role in cG, and I do think he in the distant past said things much closer to this, but seems to have changed tack sometime in the past few years).
As you say yourself, almost no one in your list works at cG, or is in any meaningful position of authority at cG, so this feels like a bit of an absurd interpretation
A lot of Rob’s complaints are about things that happened in the past, so I don’t think it’s crazy to interpret him as talking about people who worked at CG in the past.
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe, which seems well-described by “Dario and a cluster of Open-Phil-ish people”, and furthermore also of course constitutes an enormous fraction of the authority over broader EA.
I think that these people believe different things, and I don’t think Rob’s post particularly accurately describes any of them. For example, the Anthropic leadership doesn’t really think of themselves as trying to coordinate with AI safety people or trying to suppress them. I don’t think Alexander thinks “AI is going to become vastly superhuman in the near future” (and fwiw I don’t think Dario thinks that either, he doesn’t seem to believe in returns to intelligence substantially above human-level).
A lot of Rob’s complaints are about things that happened in the past, so I don’t think it’s crazy to interpret him as talking about people who worked at CG in the past.
Fair enough. I think that the people you list also used to believe things closer to what Rob is saying in the past, so at least we need to do a consistent comparison. Holden from 10 years ago seems to say a lot of the things that Rob is saying here, and Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
My guess is that it is worth digging up quotes here, but it’s a lot of work, so I am not going to do it for now, but if it turns out to be cruxy, I can.
(Again, I don’t think these are centrally the people Rob is talking about in either case. I think centrally he is talking about Anthropic, and then secondarily talking about how Open Phil people have related to Anthropic over the years, but I do still think his criticism is correct directionally for those people)
I don’t think Alexander thinks “AI is going to become vastly superhuman in the near future” (and fwiw I don’t think Dario thinks that either, he doesn’t seem to believe in returns to intelligence substantially above human-level).
I think Alexander abstractly believes that AI could very well become vastly superhuman in the near future, but yes, similar to Dario, he does not believe that speculating about such a thing in a non-scientific non-empirical way is appropriate, and as such they do not have coherent beliefs about this. Indeed, that seems like quite a central match to what Rob is saying.
Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
I don’t remember anything like this. I think it might be misremembered or a strained interpretation.
Here are points 1 and 3 for reference:
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I asked ChatGPT to read bioanchors (where I thought this was most likely to occur), and then to read all of her other writings looking for anything that fits that mode. Here’s its reply, not finding anything.
The closest match it finds is that Ajeya often caveats her claims. For example from bio anchors:
This is a work in progress and does not represent Open Philanthropy’s institutional view […] Accordingly, we have not done an official publication or blog post, and would prefer for now that people not share it widely in a low bandwidth way.
Huh, I am a bit confused about you summarizing that ChatGPT response that way. Maybe we are talking past each other, but Robby’s statements are not intended as the kind of statement that passes people’s ITT (which IMO is fine, frequently summaries of other people’s views should not pass their ITT, though it should ideally be caveated when this is going on).
Despite that, your ChatGPT transcript says:
Yes—there are clear resonances with both of your points, though mostly as counterpressures or explicit methodological caveats rather than direct endorsements. The strongest matches are in how Cotra frames forecasting discipline under radical uncertainty and how she handles communication norms around high-stakes speculative claims.
I am not expecting any direct endorsements of these statements (which are phrased as to make their internal contradictions most obvious), so this ChatGPT response seems compatible with what I am saying?
When I asked ChatGPT to “rephrase these two beliefs in more neutral language that would make more sense for someone to endorse (but try to pretty tightly imply the above)” it gave these two:
1. AI may become far more capable soon, but risk assessment should remain tightly tied to currently observable systems and evidence, not to conjectures about novel future dangers.
3. AI risk advocates should be selective and disciplined in how they present their concerns, emphasizing messages that are most likely to preserve credibility, attract allies, and strengthen their long-term influence.
Using Cotra’s public bio-anchors materials that I could directly inspect — especially her draft-report announcement, her long AXRP explanation of the framework, and later timeline/milestone essays — my read is: your first point gets a qualified yes, while your third point gets a strong yes.
But also, when we are in the domain of “evaluate whether Ajeya said things that imply the things above and result in other people getting the same vibe as the above”, then ChatGPT and Claude seem like much worse judges, so I think this question becomes more difficult to answer and I wouldn’t super defer to the language models (which is part of why I expected it would take a while to dig up quotes and do the work and stuff).
(If you want to complain that Robby should have caveated his stuff more as not being the kind of thing that passes people’s ITT, then I am happy to argue about that. I think a better post would have done it, but it’s not something I think is always necessary to do.)
(Also just for the sake of completeness, I don’t get this vibe from Ajeya at all these days and have no complaints on this front, besides probably still some strategic disagreement on stuff around point 3, but like at the level that I have with many people I respect almost certainly including you)
Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2)
I interpreted you as claiming that Ajeya had said “things more like:”
In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I don’t recall any examples of Ajeya saying or implying anything at all like that. I asked ChatGPT to try to find examples and I think it didn’t find anything.
In your ChatGPT session, a typical example it cites is:
In the AXRP discussion, she also says there were concerns that making the report seem too slick or official could increase capabilities interest.
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.
In your ChatGPT session, a typical example it cites is:
In the AXRP discussion, she also says there were concerns that making the report seem too slick or official could increase capabilities interest.
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.
I have no interest in defending ChatGPT’s claims here, and feel like I caveated that pretty explicitly. I agree that quote is largely irrelevant.
I asked ChatGPT to try to find examples and it didn’t find anything.
Yep, I agree with you that ChatGPT did not find any clear quotes (though it doesn’t look like ChatGPT tried very hard to find quotes). I disagree that it didn’t find “anything at all like that” (indeed ChatGPT is quite explicit that it found some things “kind of like that”).
I don’t recall any examples of Ajeya saying or implying anything at all like that.
I do. As I said, I could go and dig them up but it would take quite a while, and I am only like 75% confident they are written up as opposed to conversations, or private Google Docs or something that I would have trouble finding. It was a strong vibe I got at the time and I remember having a few conversations about adjacent conversations either with Ajeya or being about Ajeya.
Let me know if you want me to do this. I don’t quite know what’s at stake here for you, and I feel somewhat like we are talking past each other and before I do that it would be more productive to go up some meta-level, but I am not quite sure.
I think you’re right, and also it seems misleading / like a bad clustering to lump “the EAs” in with “Anthropic’s leadership”. I think those groups have some memetic connections, but they’re not the same group!
(I feel like it’s more of a reasonable carving to lump in OpenPhil with “the EAs”, since they were/are effectively EA thought-leaders and they exerted a lot of influence, directly and indirectly.)
I think you’re right, and also it seems misleading / like a bad clustering to lump “the EAs” in with “Anthropic’s leadership”. I think those groups have some memetic connections, but they’re not the same group!
More than 50% of the talent-weighted safety people in EA are literally employees of Anthropic! The ex-CEO of Open Phil now works at Anthropic, and is married to one of its founders. These groups have enormous overlap.
Like, there is such enormous overlap, and the overlap results in such an enormous amount of de-facto deference (being an employee of a company is approximately the strongest common deference relationship we have) that it makes sense to think of these as closely intertwined.
Yes, there are people who attach the EA label to themselves who are different here, sometimes even quite substantial clusters. But it’s also IMO clear from Scott’s response that he himself is also majorly deferring and is majorly supportive of Anthropic as a representative of EA, so this clearly isn’t just a split between “everyone who works at Anthropic and everyone who doesn’t”.
Rob used “Open Phil” exactly two times. One time saying “a cluster of Dario and Open-Phil-ish people” and another time “EAs / Open Phil” in reference to the broader community that includes all of these things. These seem like totally reasonable ways of using these pointers and words. I don’t have anything better. It’s definitely not “just Anthropic” as I think Scott very unambiguously demonstrates, and it would be of course extremely confusing to refer to Scott as “Anthropic”.
Imagine re Open Phil and hardcore rationalists “the ex-CEO of MIRI now works at Open Phil, and the CEO of Lightcone is dating an Open Phil employee. These groups have enormous overlap.”
Yes. People can have a lot of social overlap, yet have very different views from one another, especially in the broader Bay Area intellectual ecosystem. My sense is that Anthropic leadership has very different views from most AI safety EAs.
More than 50% of the talent-weighted safety people in EA are literally employees of Anthropic!
Why do you think this? I’m skeptical this is true, especially if you’re including non-technical talent.
Why do you think this? I’m skeptical this is true, especially if you’re including non-technical talent.
IDK, I counted them? I made some spreadsheets over the years, and ran this number by a bunch of other people, and my current guess is that it’s around 55%? When I list organizations with full-time employees working in safety I actually end up at substantially above 50% of people working at Anthropic, but I think that’s overcounting.
My sense is that Anthropic leadership has very different views from most AI safety EAs.
I think there are differences and overlaps. I think Rob points to a thing that is shared across a cluster that spans both of them, and has historically had a lot of influence.
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe. I feel like almost all of your comment is just running with that misunderstanding.
But aren’t Alexander Berger’s views not very relevant about OpenPhil’s AI strategy decisions from many years ago when their AI strategy and worldview—which I take to be very close to the things Rob was criticizing—were worked out and started shaping the views of EAs in OpenPhil’s orbit?
Even now, when people criticize things OpenPhil has done in the past in the AI landscape, or criticize their general worldview and takes on AI risk (as it was developed in influential pieces of writing), I am by default automatically viewing it as criticism of Holden, Ajeya Cotra, Tom Davidson, Joe Carlsmith, etc. If people don’t intend me to interpret them that way, please be more clear. 🙂
I’m aware that, separately, OpenPhil/Coefficient Giving has undergone quite a transition and that you clashed badly with Dustin M. I think that’s very sad and unfortunate, but I think of these as quite distinct things and I never assumed that the thing with Dustin M. had anything to do with OpenPhil’s AI strategy decisions from (say) five years ago (edit: sorry that sounds like a strawman, but I mean something like “I’m not sure the same cause explains why some people who were at OpenPhil in the past found MIRI epistemically off-putting, and why Dustin M finds the rationalists to be a reputation risk & thinks reputation risks are unusually bad compared to other bad things.”) I could be wrong, of course, and maybe you think the org has a general tendency of valuing “reputability” and “playing politics” too much. I just want to note that it’s not obvious how much these things are connected/caused by one “OpenPhil culture,” vs being about distinct things. (I think some of these are maybe directionally accurate as criticism, btw.)
I’m sure this is obvious to everyone involved, but I also just want to point out that when a lot of senior people leave, organizations can change really a lot, so it would be weird to speak of OpenPhil/Coefficient Giving now as though it were obviously still the same entity/culture.
But aren’t Alexander Berger’s views not very relevant about OpenPhil’s AI strategy decisions from many years ago when their AI strategy and worldview—which I take to be very close to the things Rob was criticizing—were worked out and started shaping the views of EAs in OpenPhil’s orbit?
I think Holden at the time believed something closer to what Rob says here (though it’s still not an amazing fit), and more generally, I think “the beliefs of the successor CEO” are actually a better proxy for “the vibes of the broader ecosystem you are part of” than “the beliefs of the founder CEO”. I could go into more detail on my beliefs on this, though I think the argument is reasonably intuitive.
but I think of these as quite distinct things and I never assumed that the thing with Dustin M. had anything to do with OpenPhil’s AI strategy decisions from (say) five years ago
Yep, I think they are highly related. Indeed, I was predicting things like the Dustin thing without any knowledge of Dustin’s specific beliefs, and my predictions were primarily downstream of seeing how Anthropic’s position within the ecosystem was changing, and a broader belief-system that I think is shared by many people in leadership, not just Dustin.
I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs, but also updated that those things end up mattering surprisingly little for what actually ends up a strategic priority.
I just want to note that it’s not obvious how much these things are connected/caused by one “OpenPhil culture,” vs being about distinct things. (I think some of these are maybe directionally accurate as criticism, btw.)
I think the “OpenPhil culture” thing is a distraction. In my model of the world most of this is downstream of people being into power-seeking strategies mostly from a naive-consequentialist lens, which is not that unique to OpenPhil within EA (and if anything OpenPhil has some of the people with the best antibodies to this, though also a lot of people who think very centrally along these lines, more concentrated among current leadership).
I think some of the people who are best at thinking independently about stuff, and are pretty good at not getting swept up in the power-seeking stuff, work at Open Phil. I think Holden genuinely helped with some of the correct cultural pieces, and my current belief is that if he wasn’t under the most pressure that anyone is, he would probably have a relatively sane relationship to Anthropic as a result of it, though I am not as confident about that as I am that he had a bunch of quite good cultural pieces that help people be less naively power-seeking here.
I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them.
Pause AI is a global activist movement with many chapters, and members with a mix of opinions, with some voices louder than others.
I’ve been volunteering there full time for a couple of years. I’m someone who cares a lot about partnership and the ecosystem of AIXR organizations. (I reckon not being killed by superintelligence is helped by pursuing a portfolio of bets that are mostly disjunctive and individually low odds.)
Buck, I would be really interested to hear more about your concrete concerns with Pause AI. By all means link a previous account if one exists.
Honestly, this is such a bad reply by Scott that I… don’t quite know whether I want to work on all of this anymore.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I have issues with Rob’s top-level tweet. I think it gets some things wrong, but it points at a real dynamic. It’s kind of strawman-y about things, and this makes some of Scott’s reaction more understandable, but his response overall seems enormously disproportionate.
Scott’s response is extremely emblematic of what I’ve experienced in the space. Simultaneous extreme insults and obviously bad faith arguments (“actually, it’s your fault that Deepmind was founded because you weren’t careful enough with your comms”), and then gaslighting that no one faces any censure for being open about these things (despite the very thing you are reading being extremely aggro about the lack of strategic communication), and actually we should be happy that Ilya started another ASI lab, and that Jan Leike has some compute budget.
The whole “no you are actually responsible for Deepmind” thing, in a tweet defending that it’s great that all of our resources are going into Anthropic, is just totally absurd. I don’t know what is going on with Scott here, but this is clearly not a high-quality response.
Copying my replies from Twitter, but I am also seriously considering making this my last day. It’s not the kind of decision to be made at 5AM in the morning so who knows, but seriously, fuck this.
IMO this doesn’t seem like the kind of response you will endorse in a few days, especially the “You are responsible for Deepmind/OpenAI” part.
You were also talking about AI close to the same time, and you’ve historically been pretty principled about this kind of stance.
you could argue that you’re not against strategicness in general, just talking about this one issue of saying cleanly that AI is very dangerous.
Robby at least has been very consistent on this that he is against most forms of strategic communication in general.
I also think you are against many forms of strategic communication in general? Your writing explores many of the relevant considerations in a lot of depth, and you certainly have not shied away from sharing your opinion on controversial issues, even when it wasn’t super clear how that is going to help things.
I think you are just arguing the wrong side of this specific argument branch. In my model, Eliezer, Nate and Robby have all been pretty consistent that being overly strategic in conversation usually backfires. Of course you shouldn’t have no strategy, and my model of Eliezer in particular has been in the past too strategic for my tastes and so might disagree with this, but I am pretty confident Robby himself is just pretty solidly on the side of “it’s good to blurt out what you believe, *especially* if you don’t have any good confident inside view model about how to make things better”.
In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
This makes this comparison just weird. Yes, according to everyone’s models the only time you might have the political will to stop will be in the future. I have never seen Nate or Eliezer or Robby say that they expect to get a stop tomorrow. But they of course also know that getting in a position to stop takes a long time, and the right time to get started on that work was yesterday.
So if they had their way (with their present selves teleported back in time), we would have more draft treaties and more negotiation between the U.S. and China. More materials ready to hand to congresspeople who are trying to grapple with all of this stuff. Essays and books and movies and videos explaining the AI existential risk case straightforwardly to every audience imaginable.
That is what you could do if you took the 200+ risk-concerned people who ended up instead going to work at Anthropic, or ended up trying to play various inside-game politics things at OpenAI.
And man, I don’t know, but that just seems like a much better world. Maybe you disagree, which is fine, but please don’t create a strawman where Robby or Nate or Eliezer were ever really centrally angling for a short-term pause that would have already passed by then.
And then even beyond that, I think if you don’t know how to solve a problem, it is generally the virtuous thing to help other people get more surface area on solving it. Buying more time is the best way to do that, especially buying time now when the risks are pretty intuitive. I think you believe this too, and I don’t really know what’s going on with your reaction here.
But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.
Come on man, a huge number of people we both respect have recently updated that the kind of direct advocacy that MIRI has been doing has been massively under-invested in. I do not think that “other people are executing this portfolio plan admirably”, and this is just such a huge mischaracterization of the dynamics of this situation that I don’t know where to start.
“If Anyone Builds It, Everyone Dies” is a straightforward book. It doesn’t try to sabotage every other strategy in the portfolio, and I have no idea how you could characterize really any of the media appearances of Nate this way.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy and making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where absolutely there have been tons of attempts to sabotage people trying to pursue strategies like this.
Like man, we can maybe argue about the magnitude of the errors here, and the sabotage or whatever, but trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
We wouldn’t have the head of the leading AI lab writing letters to policymakers begging them to “jolt awake”, we wouldn’t have a substantial fraction of world compute going to Jan Leike’s alignment efforts, we wouldn’t have Ilya sitting on $50 billion for some super-secret alignment project
I am sympathetic to the first of these (but disagree you are characterizing Dario here correctly).
But come on, clearly Ilya sitting on $50 billion for starting another ASI company is not good news for the world. I don’t think you believe that this is actually a real ray of hope.
(And then I also don’t think that Jan Leike having marginally more compute is going to help, but maybe there is a more real disagreement here)
Overall, I am so so so tired of the gaslighting here.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I don’t think Scott speaks for the ecosystem. He’s just a guy in it, and one who isn’t even that closely connected to Anthropic or Coefficient Giving people. (E.g. you spend >10x as much time talking to people from those orgs as he does.) I think that the people in the ecosystem you’re criticizing would not approve of Scott’s post.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy and making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where absolutely there have been tons of attempts to sabotage people trying to pursue strategies like this.
I think this is not a good summary of what Coefficient Giving has done. (I do think it really sucks that they defunded Lightcone.)
I think that the people in the ecosystem you’re criticizing would not approve of Scott’s post.
I think this is false. I expect Scott’s post to be heavily upvoted, if it was posted to the EA Forum to have an enormously positive agree/disagree ratio, and in-general for people to believe something pretty close to it.
There are a few exceptions (somewhat ironically a good chunk of the cG AI-risk people), but they would be relatively sparse. I think this is roughly what someone who is smart, but doesn’t have a strong inside-view take about what they should do about AI-risk believes that they should act like if they want to be a good member of the EA community. My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
He’s just a guy in it, and one who isn’t even that closely connected to Anthropic or Coefficient Giving people.
The issue is of course not that Scott is right or wrong about what Anthropic or cG people believe. The issue is that he seems to be taking a view where you should be super strategic in your communications, sneer at anyone who is open about things, and measure your success in how many of your friends are now at the levers of power.
I think this is not a good summary of what Coefficient Giving has done.
I think cG’s funding decisions were really very centrally about trying to punish people who weren’t being strategic in their communications in the way that Dustin wanted them to be strategic in their communications.
I think other “all kinds of complicated adversarial shit” has also happened, though it’s harder to point to. At a minimum I will point to the fact that invitation decisions to things like SES have followed similar adversarial “you aren’t cooperating with our strategic communications” principles.
I think this is false. I expect Scott’s post to be heavily upvoted, if it was posted to the EA Forum to have an enormously positive agree/disagree ratio, and in-general for people to believe something pretty close to it.
The EA Forum is a trash fire, so who knows what would happen if this was published there.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me (which is ofc part of why I sometimes bother commenting on things like this).
My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
I think that Scott’s post would not overall be received positively by those people. Maybe you’re saying that one of the directions argued for by Scott’s post is approved of by those people? I agree with that more.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me
Well, I mean, that is a hard conditional to be false since if people were to not change their mind, this would largely invalidate the premise that they are inclined to defer to you. Unfortunately, I both think the vast majority of places in EA do not defer to you or people like you, and furthermore, I also think you are pretty importantly wrong about your criticisms, so I don’t quite know how to feel about this.
I do think it helps and am marginally happy about your cultural influence here (though it’s tricky, I also think a bunch of your takes here are quite dumb). I think the vast majority of the cultural influence here is downstream of not quite anyone in-particular, but more Anthropic than anywhere else, and neither you nor me can change that very much.
I think that Scott’s post would not overall be received positively by those people.
Yeah, I expect it to be straightforwardly positively received. I think people will be like “some parts of this seem dumb, the Ilya thing in-particular, but yeah, fuck those rationalists and MIRI people, I am with Scott on that”.
To be clear, I am not expecting consensus here, I think this will be what 75% of people who have any opinion at all on anything adjacent to this believe, but I expect people would broadly think it’s a good contribution that properly establishes norms and reflects how they think about things.
I also think it’s plausible people would be like “wow, what an uncouth way that both of these people are interfacing with each other, please get away from each other children”, but then actually if you talked to them afterwards, they would be like “yeah, I mean, that was a bit of a shitshow but I do think Scott was basically right here (minus 1-2 minor things)”.
I am not enormously confident on this, but it matches my experiences of the space.
I agree with Habryka that absent criticism Scott’s post would be well received by an important group of people reasonably characterized as EA-ish AI safety people.
Imo absent criticism Rob’s post would be well received by a different group of people reasonably characterized as doomers. (Literally right before seeing this thread I saw another post on LW that is directionally correct but is mostly wrong or exaggerated in its details, and that was very well received.)
Both posts are broadly wrong about lots of things, about equally so, such that most people would be better off having never encountered either of them.
Tbc, my first-order intuitive impression is that Scott’s post is much more directionally accurate. But I expect that is because I constantly experience people knifing me, pushing me to take strategies that systematically destroy my ability to do anything while gaining approximately no safety benefit, or making claims about members of groups that include me that are false of me, whereas I don’t really experience any of the stuff that Rob gestures at, even though I expect it exists. Though Rob’s post doesn’t actually inform me of it, because his actual claims are false, and I cannot infer the underlying experiences that led him to make them. Another example of trapped priors if you don’t have second order corrections. (Tbc his follow-up post makes this substantially clearer.)
You probably already know I think this, but imo you should both quit on making public discourse in the AI safety community non-insane, and do other things that have a shot at working. (Since I know this will be misinterpreted by other readers, let me be clear that there are plenty of other kinds of public writing that do not fall in that bucket which I do think are worth doing.)
I endorse you taking the space to figure out how you want to relate and doing what’s right for you, I’ve increasingly updated to thinking that people doing things they’re not wholeheartedly behind tends to be net bad in all sorts of sideways ways, but the effort would be weaker for your loss. Wherever you end up, I appreciate you having taken the strategy of speaking in public about things that usually aren’t in a way that helped clarify the strategic situation for me many times.
(also, it’s scary to see three of the people I’d put in the upper tiers of good communication and understanding where we’re at with AI technically get into this intense conflict. I’m going to be thinking on this some and seeing if anything crystalizes which might help specifically, but in the meantime a few more general-purpose posts that might be useful memes for minimizing unhelpful conflict are A Principled Cartoon Guide to NVC, NVC as Variable Scoping, and Why Control Creates Conflict, and When to Open Instead)
In exchange, we would have won a couple more years of timeline, which would have been pointless, because timeline isn’t measured in distance from the year 1 AD, it’s measured in distance between some level of woken-up-ness and some point of danger, and the woken-up-ness would be pushed forward at the same rate the danger was.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
I think Scott’s “couple more years” wasn’t referring to a belief that EA could have successfully advocated for a couple of year pause, but rather referring to the change in timeline you’d have gotten if safety-sympathetic people refused to work on stuff that increases the pace of capabilities progress.
Oh, I see. That makes sense, I agree I misunderstood this part to be about something else (though I disagree similarly strongly with the correct interpretation, but it’s still good to clear that up).
I really don’t think Scott is gaslighting you. I think Scott is being honest here, but you should model him as having somewhat snapped. Pause AI and MIRI-adjacent people on X have been extremely adversarial and have been contributing to very bad discourse (even arguments-wise). I think Scott saw Rob’s post as very strawmannish and needlessly adversarial, and he more or less correctly lumped it in with this rising tide of terribleness, even if MIRI itself is definitely not as guilty. I might well be wrong about the specifics, but Scott Alexander isn’t the kind of person who tends to gaslight.
I think you need to be a lot more deflationary about the g-word. If you think, “But ‘gaslighting’ is something Bad people do; Scott Alexander isn’t Bad, so he would never do that”, well, that might be true depending on what you mean by the g-word. But if the behavior Habryka is trying to point to with the word is more like, “Scott is adopting a self-serving narrative that minimizes wrongdoing by his allies and inflates wrongdoing by his rivals” (which is something someone might do without being Bad due to having “somewhat snapped”), well, why wouldn’t the rivals reach for the g-word in their defense? What is the difference, from their perspective?
“Gaslighting” should probably be avoided because it is anywhere between meaningless and a fighting word depending on who says it and how.
The g-word is a very nasty accusation. It gets thrown around and means a bunch of stuff down to just “saying stuff I disagree with”, but it shouldn’t.
It is originally a conscious, malicious attempt to drive someone insane by strategically lying to them.
On the substance, people are honest but wrong an awful lot, and honest but massively overstating their case even more often. Assuming your rivals are malicious or dishonest when they’re just wrong or overstating is a huge source of conflict and thereby confusion.
It’s a really useful pointer towards a tactic that is relatively widespread and has no better word. I am personally happy to use other words, but I have the sense that sentences like “I am so very very tired of the ambiguous but ultimately strategic enough attempts at undermining my ability to orient in this situation by denying pretty clearly true parts of reality combined with intense implicit threats of consequences if I indicate I believe the wrong thing that might or might not be conscious optimizations happening in my interlocutors but have enough long-term coherence to be extremely unlikely to be the cause of random misunderstandings” would not work that well.
Yeah I would call that “gaslighting”. It looks like my initial interpretation of what you meant by it is closer than Zack’s. I think Scott isn’t doing that. I’m inclined to believe you when you say other people have behaved this way.
trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
Everything makes sense when you meditate on how the line between “cooperation” and “defection” isn’t in the territory; it’s a computed concept that agents in a variable-sum game have every incentive to “disagree” (actually, fight) about.
Consider the Nash demand game. Two players name a number between 0 and 100. If the sum is less than or equal to 100, you get the number you named as a percentage of the pie; if the sum exceeds 100, the pie is destroyed. There’s no unique Nash equilibrium. It’s stable if Player 1 says 50 and Player 2 says 50, but it’s also stable if Player 1 says 35 and Player 2 says 65 (or generally n and 100 − n, respectively).
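A minimal sketch of the game as described above can make the "no unique equilibrium" claim concrete. It assumes demands are restricted to whole numbers from 0 to 100 and that a destroyed pie pays both players 0; the function names `payoffs` and `is_nash` are illustrative, not from any library.

```python
def payoffs(d1, d2, pie=100):
    """Return each player's share given their demands (assumed integer)."""
    if d1 + d2 <= pie:
        return d1, d2  # compatible demands: each gets what they named
    return 0, 0        # incompatible demands: the pie is destroyed


def is_nash(d1, d2, pie=100):
    """True if neither player can gain by unilaterally changing their demand."""
    u1, u2 = payoffs(d1, d2, pie)
    demands = range(pie + 1)
    # Best payoff each player could reach while the other's demand stays fixed.
    best1 = max(payoffs(x, d2, pie)[0] for x in demands)
    best2 = max(payoffs(d1, y, pie)[1] for y in demands)
    return u1 == best1 and u2 == best2


# Enumerate every pair of integer demands that is a Nash equilibrium.
equilibria = [(a, b) for a in range(101) for b in range(101) if is_nash(a, b)]

print(is_nash(50, 50))  # even split
print(is_nash(35, 65))  # uneven split that still sums to 100
print(is_nash(40, 50))  # leaves pie on the table, so someone can demand more
```

Under these assumptions, every split that exactly exhausts the pie passes the check, which is the sense in which no single division is picked out as "the" cooperative one.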
The secret is that there are no natural units of pie (or, equivalently, how much pie everyone “deserves”). Everyone thinks that they’re being “cooperative” and that their partners are “defecting”, because they’re counting the pie differently: Player 1 thinks their slice is 35%, but Player 2 thinks the same physical slice is 65%.
If you don’t think your partner is treating you fairly, your leverage is to threaten to destroy surplus unless they treat you better. That’s what Alexander is doing when he says, “I would like to support it with praxis, but right now I feel very conflicted about this”. He’s saying, “You’d better give me a bigger slice, Player 1, or I’ll destroy some of the pie.”
That’s also what your brain is doing when you say you don’t want to work on this anymore. Scott doesn’t want you to quit! (Partially because he values Lightcone’s work, and partially because it would look bad for him if you can publicly blame your burnout on him.) Crucially, your brain knows this. By threatening to quit in frustration, you can probably get Scott to apologize and give your arguments a fairer hearing, whereas in the absence of the threat, he has every incentive to keep being motivatedly dumb from your perspective.
You have a strong hand here! The only risk is if your counterparties don’t think you’d ever actually quit and start calling your bluff. In this case, we know Scott is a pushover and will almost certainly fold. But if you ever face stronger-willed counterparties, you might need to shore up the credibility of your threat: conspicuously going on vacation for a week to think it over will get taken more seriously than an “I don’t know if I want to do this anymore” comment.
(Sorry, maybe you already knew all that, but weren’t articulating it because it’s not part of the game? I don’t think I’m worsening your position that much by saying it out loud; we know that Scott knows this stuff.)
That’s also what your brain is doing when you say you don’t want to work on this anymore. Scott doesn’t want you to quit! (Partially because he values Lightcone’s work, and partially because it would look bad for him if you can publicly blame your burnout on him.) Crucially, your brain knows this.
Man, I really wish this was the case, and it’s a non-zero part of what is going on, but the vast majority of what I am expressing with my (genuine) desire to quit is the stress and frustration associated with the gaslighting, which is one level more abstract than the issue you talk about.
Like yes, there is a threat here being like “for fuck’s sake, stop gaslighting or I am genuinely going to blow up my part of the pie”, but it’s not actually about the object level, and I don’t actually have much of any genuine hope of that working in the same way one might expect from a negotiation tactic.
I am just genuinely actually very tired, and Scott changing his mind on this and going “oh yeah, actually you are right” actually wouldn’t do much to make me want to not quit, because it wouldn’t address the continuous gaslighting where every time anyone tries to talk about any of the adversarial dynamics, they immediately get told this is all made up and get repeated “I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs” and “everyone is being honest all the time and actually it’s just you who is lying right now and always”.
Yeah, the frustrating part is almost always on a meta level. I think Zack’s point about “No natural units of pie” applies to the gaslighting issue as well though. Asserting one’s viewpoint means asserting it as truth which invalidates differing perspectives. “I disagree, you contradict, he gaslights”.
It’s difficult because sometimes the gas lights really don’t seem to be dimming, and sometimes that perception is downstream of some motivated thinking because I really don’t want to believe we’re running out of oil already, dammit. And so the result is simultaneously kinda an honest statement of perspective (at least, as honest as these tend to get) while also being a (not-necessarily-consciously) motivated action pushing people to disregard their own senses. And then we have to decide how to judge this mess of bias and honesty, and if we don’t judge such that the product after a round trip of perceiving C/D and responding accordingly we get more C than last time… shit’s fucked. And without objective units of pie that people can agree on when judging who was in the wrong.
So like… am I trying to gaslight people into questioning their own sanity so they accept what I want them to accept, or am I just flinching away from what scares me, like we all do? Both, and the question of whether I deserve the leniency and empathy is a difficult one, because what are the units of this pie and where’s the objective cutoff? And because our tolerance for further bullshit tends to diminish after accumulating bullshit, so it gets even more difficult to get back to the other side of criticality.
“It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.”
Theodore Roosevelt, “Citizenship in a Republic,” Speech at the Sorbonne, Paris, April 23, 1910
To clarify the claim I’m making: I’m not trying to throw EA under a bus. This thread spun off from a discussion where I said I thought EA’s net impact on AI x-risk was probably positive, but I was highly uncertain.
Somebody asked what the bad components of EA’s impact were, and I went off on Anthropic, and on EA’s (and especially OpenPhil’s) entanglement with the company and their support for Anthropic’s operations. (To the extent that a lot of x-risk-adjacent EA seems to function, in practice, as a talent pipeline for Anthropic.)
I also said that I think OpenPhil’s bet on OpenAI was a disaster. And I said that there’s a culture of caginess, soft-pedaling, and trying-to-sound-reassuringly-mundane that I think has damaged AI risk discourse a fair amount, and that various people in and around OpenPhil have contributed to.
I’m restating this partly to be clear about what my exact claims are. E.g., I’m not claiming that items 1+2+3 are things OpenPhil and Anthropic leadership would happily endorse as stated. I deliberately phrased them in ways that highlight what I see as the flaws in these views and memes, in the hope that this could help wake up some people in and around OpenPhil+Anthropic to the road they’re walking.
This may have been the wrong conversational tack, but my vague sense is that there have been a lot of milder conversations about these topics over the years, and they don’t seem to have produced a serious reckoning, retrospective, or course change of the kind I would have expected.
I hoped it was obvious from the phrasing that 1-3 were attempting to embed the obvious critiques into the view summary, rather than attempting to phrase things in a way that would make the proponent go “Hell yeah, I love that view, what a great view it is!” If this confused anyone, I apologize for that.
I wasn’t centrally thinking of Holden’s public communication in the OP, though I think if he were consistently solid at this, Aysja Johnson wouldn’t have needed to write this in response to Holden’s defense of Anthropic ditching its core safety commitments.
“Dario said there’s a 25% chance ‘things go really, really badly’”
I feel like this is a case in point. Like, sure, counting up from 0 (“the average corporation building the average product doesn’t try to warn the public about their product, except in ways mandated by law!”), Anthropic’s doing great. Or if the baseline is “is Anthropic doing better than pathological liar Sam Altman?”, then sure, Anthropic is doing better than OpenAI on candor.
If we’re instead anchoring to “trying to build a product that massively endangers everyone in the world is an incredibly evil sort of thing to do by default, and to even begin to justify it you need to be doing a truly excellent job of raising the loudest possible alarm bells alongside dozens of other things”, then I don’t think Anthropic is coming close to clearing that bar.
“Things go really, really badly”? Nobody outside the x-risk ecosystem has any idea what that means. And this is not the kind of claim Anthropic or Dario has ever tried to spotlight. You won’t find a big urgent-looking banner on the front page of Anthropic loudly warning the public, in plain terms, about this technology, and asking them to write their congressman about it. You won’t even find it tucked away in a press release somewhere. Dario gave a number when explicitly asked, in an on-stage interview.
If we’re setting the bar at 0, then maybe we want to call this an amazing act of courage, when he could have ducked the question entirely. But why on earth would we set the bar at 0? Is the social embarrassment of talking about AI risk in 2025 so great that we should be amazed when Dario doesn’t totally dodge the topic, while running one of the main companies building the tech?
“Meanwhile, you seem to be treating all these people as basically equivalent to Gary Marcus.”
I think Dario has been more reasonable on this issue than Gary Marcus. I also don’t think “clearing Gary Marcus” is the criterion we should be using to judge the CEO of Anthropic.
“I think this ‘debate’ isn’t about OpenPhil or Anthropic failing to say they’re extremely worried”
Specifically, this debate (from my perspective) isn’t about whether Anthropic or others have ever said anything scary-sounding, if an x-risk person goes digging for cherry-picked quotes to signal-boost. The question is whether the average statement from Anthropic, weighted by how visible Anthropic tries to make that statement, is adequate for informing the uninformed about the insane situation we’re in.
Is the average statement from Dario or Anthropic communicating, “Holy shit, the technology we and our competitors are building has a high chance of killing us all or otherwise devastating the world, on a timescale of years, not decades. This is terrifying, and we urgently call on policymakers and researchers to help find a solution right now”? Or is it communicating, “Mythos is our most aligned model yet! ☺️ Powerful AI could have benefits, but it could have costs too. AI is a big deal, and it could have impacts and pose challenges! We are taking these very seriously! Also, unlike our competitors, Claude will always be ad-free! We’re a normal company talking about the importance of safety and responsibility in this transformative period. ☺️”
If Anthropic’s messaging were awful, but Dario’s personal communications were reliably great, then I’d at least give partial credit. But Dario’s messaging is often even worse than that. Dario has been the AI CEO agitating the earliest and loudest for racing against China. He’s the one who’s been loudest about there being no point in trying to coordinate with China on this issue. “The Adolescence of Technology” opens with a tirade full of strawmen of what seems to be Yudkowsky/Soares’ position (https://x.com/robbensinger/status/2016607060591595924), and per Ryan Greenblatt, the essay sends a super misleading message about whether Anthropic “has things covered” on the technical alignment side (https://x.com/RyanPGreenblatt/status/2016553987861000238):
“Dario strongly implies that Anthropic ‘has this covered’ and wouldn’t be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered[....] I think it’s unhealthy and bad for AI companies to give off a ‘we have this covered and will do a good job’ vibe if they actually believe that even if they were in the lead, risk would be very high. At the very least, I expect many employees at Anthropic working on alignment, safety, and security don’t believe Anthropic has the situation covered.”
I also strongly agree with Ryan re:
“I think it’s important to emphasize the severity of outcomes and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.”
“I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).”
(And I claim that Anthropic leadership has been doing this for years; “The Adolescence of Technology” is not a one-off.)
On podcast interviews, Dario sometimes lets slip an unusually candid and striking statement about how insane and dangerous the situation is, without couching it in caveats about how Everything Is Uncertain and More Evidence Is Needed and It’s Premature For Governments To Do Much About This. Sometimes, he even says it in a way that non-insiders are likely to understand. But when he talks to lawmakers, he says things like:
“However, the abstract and distant nature of long-term risks makes them hard to approach from a policy perspective: our view is that it may be best to approach them indirectly by addressing more imminent risks that serve as practice for them.”
Never mind the merits of “the policy world should totally ignore superintelligence”. Even if you agree with that (IMO extreme and false) claim, there is no justifying calling these risks “long-term”, “abstract”, and “distant” when you have timelines a fraction as aggressive as Dario’s!!
See also Jack Clark’s communication on this issue, and my criticism at the time (https://x.com/robbensinger/status/1834325868032012296). This was in 2024. I don’t think it’s great for Dario to be systematically making the same incredibly misleading elisions two years after this pretty major issue was pointed out to his co-founder.
“It’s about OpenPhil in particular being pretty careful how they phrase things for public consumption. And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems”
I’m not criticizing Anthropic or Open Phil for being “careful how they phrase things”. I’m criticizing them for being careful in exactly the wrong direction. Any communication they send out that sends a “we have things covered, this is business-as-usual, no need to worry” signal is potentially not just factually misleading, but destructive of society’s ability to orient to what’s happening and course-correct. Anthropic is the “Machines of Loving Grace” company; it’s exactly the company that has put way more effort, early and often, into communicating how powerful and cool this technology is, while being consistently nervous and hedged about alerting others to the hazards.
This is exactly the opposite of what “being careful how you phrase things” should look like. Anthropic should have internal processes for catching any tweet that risks implicitly sending a “this is business-as-normal” or “we have everything handled” message, to either filter those out or flag them for evaluation. Sending that kind of message is much more dangerous than any ordinary reputational risk a company faces.
Re ‘MIRI is saying strategy is bad, but if MIRI had been strategic then they might not have started the deep learning revolution’: I think that this just didn’t happen. Per the https://x.com/allTheYud/status/2042362484976468053 thread, I think this is just a myth that propagates because it’s funny. (And because Sam Altman is good at spreading narratives that help him out.)
I don’t think MIRI accelerated timelines on net, and if it did, I don’t think the effect was large. I’d also say that if this happened, it was in spite of one of MIRI’s top obsessions for the last 20+ years being “be ultra cautious around messaging that could shorten AI timelines”.
(Like, as someone who’s been at MIRI for 13 years, this is literally one of the top annoying things constraining everything I’ve written and all the major projects I’ve seen my colleagues work on. Not because we think we’re geniuses sitting on a trove of capabilities insights, but just because we take the responsibility of not-accidentally-contributing-to-the-race extraordinarily seriously.)
But whatever, sure. If you want to accuse MIRI of hypocrisy and say that we’re just as culpable as the AI labs, go for it. You can think MIRI is terrible in every way and also think that the Anthropic cluster is not handling AI risk in a remotely responsible way.
Set aside the years of Anthropic poisoning the commons with its public messaging, poisoning efforts at international coordination by being the top lab preemptively shitting on the possibility of US-China coordination, and poisoning the US government’s ability to orient to what’s happening by selling half-truths and absurd frames to Senate committees.
Even without looking at their broad public communications, and without critiquing what passes for a superintelligence alignment or deployment plan in Anthropic’s public communications, Anthropic has behaved absurdly irresponsibly, lying to the public about their RSP being a binding commitment, lying to their investors re ‘we’re not going to accelerate capabilities progress’, and specifically targeting the most dangerous and difficult-to-control AI capabilities (recursive self-improvement) in a way that may burn years off of the remaining timeline.
“What they haven’t said is ‘the situation is totally hopeless and every strategy except pausing has literally no chance of working’, but that isn’t a comms problem, that’s because they genuinely believe something different from you.”
Just to be clear: nowhere in this thread, or anywhere else, have I asked Anthropic to say something like that. Everything I’ve said above is compatible with thinking that Anthropic has a chance at solving superintelligence alignment. “I think I have a chance at solving superintelligence alignment!” is not an excuse for Anthropic or Dario’s behavior.
“Your claim that ‘governments are incredibly trigger-happy about banning things...there’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI’ is too glib”
I agree it’s too glib as an argument for “international coordination to ban superintelligence is easy”. It isn’t easy. In the context of a conversation where most people are seriously underweighting the possibility, “governments have been known to ban scary or weird tech” and “governments have been known to enact policies that cost them money” are useful correctives, but they should be correctives pointing toward “this seems hard but maybe doable”, not “this seems easy”.
“But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.”
How are we doing that, exactly?
Like, this is one of the most foregrounded claims in Dario’s essay. He repeats a bunch of easily-checked falsehoods about the MIRI argument, at the very start of the essay, while warning that this view’s skepticism about alignment tractability is a “self-fulfilling belief”. He then proceeds to shit on the possibility of the US coordinating with China to avoid building superintelligence, which seems like a much more classic example of “belief that could easily be self-fulfilling”.
What is the mechanism whereby Dario criticizing MIRI is “cooperating” (is it that he didn’t mention us by name, preventing people from fact-checking any of his claims?), and MIRI staff criticizing Dario is “defecting”? What, specifically, is the wrench I’m throwing in Anthropic’s plans by tweeting about this? Is a key researcher on Chris Olah’s team going to get depressed and stop doing interpretability research unless I contribute to the “Anthropic is the Good Guys and OpenAI is the Bad Guys” narrative? Is Anthropic at risk of losing its lead in the race if MIRI people are open about their view that all the labs are behaving atrociously? Should I have dropped in a claim that everyone who disagrees with me is “quasi-religious”, the same way Dario’s cooperative essay begins?
If you think I’m factually mistaken, as you said at the start of your reply, then that makes sense. But surely that would be an equally valid criticism whether I were saying pro-Anthropic stuff or anti-Anthropic stuff. Why this separate “MIRI is defecting” idea?
“I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.”
Yeah. And when MIRI voiced early skepticism of OpenAI in private conversation, we were told that it was crucial to support Sam and Elon’s effort because Demis was untrustworthy. Counting up from zero, OpenAI could be framed as amazing progress: a nonprofit! Run by people vocally alarmed about x-risk! And they’re struggling for cash in the near term (in spite of verbal promises of funding from Musk), which gives us an opportunity to buy seats on the board!
Anthropic may or may not be slightly better than OpenAI. OpenAI may or may not be slightly better than DeepMind. I don’t think the lesson of history is that OpenPhil-cluster people are good at telling the difference between “this is marginally better than what the other guys are doing” and “this is good enough to actually succeed”.
But nothing I’ve said above depends on that claim. You can disagree with me about how likely Anthropic is to save the world, and still think there’s an egregious candor gap between the average Anthropic public statement and the scariest paragraphs buried in “The Adolescence of Technology”, and a further egregious candor gap between “The Adolescence of Technology” and e.g. Ryan Greenblatt’s post or https://x.com/MaskedTorah/status/2040270860846768203.
I don’t think the “circle-the-wagon” approach has served EA well throughout its history, and I don’t think people self-censoring to that degree is good for governments’ or labs’ ability to orient to reality.
Some helpful points, thanks. I responded in more depth on Twitter, but I don’t want to duplicate every conversation there here, so I’m just signposting that people should check the thread there for most of my opinions.
I respect this—all of our options are bad and unlikely to work, the situation is desperate, and I have no plan better than playing a portfolio of all the different desperate hard strategies in the hopes that one of them works.
I used to support such a portfolio approach, but subsequently realized that it’s actually not safe (i.e., is potentially net-negative even aside from opportunity costs), or the portfolio has to be restricted a lot. This is because, given the existence of illegible AI safety problems, solving some (i.e., more legible) AI safety problems can actually make the overall situation worse, by increasing the chances of an unsafe AI being developed or deployed.
According to this logic, safer strategies include:
Pausing AI, and other actions that help broadly with both legible and illegible problems, like improving societal epistemic health.
Making illegible problems more legible.
Working directly on illegible problems.
Another reason to think that many “AI safety strategies” are actually not safe is that even nominally altruistic humans are more power/status-seeking[1] than actually altruistic, and one way this manifests is that they tend to neglect risks more than they should (if they were actually altruistic). See my Managing risks while trying to do good. BTW these days I think not making this idea more prominent early in rationalism/EA/AI safety is a core failure that is upstream of many other errors.
For the purposes of this argument to work, it’s important that the legible problems are so legible that a lack of solutions would prevent deployment.
When previously asked which problems were in this category, you said:
The most legible problem (in terms of actually gating deployment) is probably wokeness for xAI, and things like not expressing an explicit desire to cause human extinction, not helping with terrorism (like building bioweapons) on demand, etc., for most AI companies
Now, I would actually say that this list overestimates AI companies’ willingness to gate deployment on unsolved problems. There have been many woke versions of Grok, suggesting they weren’t gating deployments on that. I think most current models can be jailbroken into helping with terrorism (they’re just not smart enough to be very helpful yet). It remains to be seen whether companies will hold off on releasing models that could help a lot with terrorism. I’m not so sure they will.
But even if we took this at face value: It doesn’t seem like avoiding work on these mentioned problems would mean restricting the portfolio a lot. When referring to “playing a portfolio of all the different desperate hard strategies in the hopes that one of them works”, I think that’s mostly about solving problems that wouldn’t prevent deployment if they were unsolved, or gathering evidence for such illegible problems. (Centrally: The problem of scheming models taking over the world, which is not one that I expect companies to wait for a solution on absent further evidence that it’s a problem.)
Applying the idea is tricky and context-dependent. For example, gathering evidence for scheming seems unambiguously good, but actually solving scheming could be bad (unless you’re sure that such evidence can’t be gathered, or companies will not gate on this problem regardless), because some time in the future, it may well become legible enough to be gating deployment. (Also keep in mind that it’s not just legibility/gating by the companies, but also by other policymakers such as voters and politicians.)
Given the tradeoffs apparent to me (including that the benefits of solving scheming are limited by other safety problems), I think it may well be an example of a safety problem that is net negative to work on, and something I wouldn’t want to do myself. But I’m unsure how to argue for this convincingly (and also am just not certain enough to want to talk other people out of working on this specifically) which is why I’m only talking about it in response to your comment.
FWIW, on my views, work to prevent scheming looks pretty clearly great. Pausing to wait for a solution to scheming doesn’t seem super likely, and going from [scheming models widely deployed] –> [non-scheming models widely deployed] seems significantly more valuable than going from [non-scheming models widely deployed] –> [temporary pause to solve scheming].
A lot of the listed topics here are problems that we could have plenty of time to work on after the singularity. I’m sympathetic to arguments that bad things might get locked-in, but I don’t really think the arguments for this have a disjunctive nature where we’re very likely to run into at least one type of bad lock-in. There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much. (The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.)
There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much.
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause.
The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.
One issue related to this is that humans today largely want to do good things as a side effect of virtue signaling / status games that they’re doing/playing. This is currently far from optimal, which makes me scared to undergo an AI transition that could potentially lock-in such highly suboptimal motivations/values, and also scared that the AI transition could just scramble or reset these status games and remove what good motivations/values we do have. A pause would preserve the status quo and give people more time to think about such issues (including time for the idea to spread), and potentially find ways to make the AI transition go better in these regards (compared to today when there has been almost no thought on these issues at all).
But see also this recent quick take where I expressed that my optimism about a pause is pretty limited.
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause
If the society is “more capable than us” in some average sense, where we still have certain advantages over them, then I agree that we could still contribute things.
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wiser than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Idk, I don’t particularly want to argue about definitions here. I just think there’s a decent chance that I’ll look back after the singularity and be like “yep, the sloppy transition sure meant that we took on a bunch of ex-ante risk, but since we got lucky, extra pause time wouldn’t have helped vis-a-vis the long-run lock-in issues. Anything they could have done to help is stuff we can do better now.” (And/or: Marginal pause time may have been good or bad via various values or power changes, but it wouldn’t have systematically led to improvements from everyone’s perspective by e.g. enabling additional intellectual work, because it turns out it was fine to defer the relevant intellectual work until later.)
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wiser than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Even this society, if it’s in the future, then part of the transition would have already occurred, so they won’t have the opportunity to make it go better. So by not pausing now we’d permanently give up this opportunity.
Take the issue in this recent comment, of building an initial AGI that reasons well or poorly about domains that lack fast/cheap feedback signals. It seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first. Suppose we end up building one that reasons poorly about such domains, then:
The post-AGI civilization may end up being less capable (and good) than us on average, or in some important ways.
Even if they’re actually more capable (and good) than us in all the important ways, they could have been even better if only we had built an AGI that reasons well in such domains, but they can’t go back in time and change this.
seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first
I of course agree, but I’d think this would mostly be an issue of capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
Copying over my response to Scott from Twitter (with a few additions in square brackets):
I think my biggest disagreement here is about the concept of strategic communications.
In particular, you claim that MIRI should have been more PR-strategic to avoid hyping AI enough that DeepMind and OpenAI were founded.
Firstly, a lot of this was not-very-MIRI. E.g. contrast Bostrom’s NYT bestseller with Eliezer popularizing AI risk via fanfiction, which is certainly aimed much more at sincere nerds. And I don’t think MIRI planned (or maybe even endorsed?) the Puerto Rico conference.
But secondly, even insofar as MIRI was doing that, creating a lot of hype about AI is also what a bunch of the allegedly PR-strategic people are doing right now! Including stuff like Situational Awareness and AI 2027, as well as Anthropic. [So it’s very odd to explain previous hype as a result of not being strategic enough.]
You could claim that the situation is so different that the optimal strategy has flipped. That’s possible, although I think the current round of hype plausibly exacerbates a US-China race in the same way that the last round exacerbated the within-US race, which would be really bad.
But more plausible to me is the idea that being loud and hype-y is often a kind of self-interested PR strategy which gets you attention and proximity to power without actually making the situation much better, because power is typically going to do extremely dumb stuff in response. And so to me a much better distinction is something like “PR strategies driven by social cognition” (which includes both hyping stuff and also playing clever games about how you think people will interpret you) vs “honest discourse”.
To be clear I don’t have a strong opinion about how much IABIED fits into one category vs the other, seems like a mix. A more central example of the former is Situational Awareness. A more central example of the latter is the Racing to the Precipice paper, which lays out many of the same ideas without the social cognition.
My other big disagreement is about which alignment work will help, and how. Here I have a somewhat odd position of both being relatively optimistic about alignment in general, and also thinking that almost all work in the field is bad. This seems like too big a thing to debate here but maybe the core claim is that there’s some systematic bias which ends up with “alignment researchers” doing stuff that in hindsight was pretty clearly mainly pushing capabilities.
Probably the clearest example is how many alignment researchers worked on WebGPT, the precursor to ChatGPT. If your “alignment research” directly leads to the biggest boost for the AI field maybe ever, you should get suspicious! I have more detailed models of this which I’ll write up later, but suffice to say that we should strongly expect Ilya to fall into similar traps (especially given the form factor of SSI) and probably Jan too. So without defusing this dynamic, a lot of your claimed wins don’t stand up.
More than any other group I’ve been a part of, rationalists love to develop extremely long and complicated social grievances with each other, taking pages and pages of text to articulate. Maybe I’m just too stupid to understand the high level strategic nuances of what’s going on—what are these people even arguing about? The exact flavor of comms presented over the last ten years?
As someone who spends a significant part of his time briefing policymakers in Europe, ministerial advisors, senior civil servants in AI governance, I want to point out something obvious from where I stand, but absent from this discussion.
The “radical transparency vs. strategic communication” debate presupposes that framing is the bottleneck. It isn’t. The bottleneck is volume. Most policymakers have never heard the argument, no matter how you frame it. Among the ones I interact with, maybe 2% have been exposed to the problem enough to have an opinion. Another 10% or so have heard something, but mostly through the Yann LeCun-adjacent dismissals, and formed their view from that. The remaining ~88%, including people in very important AI governance positions, have simply never had the conversation.
The question of which approach works better is real but secondary. What’s missing is more people doing this work at all. It’s a campaign, and the limiting factor is coverage, not the message.
To give a concrete data point: the only policymaker in my circles who has ever brought up “If Anyone Builds It, Everyone Dies” is Lord Tim Clement-Jones, chair of the All-Party Parliamentary Group on AI in the UK. And he was probably already sympathetic. That’s one person.
Among other things, the fact that one of the leading ASI labs is substantially downstream of us. Separately, a lot of real actual politics that tends to happen in the community around prestige and money and talent allocation and respect, which needs to get litigated somehow (and abuse of power and legitimacy is common and if you can’t talk about it you can’t have norms about it).
I think if your main interaction with PauseAI is a certain Twitter account, as served to you by the algorithm in interactions with your AI safety friends, then you might think that they’re mostly going after other, more moderate safety advocates. But this just isn’t a good picture of the overall actions of the movement. At least in the case of PauseAI UK, of which I have a decent understanding of our inner workings, essentially zero time is spent thinking about other AI safety advocates. I expect that the same is true of Yudkowsky and MIRI.
Of course it is the case that being rude on Twitter towards people working on safety teams at OpenAI makes some things worse on some axes. And this is mostly bad and pointless and I don’t endorse it. But that’s not even really what that post from Rob was doing! Rob was writing an opinionated, but civil, criticism. In what way is this “knifing” the other AI safety advocates? It’s not like MIRI killed SB 1047.
Now if Scott means something like “Giving money to MIRI pushes the world in the MIRI-preferred direction, and this would have meant no Anthropic and no safety team at OpenAI” then I can kind of maybe see what he means here. This just isn’t “knifing” in the sense of the betrayal that most people mean by the word. It’s just opposing someone’s plan, in a way that they’ve been doing for years. It’s not like MIRI would have actually used marginal resources to stop Anthropic from being created by, like, sabotage or something.
MIRI don’t even say that working in safety is bad! They only say that they think their approach is better. IABIED specifically states that they think mech interp researchers are “heroes” (as part of an example of research they think won’t work in time without political action).
There is a phenomenon in which rationalists sometimes make predictions about the future, and they seem to completely forget their other belief that we’re heading toward a singularity (good or bad) relatively soon. It’s ubiquitous, and it kind of drives me insane. Consider these two tweets:
Timelines are really uncertain and you can always make predictions conditional on “no singularity”. Even if singularity happens you can always ask superintelligence “hey, what would be the consequences of this particular intervention in business-as-usual scenario” and be vindicated.
Why would they spend ~30 characters in a tweet to be slightly more precise while making their point more alienating to normal people who, by and large, do not believe in a singularity and think people who do are faintly ridiculous? The incentives simply are not there.
And that’s assuming they think the singularity is imminent enough that their tweets won’t be borne out even beforehand. And assuming that they aren’t mostly just playing signaling games: both of these tweets read less as sober analysis to me, and more like in-group signaling.
Absolutely agreed. Wider public social norms are heavily against even mentioning any sort of major disruption due to AI in the near future (unless limited to specific jobs or copyright), and most people don’t even understand how to think about conditional predictions. Combining the two is just the sort of thing strange people like us do.
This is true, but then why not state “conditional on no singularity” if they intended that?
Because that’s a mouthful? And the default for an ordinary person (which is potentially most of their readers) is “no Singularity”, and the people expecting the Singularity can infer that it’s clearly about a no-Singularity branch.
Conditional on being around to look back, it seems pretty plausible to me that lack of trust and competence within major powers will have made the outcome of AGI significantly worse than it could have been.
A (partial, not very good) analogy is that, at this point, the developed world is pretty altruistic towards the developing world (e.g. to the tune of many billions of dollars of aid per year). But the developing world might still really wish it’d had fewer internal ethno-religious fractures during the Industrial Revolution (or indeed at any time since then).
For a while now, some people have been saying they ‘kinda dislike LW culture,’ but for two opposite reasons, with each group assuming LW is dominated by the other—or at least it seems that way when they talk about it. Consider, for example, janus and TurnTrout who recently stopped posting here directly. They’re at opposite ends and with clashing epistemic norms, each complaining that LW is too much like the group the other represents. But in my mind, they’re both LW-members-extraordinaires. LW is clearly obviously both, and I think that’s great.
I think it’s probably more of a spectrum than two distinct groups, and I tried to pick two extremes. On one end, there are the empirical alignment people, like Anthropic and Redwood; on the other, pure conceptual researchers and the LLM whisperers like Janus, and there are shades in between, like MIRI and Paul Christiano. I’m not even sure this fits neatly on one axis, but probably the biggest divide is empirical vs. conceptual. There are other splits too, like rigor vs. exploration or legibility vs. ‘lore,’ and the preferences kinda seem correlated.
Whenever I try to “learn what’s going on with AI alignment” I wind up on some article about whether dogs know enough words to have thoughts or something. I don’t really want to kill off the theoretical term (it can peek into the future a little later and function more independent of technology, basically) but it seems like kind of a poor way to answer stuff like: what’s going on now, or if all the AI companies allowed me to write their 6 month goals, what would I put on it.
I’m curious about what people disagree with regarding this comment. Also, I guess since people upvoted and agreed with the first one, they do have two groups in mind, but they’re not quite the same as the ones I was thinking about (which is interesting and mildly funny!). So, what was your slicing up of the alignment research x LW scene that’s consistent with my first comment but different from my description in the second comment?
To a first approximation, if people at both ends of a dimension in a group are about equally unhappy with what the moderate middle does (assuming the middle is actually reasonable, which is hard to know), then the group is probably balanced.
People are very worried about a future in which a lot of the Internet is AI-generated. I’m kinda not. So far, AIs are more truth-tracking and kinder than humans. I think the default (conditional on OK alignment) is that an Internet that includes a much higher population of AIs is a much better experience for humans than the current Internet, which is full of bullying and lies.
All such discussions hinge on AI being relatively aligned, though. Of course, an Internet full of misaligned AIs would be bad for humans, but the reason is human disempowerment, not any of the usual reasons people say such an Internet would be terrible.
I think the problem is that the competitive dynamics that make humans worse on the internet (eg short epistemically-ungrounded outrage bait gets more engagement than more careful and reasoned analysis) will apply to AIs as well as to humans.
Yup, but the AIs are massively less likely to help with creating cruel content. There will be a huge asymmetry in what they will be willing to generate.
Imagine an Internet where half the population is Grant Sanderson (the creator of 3Blue1Brown). That’d be awesome. Grant Sanderson has the same incentives as anyone else to create cruel and false content, but he just doesn’t.
But I don’t think that the majority of people in the world would prefer that to the current internet, much less actually engage with it more than the current internet. Most people find math boring (even when it is explained as well as when Grant does the explaining). There would be an incentive to produce content that is more engaging for most of the population than linear algebra explanations.
One difference between the releases of previous GPT versions and the release of GPT-5 is that it was clear that the previous versions were much bigger models trained with more compute than their predecessors. With the release of GPT-5, it’s very unclear to me what OpenAI did exactly. If, instead of GPT-5, we had gotten a release that was simply an update of 4o + a new reasoning model (e.g., o4 or o5) + a router model, I wouldn’t have been surprised by their capabilities. If instead GPT-4 were called something like GPT-3.6, we would all have been more or less equally impressed, no matter the naming. The number after “GPT” used to track something pretty specific that had to do with some properties of the base model, and I’m not sure it’s still tracking the same thing now. Maybe it does, but it’s not super clear from reading OpenAI’s comms and from talking with the model itself. For example, it seems too fast to be larger than GPT-4.5.
For example, it seems too fast to be larger than GPT-4.5.
A “GPT-5” named according to the previous convention in terms of pretraining compute would need at least 1e27 FLOPs (50x original GPT-4), which on H100/H200 can at best be done in FP8. That could be done with 150K H100s for 3 months at 40% utilization. (GB200 NVL72 is too recent to use for this pretraining run, though there is a remote possibility of B200.) A compute optimal shape for this model would be something like 8T total params, 1T active[1].
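As a rough sanity check of the figure above, here is a minimal sketch of the arithmetic. The per-chip throughput of about 2e15 dense FP8 FLOP/s for an H100 is an assumption that is not stated in the comment, and "3 months" is taken as 90 days; the function name is purely illustrative.

```python
# Rough check: can 150K H100s reach ~1e27 FLOPs in about 3 months?
# ASSUMPTION: ~2e15 dense FP8 FLOP/s per H100 (approximate, not from the comment).
H100_FP8_DENSE_FLOPS = 2e15
SECONDS_PER_DAY = 24 * 3600


def training_flops(num_chips, flops_per_chip, utilization, days):
    """Total FLOPs delivered by a cluster over a training run."""
    return num_chips * flops_per_chip * utilization * days * SECONDS_PER_DAY


# Figures from the comment: 150K chips, 40% utilization, ~3 months (90 days assumed).
total = training_flops(150_000, H100_FP8_DENSE_FLOPS, 0.40, 90)
print(f"estimated pretraining compute: {total:.2e} FLOPs")
```

Under these assumptions the result lands close to 1e27 FLOPs, which is consistent with the estimate in the comment; a different throughput or run length shifts it proportionally.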
The speed of GPT-5 could be explained by using GB200 NVL72 for inference, even if it’s an 8T total param model. GPT-4.5 was slow and expensive likely because it needed many older 8-chip servers (which have 0.64-1.44 TB of HBM) to keep in HBM with room for KV caches, but a single GB200 NVL72 has 14 TB of HBM. At the same time, it wouldn’t help as much with the speed of smaller models (but it would help with their output token cost because you can fit more KV cache in the same NVLink domain, which isn’t necessarily yet being reflected in prices, since GB200 NVL72 is still scarce). So it remains somewhat plausible that GPT-5 is essentially GPT-4.5-thinking running on better hardware.
Its performance (quality, not speed) though suggests that it might well be a smaller model with pretraining scale RLVR (possibly like Grok 4, just done better). Also, doing RLVR on an 8T total param model using the older 8-chip servers would be slow/inefficient, and GB200 NVL72 might’ve only started appearing in large numbers in late Apr 2025. The METR report on GPT-5 states they gained access “four weeks prior to its release”, which means it was already essentially done by the end of Jun 2025. So RLVRing a very large model on GB200 NVL72 is in principle possible in this timeframe, but probably not what happened, and more to the point, given its level of performance, probably not what needed to happen. This way, they get a better within-model gross margin and can work on the actual very large model in peace; maybe they’ll call it “GPT-6”.
The speed of GPT-5 could be explained by using GB200 NVL72 for inference, even if it’s an 8T total param model.
Ah, interesting! So the speed we see shouldn’t tell us much about GPT-5’s size.
I omitted one other factor from my shortform, namely cost. Do you think OpenAI would be willing to serve an 8T params (1T active) model for the price we’re seeing? I’m basically trying to understand whether GPT-5 being served for relatively cheap should be a large or small update.
Prefill (processing of input tokens) is efficient, something like 60% compute utilization might be possible, and that only depends on the number of active params. Generation of output tokens is HBM bandwidth bound, depends on the number of total params and the number of KV cache sequences for requests in a batch that fit on the same system (which share the cost of chip-time[1]). With GB200 NVL72, batches could be huge, dividing the cost of output tokens (still probably several times more expensive per token than prefill).
For prefill, we can directly estimate at-cost inference from the capital cost of compute hardware, assuming a need to pay it back in 3 years (it will likely serve longer but become increasingly obsolete). An H100 system costs about $50K per chip ($5bn for a 100K H100s system). This is all-in for compute equipment, so with networking but without buildings and cooling, since those serve longer and don’t need to be paid back in 3 years. Operational costs are maybe below 20%, which gives $20K per year per chip, or $2.3 per H100-hour. On gpulist, there are many listings at $1.80 per H100-hour, so my methodology might be somewhat overestimating the bare bones cost.
For GB200 NVL72, which are still too scarce to get a visible market price anywhere close to at-cost, the all-in cost together with external networking in a large system is plausibly around $5M per 72-chip rack ($7bn for a 100K chip GB200 NVL72 system, $30bn for Stargate Abilene’s 400K chips in GB200/GB300 NVL72 racks). This is about $70K capital cost per chip, or about $27.7K per year to pay it back in 3 years with 20% operational costs. This is just $3.2 per chip-hour.
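The amortization arithmetic behind both chip-hour figures can be reproduced with a short sketch. It assumes, as the numbers above imply, that the 20% operational cost is added on top of the yearly capital payback; the function and parameter names are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def at_cost_chip_hour(capital_per_chip_usd, payback_years=3.0, opex_fraction=0.20):
    """Rough at-cost price of one chip-hour in USD.

    Assumption: operational costs are modeled as a flat fraction added
    on top of the yearly capital payback.
    """
    yearly_capital = capital_per_chip_usd / payback_years
    yearly_total = yearly_capital * (1.0 + opex_fraction)
    return yearly_total / HOURS_PER_YEAR


# H100: about $50K all-in per chip.
h100 = at_cost_chip_hour(50_000)

# GB200 NVL72: about $5M per 72-chip rack, so roughly $69K per chip.
gb200 = at_cost_chip_hour(5_000_000 / 72)

print(f"H100:  ${h100:.2f} per chip-hour")
print(f"GB200: ${gb200:.2f} per chip-hour")
```

Under these assumptions the results land close to the rounded $2.3 and $3.2 per chip-hour quoted above; a longer payback period or lower operational share would push both figures down.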
A 1T active param model consumes 2e18 FLOPs for 1M tokens. GB200 chips can do 5e15 FP8 FLOP/s or 10e15 FP4 FLOP/s. At $3.2 per chip-hour and 60% utilization (for prefill), this translates to $0.6 per million input tokens at FP8, or $0.3 per million input tokens at FP4. The API price for the batch mode of GPT-5 is $0.62 input, $5 output. So it might even be possible with FP8. And the 8T total params wouldn’t matter with GB200 NVL72, they fit with space to spare in just one rack/domain.
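As a sanity check on the prefill numbers, here is a minimal sketch of the same calculation. It assumes roughly 2 FLOPs per active parameter per token (which is what 2e18 FLOPs for 1M tokens at 1T active params implies) and takes the chip throughput, utilization, and chip-hour price from the text; the function name is hypothetical.

```python
def prefill_cost_per_million_tokens(active_params, chip_flops_per_s,
                                    chip_hour_usd, utilization=0.60):
    """Estimated at-cost USD price of processing 1M input tokens.

    Assumption: ~2 FLOPs per active parameter per token for a forward pass.
    """
    flops_per_token = 2.0 * active_params
    total_flops = flops_per_token * 1_000_000
    # Seconds of one chip's time at the given fraction of peak throughput.
    chip_seconds = total_flops / (chip_flops_per_s * utilization)
    return chip_seconds / 3600.0 * chip_hour_usd


# 1T active params on GB200 at $3.2 per chip-hour.
fp8 = prefill_cost_per_million_tokens(1e12, 5e15, 3.2)   # FP8 throughput
fp4 = prefill_cost_per_million_tokens(1e12, 10e15, 3.2)  # FP4 throughput

print(f"FP8: ${fp8:.2f} per 1M input tokens")
print(f"FP4: ${fp4:.2f} per 1M input tokens")
```

This only covers prefill. Output tokens are bandwidth-bound and depend on batch size, so they would need a different model than this compute-only estimate.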
This is an at-cost estimate, in contrast to the cloud provider prices. Oracle is currently selling 4-chip instances from GB200 at $16 per chip-hour. But it’s barely on the market for now, so the prices don’t yet reflect costs. And for example GCP is still selling an H100-hour for $8 (a3-megagpu-8g instances). So for the major clouds, the price of GB200 might end up only coming down to $11 per chip-hour in 2026-2027, even though the bare bones at-cost price is only $3.2 per chip-hour (or a bit lower).
I’m counting chips rather than GPUs to future-proof my terminology, since Huang recently proclaimed that starting with Rubin, compute dies will be considered GPUs (at March 2025 GTC, 1:28:04 into the keynote), so that a single chip will have 2 GPUs, and with Rubin Ultra a single chip will have 4 GPUs. It doesn’t help that Blackwell already has 2 compute dies per chip. This is sure to lead to confusion when counting things in GPUs, but counting in chips will remain less ambiguous.
Possibly an unlikely possibility, but could it be that different versions of GPT-5 (i.e., normal model, thinking model, and thinking-pro model) are actually of different sizes? Or do we know for sure that they all share the same architecture?
Alignment seems quite similar to the problem of imbuing AIs with artistic taste. Morality and taste are both hard to verify and subjective (or inter-subjective). Alignment has in practice the further difficulty that deception may play a role. I.e., even after managing to train moral principles into an AI system, you have to make sure they actually act as a guide for action.
That said, my very subjective impression is that AI is far ahead in terms of ethical taste compared to artistic taste. Perhaps this is thanks to the fact that alignment has been considered a core AI problem for a much longer time.
You missed “trying.” Succeeding requires certain capabilities, but trying to do it does not. I believe there’s much more risk from AIs not trying to do what the operator wants than trying and failing.
the observation—that aesthetics and morality are complementary—is very sharp. but i’m not sure what it has to do with alignment in particular.
it’s a bit like saying “the problem of getting ais to cultivate chickens is just a problem of getting them to cultivate eggs!”[1] true, of course, but more a fact about chickens and eggs than about ais.
People are very worried about a future in which a lot of the Internet is AI-generated. I’m kinda not. So far, AIs are more truth-tracking and kinder than humans. I think the default (conditional on OK alignment) is that an Internet that includes a much higher population of AIs is a much better experience for humans than the current Internet, which is full of bullying and lies.
All such discussions hinge on AI being relatively aligned, though. Of course, an Internet full of misaligned AIs would be bad for humans, but the reason is human disempowerment, not any of the usual reasons people say such an Internet would be terrible.
I feel good about this prediction so far. Instagram and TikTok now have a significant amount of AI-generated videos (though they haven’t overrun these platforms by any means). The categories I’ve seen so far are:
- Low-brow animated stories.
- Fantasy or sci-fi scenarios with music.
- Colorful AI-generated art.
- Cute meme animals.
The greatest sin of this content is that it’s often low quality. But it’s not really that great of a sin. I think, all things considered, AI slop is above average content. Other content often contains bullying, meanness, lies. AI-generated content rarely so.
Also, so far, this is mostly thanks to humans and to AI guardrails, not really due to the character of AIs as I expected in my initial quick take. It looks like humans are using this tech in mostly good-spirited ways so far.
There is also a significant category of “AI video passing itself off as a real video”, and many videos have people debating in the comments if it’s real or AI. This seems like it can erode trust and is generally negative.
So far, we have documented cases of Generative AI being used to subvert elections in Romania (actually causing an annulment), and to some extent in NYC. We also have this report by OpenAI from 2024, which details such influence operations facilitated using OAI’s API. Given the proliferation of significantly more capable open-source models in the 18 months since, we can be fairly confident that broader, more complex operations are taking place today.
I also tend to associate AI-slop with low quality content, but we know AI is more capable than that, which leads me to believe that significant amounts of content online are parts of influence operations by malicious actors.
Rationalists and Pause AI people on X are accusing Davidad of suffering from AI psychosis. I think it’s them who have lost the plot actually, not Davidad. The move here looks political, rather than truth-tracking. “Davidad is now my political opponent, so I’m accusing him of being crazy.” This happened to Emmett Shear too at some point.
I also strongly believe AI psychosis to be a far more limited phenomenon than people here seem to believe. I think you’re treating it as a good soldier in your army of arguments rather than investigating it truthfully for what it is.
I don’t think Davidad has AI psychosis (his views seem to be quite coherent and aligned with his long-standing views on moral realism). But I think he is quirky and maybe expressing his views in a deliberately provocative way. He implied in his thread on leaving ARIA that it was his equivalent of the Death with Dignity post, which was also on April Fool’s, and that thread definitely had some weird stuff (e.g. calling it “Alignment with Awakening”).
Most likely the OP means this quick take by Ivan Vendrov. Davidad ended up believing that the LLMs have been grokking the Natural Abstract Goodness which is unlikely to be capturable by existing benchmarks. While I do buy the idea that the NAG exists, I don’t think that I understand how one can check that the LLMs really understood it.
If we’re in a history simulation, I don’t think it’s unlikely that the simulators will just set us free in their reality. I’m expecting a more enlightened humanity to consider simulated humans moral patients.
There’s no real difference between a simulated human and a human. Both are causally-interacting signals in a computer. If you inhabit a simulation, you automatically inhabit base reality just as much as native base-reality beings. They’re also just interacting signals. It’s just that you’re embedded in some other software and they’re not, but that’s not fundamental at all.
If you thought simulated humans were moral patients, you might decide not to run history simulations. That being said, I don’t think that is an absolute rule.
What are they simulating us for? The reasons are pretty important for whether they’d want to treat us ethically. I have encountered only one reason I found really convincing.
To be clear, I generally disagree with most varieties of the simulation hypothesis. But nevertheless I do think that this question in particular has some good answers—a lot of which more-or-less reduce to ‘forecasting outcomes via an agent-based simulation’. Unfortunately that family of explanations wouldn’t give you much in the way of understanding the simulators, as it could be anything from “simulate outcomes of X choice in a copy of [the simulator’s] world” (which would tell you that they’d be very similar to us) to “consider this strange hypothetical where apes evolved intelligence, for purely academic reasons” (which would tell you much less). Alternatively there are some which veer far away from that, to e.g. entertainment, or pleasure of some kind (e.g. an elaborate form of pornography?), or even some kind of long-form training algorithm, and I do think that in those cases you can infer more about the motivations of the simulator.
If I spend one century on Earth and one millennium in heaven, why am I on Earth?
(SIA can answer this, but only by giving an account of why there’s much more total experience in the world if sims are resurrected than not. The most plausible means here are that the latter is correlated with cooperative norms prevailing across the cosmos generally, such that a much larger surplus is available for conscious experience in general. But then taking SIA seriously means embracing infinite populations and I get very confused about how to reason then.)
One slightly trollish answer is “someone figures out how to merge minds and this turns out to be a highly desirable thing to do, so most independent observer-moments are before the point where mind-merging is common”.
I’m pretty sure a human brain could, in principle, visualize a 4D space just as well as it visualizes a 3D space, and that there are ways to make that happen via neurotech (as an upper bound on difficulty).
Consider: we know a lot about how 4-dimensional spaces behave mathematically, probably no less than how 3-dimensional spaces work. Once we know exactly how the brain encodes and visualizes a 3D space in its neurons, we probably also understand how it would do it for a 4D space if it had sensory access to it. Given good enough neurotech, we could manually craft the circuits necessary to reason intuitively in 4D.
Also, another insight/observation: insofar as AIs can have imagination, an AI trained in a 4D environment should develop 4D imagination (i.e., the circuits necessary to navigate and imagine 4D intuitively). The same should be true about human-brain emulations in 4D simulations.
This argument seems to work for N-D space for any N which doesn’t seem right. I think we definitely do know less about 4D space than 3D, partly because we’re much more interested in 3D, partly because there’s just (a lot) more going on in 4D.
Intuitively it feels like current AI should be much better at learning navigation in 4D than human brains. Brains have real architecture-level, baked in task-specific circuits, which AI lack, and reconstructing a 3D world is arguably the most important of those. Sure, you could modify them with neurotech to change that, but you could do that for virtually any task so it doesn’t seem very meaningful.
There’s also the problem that human sensors are inherently 3D. It’s not clear how you would translate eyes into 4D. If you do pick a way to do this, and leave visual processing circuits the same, the circuits aren’t getting their expected data stream anymore. Brains are clearly pretty good at coping with this, like in blind people where visual processing circuits are (at least partially) co-opted for other things, but blind people are clearly worse at navigating the 3D world than sighted people, and it seems like the same would be true for humans vs 4D-native beings (like AI).
Ah, I get what you are saying, and I agree. It’s possible the human brain architecture, as-is, can’t process 4D, but I guess we’re mismatched in what we think is interesting. The thrust of my intuition here was more “wow, someone could understand N-D intuitively in a 3D universe, this doesn’t seem prohibited”, regardless of whether it’s the same architecture of a human brain exactly. Like, the human brain as it is right now might not permit that, and neurotech might involve doing a lot of architectural changes (the same applies to emulations). I suppose it’s a lot less interesting an insight if you already buy that imagining higher dimensions from a 3D universe is in principle possible. The human brain being able to do that is a stronger claim that would have been more interesting if I actually managed to defend it well.
I suppose I was kinda sloppy saying “the human brain can do that”—I should have said “the human brain arbitrarily modified” or something like that.
I definitely think it’s interesting that it’s possible for N-D-substrate-computations to imagine / intuit N+1-D, but yeah, I feel like that’s mostly a given because we have the concept of N+1-D in the first place.
There are different levels of “imagine / intuit” though. Some people have particularly good or bad intuition for the 3D space we live in. I took your claim to be something like “the average brain could intuit 4D just as well as 3D, maybe requiring slight modification”. I think the modifications to reach true parity would be pretty extensive, because of how much 3D-specific architecture (as opposed to weights) human brains have. I do agree the modifications are theoretically possible, but the modifications to give a fruit fly human-level cognition are also theoretically possible with arbitrary modification.
Thought experiment: If a mad scientist gave a newborn infant a third eye that was offset along a fourth spatial dimension from the baby’s other two eyes, the baby’s brain would naturally acquire the ability to visualize in four dimensions. Wiring up three eyes probably requires three visual cortices, which will have knock-on effects on the overall geometry of the brain. I doubt that it requires the brain itself to be a 4D structure though.
Has anyone proposed a solution to the hard problem of consciousness that goes:
Qualia don’t seem to be part of the world. We can’t see qualia anywhere, and we can’t tell how they arise from the physical world.
Therefore, maybe they aren’t actually part of this world.
But what does it mean they aren’t part of this world? Well, since maybe we’re in a simulation, perhaps they are part of the simulation. Basically, it could be that qualia : screen = simulation : video-game. Or, rephrasing: maybe qualia are part of base reality and not our simulated reality in the same way the computer screen we use to interact with a video game isn’t part of the video game itself.
We don’t see objects “directly” in some sense, we experience qualia of seeing objects. Then we can interpret those via a world-model to deduce that the visual sensations we are experiencing are caused by some external objects reflecting light. The distinction is made clearer by the way that sometimes these visual experiences are not caused by external objects reflecting light, despite essentially identical qualia.
Nonetheless, it is true that we don’t know how qualia arise from the physical world. We can track back physical models of sensation until we get to stuff happening in brains, but that still doesn’t tell us why these physical processes in brains in particular matter, or whether it’s possible for an apparently fully conscious being to not have any subjective experience.
At least I presume that you and others have subjective experience of vision. I certainly can’t verify it for anyone else, just for myself. Since we’re talking about something intrinsically subjective, it’s best to be clear about this.
We don’t see objects “directly” in some sense, we experience qualia of seeing objects. Then we can interpret those via a world-model to deduce that the visual sensations we are experiencing are caused by some external objects reflecting light. The distinction is made clearer by the way that sometimes these visual experiences are not caused by external objects reflecting light, despite essentially identical qualia.
I don’t disagree with this at all, and it’s a pretty standard insight for someone who thought about this stuff at least a little. I think what you’re doing here is nitpicking on the meaning of the word “see” even if you’re not putting it like that.
A UBI of AI or compute kind of makes sense to me under safe superintelligence. I don’t know if AI companies will still be incentivized to sell AI if they can just use it themselves, but a state isn’t under the same incentives. Instead of centralizing the use of its AI systems, it could distribute access to all citizens so they can use it productively themselves.
Why not just give people money so they can use it to pay for compute? I guess it’s a matter of which option you expect to lead to the most productive and freedom-promoting outcome for everyone in the long term. A UBI of AI access helps ensure that people receive a directly useful productive resource, and could reduce their dependence on cash transfers once they become rich enough from using AI productively.
This scenario assumes it’s possible for superintelligence to remain an assistant, on guardrails, to each individual.
Is anyone working on experiments that could disambiguate whether LLMs talk about consciousness because of introspection vs. “parroting of training data”? Maybe some scrubbing/ablation that would degrade performance or change answer only if introspection was useful?
An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.
A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore S-risk could be large.
This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It’s a waste that they haven’t been systematically deployed for the MIRI conversations. Those conversations could have ended up being more productive and we could’ve walked away with a succinct and precise understanding of where the disagreements are and why.
If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.
But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that’s less hopeless.
The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
I keep seeing absolutely terrible epistemics from like 50% of AI Safety. From people who previously seemed reasonable. This quick take was prompted by an example I just saw, from Connor Leahy: https://x.com/JoshWalkos/status/2021087240126976511
Accidental AI Safety experiment by PewDiePie: He created his own self-hosted council of 8 AIs to answer questions. They voted and picked the best answer. He noticed they were always picking the same two AIs, so he discarded the others, made the process of discarding/replacing automatic, and told the AIs about it. The AIs started talking about this “sick game” and scheming to prevent that. This is the video with the timestamp:
From the AIs’ messages seen in the video, it’s possible that he provided those instructions as a user prompt instead of a system prompt. I wonder if the same thing would’ve happened if they were given as the system prompt instead.
This experiment is pretty clever no? I don’t think a total AI amateur would discover it, either he’s been following along this problem for quite some time or he read about this somewhere recently or one of us AI safety nerds sponsored him. P=not sure though, it’s not beyond what people with an investigative mindset might come up with.
He mentions he’s just learned coding so I guess he had the AI build the scaffolding. But the experiment itself seems like a pretty natural idea, he literally likens it to a King’s council. I’m sure once you have the concept having an LLM code it is no big deal.
Scott Alexander left an important reply to Rob Bensinger on X. I happen to agree with Scott. Here’s the original post by Rob:
The reply by Scott Alexander:
I think that both of these posts seem very confused about the dynamics of who says or thinks what, and I’m pretty sad about these posts.
Thoughts on Rob’s post
In general, I’ll note that I don’t think Rob really knows many of the OP people; I suspect he has spent <40 hours talking to them about any of this possibly ever. (This is in contrast to e.g. Habryka.) I don’t know where he’s getting his ideas about what the OP people think, but he seems incredibly confused and ignorant. (Eliezer seems similarly ignorant about who believes what.)
I don’t really think this is true
I wish Rob would be clear who he was referring to. Dario has beliefs that seem to me very different from most people who worked on the 2022 AI misalignment risk efforts at Open Phil. (I’m thinking of people like Holden Karnofsky, Ajeya Cotra, Joe Carlsmith, Lukas Finnveden, Tom Davidson. I’ll refer to this as “OP AI people” despite the fact that none of them work at Coefficient Giving (which is what OP was renamed to).) Maybe Rob is talking about what Alexander Berger thinks?
I think both Dario and Open Phil staff have been reasonably honest about their beliefs about catastrophic misalignment risk publicly, I think that Dario genuinely thinks it’s <5% and the OP AI people generally think it’s higher. (Tbc I think Dario’s take here is very bad!)
This is a reasonable statement of (a simple version of) the Dario/Jared/Anthropic position, but not the OP AI person position. The OP AI people were worried about AI misalignment and ASI enough to try to think it through in detail starting many years ago!
This is not what the OP people think, e.g. see 1 2 3. It’s a reasonable description of what Dario/Jared say.
This is not what the OP people think. I think it’s somewhat reasonable to accuse Anthropic of this.
I’ve never felt any pressure to play down my concerns from the OP people. For example, I’ve been in a lot of discussions about whether it’s better for MIRI to be more or less powerful or influential. To me, the main argument that it’s bad for MIRI to be more influential isn’t that MIRI is making a mistake by openly saying that risk is high. It’s that MIRI has beliefs about x-risk that are wrong on the merits which lead them to making unpersuasive arguments and bad recommendations, and they’re in some ways incompetent at communicating.
And I think this is not very representative of what Ant thinks. E.g. they don’t really think of themselves as coordinating with other AI-safety-concerned people.
This is somewhere between “strawman” and “just totally confused as a description of what people believe”
Basically everything else in Rob’s post seems like a strawman.
Overall, I think this post is extremely confused, and Rob should be ashamed of writing such incredibly strawmanned things about what someone else thinks.
I recommend that people place very little trust in claims Rob makes about what other people believe. As someone who knows and talks regularly to the “Open Phil AI people”, I seriously think that Rob has no idea what he’s talking about when he ascribes arguments to them.
I guess there’s the question of what we are supposed to do if, in fact, the OP people agree with Rob’s version of their position but publicly deny that—at that point we’d have to do some brutal adjudication based on confusing private evidence or inferences from public actions and statements. I really don’t think that looking into that evidence would support Rob’s claims.
Thoughts on Scott’s post
I don’t really think of Rob or MIRI as having a comms strategy of undermining EAs. I think Rob and Eliezer just say a bunch of false, wrong things about EAs because they’re mad at them for reasons downstream of the EAs not agreeing with Eliezer as much as Eliezer and Rob think would be reasonable, and a few other things.
Some EAs engage in equivocation and shyness about their beliefs; OP AI people less than many others.
I think Dario (like various other Anthropic people) does not believe that AI takeover is a very plausible outcome, and I think his position is indefensible on the merits, as are some of his other AI positions (e.g. his skepticism that there are substantial returns to intelligence above the human level, his skepticism that ASI could lead to 2x manufacturing capacity per year). He moderately disagrees with the OP people about this.
I don’t totally understand what point Scott is trying to make here, but I think this point is quite unfair.
Agreed
I think Scott is blaming MIRI much too much here. Dario’s main difficulty when arguing that he thinks AI will pose huge catastrophic risk in the next few years is that lots of people think this seems implausible on priors, not because those people were specifically turned off by MIRI making related arguments earlier. His core audience has never heard of MIRI.
I think this is an incorrect read. Some people from PauseAI and MIRI criticize AI safety efforts a lot, often in ways I think are really dumb and counterproductive. But I don’t think they’re doing this as part of a strategy to force people into their strategies; it’s because of some combination of them genuinely (but perhaps foolishly) thinking that the other strategies are bad and/or the people executing them are corrupt.
I disagree with a lot of the claims here about how various aspects of the current situation are good. (E.g. why does he think that Ilya is doing an alignment effort?)
It’s unclear what “you guys” means. I think Pause AI is making a variety of bad strategic choices. I think that knifing other safety advocates is one bad strategic choice, but it’s more like a bad choice that is downstream of my main problems with them, rather than my core concern about them. I think Rob is totally unreasonable and I wish he would stop working on AI safety, but I think he’s much worse than e.g. MIRI is overall. I think MIRI spends very little of their support on knifing AI safety advocates, they spend almost all of it on advocating for people being scared about misalignment risk and advocating for AI pauses (which I am generally in favor of). Eliezer totally does have a hobby of saying ridiculously strawmanny stuff about OP AI people, which I find extremely annoying, but I don’t think it’s a big part of his effect on the world.
Overall, both posts seem to have substantially inaccurate pictures of what’s going on and what various actors think.
Thanks for writing this, Buck. I’m not going to try to reply to your whole post, because I think some of it is stuff I should chew on for longer and see whether I agree with it. But going through some of your points:
I definitely apologize for making it sound like I was making a harsher criticism of (the relevant parts of) EA than I intended. My tweet was originally written as a quick follow-up comment to someone who asked why I thought EA’s impact on AI x-risk was only ~55% likely to be positive. I turned it into a top-level tweet because I didn’t want to hide it deep in an existing discussion, but this was an error given I didn’t add extra context.
I also apologize for anything I said that made it sound like I was universally criticizing past or present Open Phil / cG staff (or centrally basing my views on first-hand conversations, for that matter). I already believed that tons of past and present rank-and-file OP/cG staff have very reasonable views, and I happily further update in that direction based on your and Oliver’s statements to that effect (e.g., Ollie’s “I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs”).
I agree that my characterization of “Dario and a cluster of Open-Phil-ish people” was phrased in a needlessly confusing and sloppy way. I wanted to talk about a mix of ‘present-day views that seem to be endorsed by Dario and some other key figures’ and ‘general tendencies and memes that seem pretty widespread and that seem suspiciously related to choices EA leadership made many years ago’, but blurring these together is really unnecessarily confusing. Also, it didn’t help that I was sarcastically embedding my criticisms into my summaries of the views.
Insofar as my broad criticism of EA cultural trends/memes is correct (which I think is substantial), I still feel a fair bit of uncertainty about how to divvy up responsibility between more Open-Phil-ish people, more Oxford-ish people, MIRI / the rats, etc. And of course, some of the problem may stem from broader social-or-demographic factors that no EA leaders tried to engineer, and that even go counter to how leadership has tried to optimize. (I too remember the early speeches themed around “Keep EA Weird”, the early EA-leader conversations fretting about overly naive EA consequentialism, etc.)
Thanks, this is helpful and I basically accept most of what you’re saying. Some more specific comments on the part about me:
I accept this criticism and take back my claim. I noticed that some people who worked for MIRI comms seemed to do this, and I assumed that anything said by enough MIRI comms people in a serious-sounding voice was on some level a MIRI communique. Eliezer has clarified that this isn’t true, so I apologize for saying it was.
I basically agree with this (while wanting to clarify that I think he assigns a pretty high risk to permanent dictatorship or something along those lines) but I think he’s done an okay job of navigating uncertainty, realizing that even a low chance of human extinction is very bad, and being willing to (somewhat) cooperate and collect gains-from-trade with people who are doomier than he is. I see him as living in a consistent worldview next door to our movement’s (sort of like Vitalik or Dean Ball) and I think that, like those two people, he’s potentially somewhere between a friend / an ally-of-convenience / a negotiating partner, potentially convertible into a full ally if future events prove us right, or into a true enemy if we pre-emptively alienate him. Having someone like this in charge of a frontier lab is better than I expected (Demis might also be in this category, but I’m not sure, and worry that Larry and Sergey have final say).
I agree that Dario is slightly being a jerk here, but I think that people have lots of stereotypes of “doomers” which derive from some real behavior of MIRI and PauseAI, and which wouldn’t exist if the median pause AI person was eg the median Constellation person, and I think Dario feels some understandable incentive to distance himself from this.
I have no useful knowledge here, but Ilya seems genuinely alignment-pilled and terrified, the fact that he did the very courageous and self-sacrificing thing of trying to blow up OpenAI to try to get rid of Altman for what were mostly safety-related reasons speaks well of him, and IDK, he’s calling it “safe superintelligence” and saying he won’t release anything at all until he’s sure. I don’t claim any secret expertise in Ilya-ology but overall all of this seems encouraging and I’m surprised this part of my tweet attracted so much dissent.
I mostly accept your criticism that I should narrow my objections from “MIRI & Co” to “Pause.AI, Rob, maybe sort of Eliezer, & a slightly different co”. I don’t really know how to do this or what one word covers all of them without inflicting different forms of collateral damage (I don’t want to say “PauseAIers” because that also covers some people I like, and it feels extra-aggressive to name specific names), but I’m open to suggestion.
I’m generally sympathetic to Scott’s positions in this discussion, but I think he is probably very wrong about Ilya.
To the best of my knowledge, Safe Superintelligence has never published a single word about what they plan to do to move alignment forward, which is pretty damning, in my opinion.
I have not heard of anyone known to be thoughtful about AI safety being hired by SSI, and I have not seen any position being advertised to AI safety people. People should correct me if I missed someone good joining SSI, but I think this is also a very bad sign.
My impression is that people who worked with Ilya at OpenAI don’t remember him as being particularly thoughtful about alignment, e.g. much less so than Jan Leike. This is a low confidence, third-hand impression, people can correct me if I’m wrong.
My impression is that the available evidence suggests that Ilya mostly took part in Altman’s firing for (perhaps justified) office politics grievances, and not primarily due to safety concerns. I also think that evidence points to his behavior during and after the incident being kind of cowardly. (I haven’t looked deeply into the details of the battle of the board, and it’s possible I’m wrong on this point, in which case I apologize to Ilya.) I’m also doubtful of how self-sacrificing his actions were—my best guess is that his current net worth is higher (at least on paper) than it would be if he stayed at OpenAI.
I expect that at some point SSI’s investors will grow impatient, and then SSI will start coming out with AI products (perhaps open-source to be cooler), just like everyone else. I don’t expect them to contribute too much to safety, though maybe Ilya will sometimes make some noises about the importance of safety in public speeches, which is nice I guess.
I’m pretty confident in my first two points, much less so in the next two, but I felt someone should respond to Scott on this point. Perhaps @Buck or someone else who expressed skepticism of Ilya’s project can add more information.
I think you are overfitting Rob’s post to be about the wrong people. I think it’s much closer to accurate, if you actually read what he says, which is:
I think the things Rob is saying still have some strawman-y nature to them, but I think they are reasonably accurate descriptors of Anthropic leadership, plus my best guesses of what Alexander (head of Coefficient Giving) and Zach (head of CEA) believe, which seems well-described by “Dario and a cluster of Open-Phil-ish people”, and furthermore also of course constitutes an enormous fraction of the authority over broader EA.
I feel like almost all of your comment is just running with that misunderstanding and hence mostly irrelevant.
As you say yourself, almost no one in your list works at cG, or is in any meaningful position of authority at cG, so this feels like a bit of an absurd interpretation (I think trying to apply the things he is saying to Holden is reasonable, given Holden’s historical role in cG, and I do think he in the distant past said things much closer to this, but seems to have changed tack sometime in the past few years).
A lot of Rob’s complaints are about things that happened in the past, so I don’t think it’s crazy to interpret him as talking about people who worked at CG in the past.
I think that these people believe different things, and I don’t think Rob’s post particularly accurately describes any of them. For example, the Anthropic leadership doesn’t really think of themselves as trying to coordinate with AI safety people or trying to suppress them. I don’t think Alexander thinks “AI is going to become vastly superhuman in the near future” (and fwiw I don’t think Dario thinks that either, he doesn’t seem to believe in returns to intelligence substantially above human-level).
(sending quickly, I might be wrong)
Fair enough. I think that the people you list also used to believe things closer to what Rob is saying in the past, so at least we need to do a consistent comparison. Holden from 10 years ago seems to say a lot of the things that Rob is saying here, and Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
My guess is that it is worth digging up quotes here, but it’s a lot of work, so I am not going to do it for now, but if it turns out to be cruxy, I can.
(Again, I don’t think these are centrally the people Rob is talking about in either case. I think centrally he is talking about Anthropic, and then secondarily talking about how Open Phil people have related to Anthropic over the years, but I do still think his criticism is correct directionally for those people)
I think Alexander abstractly believes that AI could very well become vastly superhuman in the near future, but yes, similar to Dario does not believe that speculating about such a thing in a non-scientific non-empirical way is appropriate, and as such they do not have coherent beliefs about this. Indeed, it seems like really a quite central match to what Rob is saying.
I don’t remember anything like this. I think it might be misremembered or a strained interpretation.
Here are points 1 and 3 for reference:
I asked ChatGPT to read bioanchors (where I thought this was most likely to occur), and then to read all of her other writings looking for anything that fits that mode. Here’s its reply, not finding anything.
The closest match it finds is that Ajeya often caveats her claims. For example from bio anchors:
I don’t think this matches points 1 or 3 well.
Huh, I am a bit confused about you summarizing that ChatGPT response that way. Maybe we are talking past each other, but Robby’s statements are not intended as the kind of statement that passes people’s ITT (which IMO is fine, frequently summaries of other people’s views should not pass their ITT, though it should ideally be caveated when this is going on).
Despite that, your ChatGPT transcript says:
I am not expecting any direct endorsements of these statements (which are phrased as to make their internal contradictions most obvious), so this ChatGPT response seems compatible with what I am saying?
When I asked ChatGPT to “rephrase these two beliefs in more neutral language that would make more sense for someone to endorse (but try to pretty tightly imply the above)” it gave these two:
When I asked ChatGPT about this framing, it said:
But also, when we are in the domain of “evaluate whether Ajeya said things that imply the things above and result in other people getting the same vibe as the above”, then ChatGPT and Claude seem like much worse judges, so I think this question becomes more difficult to answer and I wouldn’t super defer to the language models (and is part of why I expected it would take a while to dig up quotes and do the work and stuff).
(If you want to complain that Robby should have caveated his stuff more as not being the kind of thing that passes people’s ITT, then I am happy to argue about that. I think a better post would have done it, but it’s not something I think is always necessary to do.)
(Also just for the sake of completeness, I don’t get this vibe from Ajeya at all these days and have no complaints on this front, besides probably still some strategic disagreement on stuff around point 3, but like at the level that I have with many people I respect almost certainly including you)
When you wrote:
I interpreted you as claiming that Ajeya had said “things more like:”
I don’t recall any examples of Ajeya saying or implying anything at all like that. I asked ChatGPT to try to find examples and I think it didn’t find anything.
In your ChatGPT session, a typical example it cites is:
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.
I have no interest in defending ChatGPT’s claims here, and feel like I caveated that pretty explicitly. I agree that quote is largely irrelevant.
Yep, I agree with you that ChatGPT did not find any clear quotes (though it doesn’t look like ChatGPT tried very hard to find quotes). I disagree that it didn’t find “anything at all like that” (indeed ChatGPT is quite explicit that it found some things “kind of like that”).
I do. As I said, I could go and dig them up but it would take quite a while, and I am only like 75% confident they are written up as opposed to conversations, or private Google Docs or something that I would have trouble finding. It was a strong vibe I got at the time and I remember having a few conversations about adjacent conversations either with Ajeya or being about Ajeya.
Let me know if you want me to do this. I don’t quite know what’s at stake here for you, and I feel somewhat like we are talking past each other and before I do that it would be more productive to go up some meta-level, but I am not quite sure.
I think you’re right, and also it seems misleading / like a bad clustering to lump “the EAs” in with “Anthropic’s leadership”. I think those groups have some memetic connections, but they’re not the same group!
(I feel like it’s more of a reasonable carving to lump in OpenPhil with “the EAs”, since they were/are effectively EA thought-leaders and they exerted a lot of influence, directly and indirectly.)
More than 50% of the talent-weighted safety people in EA are literally employees of Anthropic! The ex-CEO of Open Phil now works at Anthropic, and is married to one of its founders. These groups have enormous overlap.
Like, there is such enormous overlap, and the overlap results in such an enormous amount of de-facto deference (being an employee of a company is approximately the strongest common deference relationship we have) that it makes sense to think of these as closely intertwined.
Yes, there are people who attach the EA label themselves who are different here, sometimes even quite substantial clusters. But it’s also IMO clear from Scott’s response that he himself is also majorly deferring and is majorly supportive of Anthropic as a representative of EA, so this clearly isn’t just a split between “everyone who works at Anthropic and everyone who doesn’t”.
Rob used “Open Phil” exactly two times. One time saying “a cluster of Dario and Open-Phil-ish people” and another time “EAs / Open Phil” in reference to the broader community that includes all of these things. These seem like totally reasonable ways of using these pointers and words. I don’t have anything better. It’s definitely not “just Anthropic” as I think Scott very unambiguously demonstrates, and it would be of course extremely confusing to refer to Scott as “Anthropic”.
Imagine re Open Phil and hardcore rationalists “the ex-CEO of MIRI now works at Open Phil, and the CEO of Lightcone is dating an Open Phil employee. These groups have enormous overlap.”
Yes. People can have a lot of social overlap, yet have very different views from one another, especially in the broader Bay Area intellectual ecosystem. My sense is that Anthropic leadership has very different views from most AI safety EAs.
Why do you think this? I’m skeptical this is true, especially if you’re including non-technical talent.
IDK, I counted them? I made some spreadsheets over the years, and ran this number by a bunch of other people, and my current guess is that it’s around 55%? When I list organizations with full-time employees working in safety I actually end up at substantially above 50% of people working at Anthropic, but I think that’s overcounting.
I think there are differences and overlaps. I think Rob points to a thing that is shared across a cluster that spans both of them, and has historically had a lot of influence.
But aren’t Alexander Berger’s views not very relevant about OpenPhil’s AI strategy decisions from many years ago when their AI strategy and worldview—which I take to be very close to the things Rob was criticizing—were worked out and started shaping the views of EAs in OpenPhil’s orbit?
Even now, when people criticize things OpenPhil has done in the past in the AI landscape, or criticize their general worldview and takes on AI risk (as it was developed in influential pieces of writing), I am by default automatically viewing it as criticism of Holden, Ajeya Cotra, Tom Davidson, Joe Carlsmith, etc. If people don’t intend me to interpret them that way, please be more clear. 🙂
I’m aware that, separately, OpenPhil/Coefficient Giving has undergone quite a transition and that you clashed badly with Dustin M. I think that’s very sad and unfortunate, but I think of these as quite distinct things and I never assumed that the thing with Dustin M. had anything to do with OpenPhil’s AI strategy decisions from (say) five years ago (edit: sorry that sounds like a strawman, but I mean something like “I’m not sure the same cause explains why some people who were at OpenPhil in the past found MIRI epistemically off-putting, and why Dustin M finds the rationalists to be a reputation risk & thinks reputation risks are unusually bad compared to other bad things.”) I could be wrong, of course, and maybe you think the org has a general tendency to value “reputability” and “playing politics” too much. I just want to note that it’s not obvious how much these things are connected/caused by one “OpenPhil culture,” vs being about distinct things. (I think some of these are maybe directionally accurate as criticism, btw.)
I’m sure this is obvious to everyone involved, but I also just want to point out that when a lot of senior people leave, organizations can change really a lot, so it would be weird to speak of OpenPhil/Coefficient Giving now as though it were obviously still the same entity/culture.
I think Holden at the time believed something closer to what Rob says here (though it’s still not an amazing fit), and more generally, I think “the beliefs of the successor CEO” are actually a better proxy for “the vibes of the broader ecosystem you are part of” than “the beliefs of the founder CEO”. I could go into more detail on my beliefs on this, though I think the argument is reasonably intuitive.
Yep, I think they are highly related. Indeed, I was predicting things like the Dustin thing without any knowledge of Dustin’s specific beliefs, and my predictions were primarily downstream of seeing how Anthropic’s position within the ecosystem was changing, and a broader belief-system that I think is shared by many people in leadership, not just Dustin.
I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs, but also updated that those things end up mattering surprisingly little for what actually ends up a strategic priority.
I think the “OpenPhil culture” thing is a distraction. In my model of the world most of this is downstream of people being into power-seeking strategies mostly from a naive-consequentialist lens, which is not that unique to OpenPhil within EA (and if anything OpenPhil has some of the people with the best antibodies to this, though also a lot of people who think very centrally along these lines, more concentrated among current leadership).
What do you mean by this?
I think some of the people who are best at thinking independently about stuff, and are pretty good at not getting swept up in the power-seeking stuff, work at Open Phil. I think Holden genuinely helped with some of the correct cultural pieces, and my current belief is that if he wasn’t under the most pressure that anyone is, he would probably have a relatively sane relationship to Anthropic as a result of it, though I am not as confident about that as I am that he had a bunch of quite good cultural pieces that help people be less naively power-seeking here.
Pause AI is a global activist movement with many chapters, and members with a mix of opinions, with some voices louder than others.
I’ve been volunteering there full time for a couple of years. I’m someone who cares a lot about partnership and the ecosystem of AIXR organizations. (I reckon not being killed by superintelligence is helped by pursuing a portfolio of bets that are mostly disjunctive and individually low odds.)
Buck, I would be really interested to hear more about your concrete concerns with Pause AI. By all means link a previous account if one exists.
Happy to discuss in public or private.
Honestly, this is such a bad reply by Scott that I… don’t quite know whether I want to work on all of this anymore.
If this is how this ecosystem wants to treat people trying their hardest to communicate openly about the risks, and who are trying to somehow make sense of the real adversarial pressures they are facing, then I don’t think I want anything to do with it.
I have issues with Rob’s top-level tweet. I think it gets some things wrong, but it points at a real dynamic. It’s kind of strawman-y about things, and this makes some of Scott’s reaction more understandable, but his response overall seems enormously disproportionate.
Scott’s response is extremely emblematic of what I’ve experienced in the space. Simultaneous extreme insults and obviously bad faith arguments (“actually, it’s your fault that Deepmind was founded because you weren’t careful enough with your comms”), and then gaslighting that no one faces any censure for being open about these things (despite the very thing you are reading being extremely aggro about the lack of strategic communication), and actually we should be happy that Ilya started another ASI lab, and that Jan Leike has some compute budget.
The whole “no you are actually responsible for Deepmind” thing, in a tweet defending that it’s great that all of our resources are going into Anthropic, is just totally absurd. I don’t know what is going on with Scott here, but this is clearly not a high-quality response.
Copying my replies from Twitter, but I am also seriously considering making this my last day. It’s not the kind of decision to be made at 5AM in the morning so who knows, but seriously, fuck this.
IMO this doesn’t seem like the kind of response you will endorse in a few days, especially the “You are responsible for Deepmind/OpenAI” part.
You were also talking about AI close to the same time, and you’ve historically been pretty principled about this kind of stance.
Robby at least has been very consistent on this that he is against most forms of strategic communication in general.
I also think you are against many forms of strategic communication in general? Your writing explores many of the relevant considerations in a lot of depth, and you certainly have not shied away from sharing your opinion on controversial issues, even when it wasn’t super clear how that is going to help things.
I think you are just arguing the wrong side of this specific argument branch. My model of Eliezer, Nate and Robby all have been pretty consistent that being overly strategic in conversation usually backfires. Of course you shouldn’t have no strategy, and my model of Eliezer in-particular has been in the past too strategic for my tastes and so might disagree with this, but I am pretty confident Robby himself is just pretty solidly on the “it’s good to blurt out what you believe, *especially* if you don’t have any good confident inside view model about how to make things better”.
I feel like we both know this is a strawman. The key thing at least in recent years that Rob, Eliezer and Nate have been arguing for is the political machinery necessary to actually control how fast you are building ASI, and the ability to stop for many years at a time, and to only proceed when risks actually seem handled.
If anything, Eliezer, Nate and Robby have been actively trying to move political will from “a pause right now” to “the machinery for a genuine stop”.
This makes this comparison just weird. Yes, according to everyone’s models the only time you might have the political will to stop will be in the future. I have never seen Nate or Eliezer or Robby say that they expect to get a stop tomorrow. But they of course also know that getting in a position to stop takes a long time, and the right time to get started on that work was yesterday.
So if they had their way (with their present selves teleported back in time), we would have more draft treaties, more negotiation between the U.S. and China, and more materials ready to hand to congresspeople who are trying to grapple with all of this stuff. Essays and books and movies and videos explaining the AI existential risk case straightforwardly to every audience imaginable.
That is what you could do if you took the 200+ risk-concerned people who ended up instead going to work at Anthropic, or ended up trying to play various inside-game politics things at OpenAI.
And man, I don’t know, but that just seems like a much better world. Maybe you disagree, which is fine, but please don’t create a strawman where Robby or Nate or Eliezer were ever really centrally angling for a short-termed pause that would have already passed by-then.
And then even beyond that, I think if you don’t know how to solve a problem, it is generally the virtuous thing to help other people get more surface area on solving it. Buying more time is the best way to do that, especially buying time now when the risks are pretty intuitive. I think you believe this too, and I don’t really know what’s going on with your reaction here.
Come on man, a huge number of people we both respect have recently updated that the kind of direct advocacy that MIRI has been doing has been massively under-invested in. I do not think that “other people are executing this portfolio plan admirably”, and this is just such a huge mischaracterization of the dynamics of this situation that I don’t know where to start.
“If Anyone Builds It, Everyone Dies” is a straightforward book. It doesn’t try to sabotage every other strategy in the portfolio, and I have no idea how you could characterize really any of the media appearances of Nate this way.
This is of course in contrast to Open Phil defunding almost everyone who has been pursuing this strategy and making mine and tons of other people’s lives hell, and all kinds of complicated adversarial shit that I’ve been having to deal with for years, where absolutely there have been tons of attempts to sabotage people trying to pursue strategies like this.
Like man, we can maybe argue about the magnitude of the errors here, and the sabotage or whatever, but trying to characterize this as some kind of “Nate, Eliezer, Robby are defecting on other people trying to be purely cooperative” seems absurd to me. I am really confused what is going on here.
I am sympathetic to the first of these (but disagree you are characterizing Dario here correctly).
But come on, clearly Ilya sitting on $50 billion for starting another ASI company is not good news for the world. I don’t think you believe that this is actually a real ray of hope.
(And then I also don’t think that Jan Leike having marginally more compute is going to help, but maybe there is a more real disagreement here)
Overall, I am so so so tired of the gaslighting here.
Please don’t quit, Oliver.
Unless you mean “making this my last day [on twitter]”, which might or might not be a good idea.
I don’t think Scott speaks for the ecosystem. He’s just a guy in it, and one who isn’t even that closely connected to Anthropic or Coefficient Giving people. (E.g. you spend >10x as much time talking to people from those orgs as he does.) I think that the people in the ecosystem you’re criticizing would not approve of Scott’s post.
I think this is not a good summary of what Coefficient Giving has done. (I do think it really sucks that they defunded Lightcone.)
I think this is false. I expect Scott’s post, if it was posted to the EA Forum, to be heavily upvoted and to have an enormously positive agree/disagree ratio, and in general I expect people to believe something pretty close to it.
There are a few exceptions (somewhat ironically a good chunk of the cG AI-risk people), but they would be relatively sparse. I think this is roughly what someone who is smart, but doesn’t have a strong inside-view take about what they should do about AI-risk believes that they should act like if they want to be a good member of the EA community. My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
The issue is of course not that Scott is right or wrong about what Anthropic or cG people believe. The issue is that he seems to be taking a view where you should be super strategic in your communications, sneer at anyone who is open about things, and measure your success in how many of your friends are now at the levers of power.
I think cG’s funding decisions were really very centrally about trying to punish people who weren’t being strategic in their communications in the way that Dustin wanted them to be strategic in their communications.
I think other “all kinds of complicated adversarial shit” has also happened, though it’s harder to point to. At a minimum I will point to the fact that invitation decisions to things like SES have followed similar adversarial “you aren’t cooperating with our strategic communications” principles.
The EA Forum is a trash fire, so who knows what would happen if this was published there.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me (which is ofc part of why I sometimes bother commenting on things like this).
I think that Scott’s post would not overall be received positively by those people. Maybe you’re saying that one of the directions argued for by Scott’s post is approved of by those people? I agree with that more.
Well, I mean, that is a hard conditional to be false since if people were to not change their mind, this would largely invalidate the premise that they are inclined to defer to you. Unfortunately, I both think the vast majority of places in EA do not defer to you or people like you, and furthermore, I also think you are pretty importantly wrong about your criticisms, so I don’t quite know how to feel about this.
I do think it helps and am marginally happy about your cultural influence here (though it’s tricky, I also think a bunch of your takes here are quite dumb). I think the vast majority of the cultural influence here is downstream of not quite anyone in-particular, but more Anthropic than anywhere else, and neither you nor me can change that very much.
Yeah, I expect it to be straightforwardly positively received. I think people will be like “some parts of this seem dumb, the Ilya thing in-particular, but yeah, fuck those rationalists and MIRI people, I am with Scott on that”.
To be clear, I am not expecting consensus here, I think this will be what 75% of people who have any opinion at all on anything adjacent to this believe, but I expect people would broadly think it’s a good contribution that properly establishes norms and reflects how they think about things.
I also think it’s plausible people would be like “wow, what an uncouth way that both of these people are interfacing with each other, please get away from each other children”, but then actually if you talked to them afterwards, they would be like “yeah, I mean, that was a bit of a shitshow but I do think Scott was basically right here (minus 1-2 minor things)”.
I am not enormously confident on this, but it matches my experiences of the space.
In case it matters to either of you, my guesses:
I agree with Habryka that absent criticism Scott’s post would be well received by an important group of people reasonably characterized as EA-ish AI safety people.
Imo absent criticism Rob’s post would be well received by a different group of people reasonably characterized as doomers. (Literally right before seeing this thread I saw another post on LW that is directionally correct but is mostly wrong or exaggerated in its details, and that was very well received.)
Both posts are broadly wrong about lots of things, about equally so, such that most people would be better off having never encountered either of them.
Tbc, my first-order intuitive impression is that Scott’s post is much more directionally accurate. But I expect that is because I constantly experience people knifing me, pushing me to take strategies that systematically destroy my ability to do anything while gaining approximately no safety benefit, or making claims about members of groups that include me that are false of me, whereas I don’t really experience any of the stuff that Rob gestures at, even though I expect it exists. Though Rob’s post doesn’t actually inform me of it, because his actual claims are false, and I cannot infer the underlying experiences that led him to make them. Another example of trapped priors if you don’t have second order corrections. (Tbc his follow-up post makes this substantially clearer.)
You probably already know I think this, but imo you should both quit on making public discourse in the AI safety community non-insane, and do other things that have a shot at working. (Since I know this will be misinterpreted by other readers, let me be clear that there are plenty of other kinds of public writing that do not fall in that bucket which I do think are worth doing.)
I endorse you taking the space to figure out how you want to relate and doing what’s right for you, I’ve increasingly updated to thinking that people doing things they’re not wholeheartedly behind tends to be net bad in all sorts of sideways ways, but the effort would be weaker for your loss. Wherever you end up, I appreciate you having taken the strategy of speaking in public about things that usually aren’t in a way that helped clarify the strategic situation for me many times.
(also, it’s scary to see three of the people I’d put in the upper tiers of good communication and understanding where we’re at with AI technically get into this intense conflict. I’m going to be thinking on this some and seeing if anything crystalizes which might help specifically, but in the meantime a few more general-purpose posts that might be useful memes for minimizing unhelpful conflict are A Principled Cartoon Guide to NVC, NVC as Variable Scoping, and Why Control Creates Conflict, and When to Open Instead)
Locally trying to clear up one misunderstanding.
I think Scott’s “couple more years” wasn’t referring to a belief that EA could have successfully advocated for a pause of a couple of years, but rather referring to the change in timeline you’d have gotten if safety-sympathetic people refused to work on stuff that increases the pace of capabilities progress.
Oh, I see. That makes sense, I agree I misunderstood this part to be about something else (though I disagree similarly strongly with the correct interpretation, but it’s still good to clear that up).
I really don’t think Scott is gaslighting you. I think Scott is being honest here, but you should model him as having somewhat snapped. Pause AI and MIRI-adjacent people on X have been extremely adversarial and have been contributing to very bad discourse (even arguments-wise). I think Scott saw Rob’s post as very strawmannish and needlessly adversarial, and he more or less correctly lumped it in with this rising tide of terribleness, even if MIRI itself is definitely not as guilty. I might well be wrong about the specifics, but Scott Alexander isn’t the kind of person who tends to gaslight.
I think you need to be a lot more deflationary about the g-word. If you think, “But ‘gaslighting’ is something Bad people do; Scott Alexander isn’t Bad, so he would never do that”, well, that might be true depending on what you mean by the g-word. But if the behavior Habryka is trying to point to with the word is more like, “Scott is adopting a self-serving narrative that minimizes wrongdoing by his allies and inflates wrongdoing by his rivals” (which is something someone might do without being Bad due to having “somewhat snapped”), well, why wouldn’t the rivals reach for the g-word in their defense? What is the difference, from their perspective?
“Gaslighting” should probably be avoided because it is anywhere between meaningless and a fighting word depending on who says it and how.
The g-word is a very nasty accusation. It gets thrown around and means a bunch of stuff down to just “saying stuff I disagree with”, but it shouldn’t.
It is originally a conscious, malicious attempt to drive someone insane by strategically lying to them.
On the substance, people are honest but wrong an awful lot, and honest but massively overstating their case even more often. Assuming your rivals are malicious or dishonest when they’re just wrong or overstating is a huge source of conflict and thereby confusion.
It’s a really useful pointer towards a tactic that is relatively widespread and has no better word. I am personally happy to use other words, but I have the sense that sentences like “I am so very very tired of the ambiguous but ultimately strategic enough attempts at undermining my ability to orient in this situation by denying pretty clearly true parts of reality combined with intense implicit threats of consequences if I indicate I believe the wrong thing that might or might not be conscious optimizations happening in my interlocutors but have enough long-term coherence to be extremely unlikely to be the cause of random misunderstandings” wouldn’t work that well.
Yeah I would call that “gaslighting”. It looks like my initial interpretation of what you meant by it is closer than Zack’s. I think Scott isn’t doing that. I’m inclined to believe you when you say other people have behaved this way.
Everything makes sense when you meditate on how the line between “cooperation” and “defection” isn’t in the territory; it’s a computed concept that agents in a variable-sum game have every incentive to “disagree” (actually, fight) about.
Consider the Nash demand game. Two players name a number between 0 and 100. If the sum is less than or equal to 100, you get the number you named as a percentage of the pie; if the sum exceeds 100, the pie is destroyed. There’s no unique Nash equilibrium. It’s stable if Player 1 says 50 and Player 2 says 50, but it’s also stable if Player 1 says 35 and Player 2 says 65 (or generally n and 100 − n, respectively).
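To make the "no unique equilibrium" point concrete, here is a minimal sketch that brute-forces the pure-strategy equilibria of the game as described above. It assumes demands are restricted to whole numbers from 0 to 100 (the description only says "a number between 0 and 100"), and the names `PIE`, `payoffs`, and `is_nash` are made up for illustration.

```python
from itertools import product

PIE = 100  # total pie, in percentage points (assumption: integer demands only)


def payoffs(a, b):
    """Return (Player 1 payoff, Player 2 payoff) for demands a and b."""
    if a + b <= PIE:
        return a, b  # compatible demands: each player gets what they named
    return 0, 0  # incompatible demands: the pie is destroyed


def is_nash(a, b):
    """True if neither player can strictly gain by changing only their own demand."""
    u1, u2 = payoffs(a, b)
    demands = range(PIE + 1)
    best1 = max(payoffs(d, b)[0] for d in demands)  # Player 1's best reply to b
    best2 = max(payoffs(a, d)[1] for d in demands)  # Player 2's best reply to a
    return u1 >= best1 and u2 >= best2


# Enumerate every pure-strategy profile and keep the stable ones.
equilibria = [(a, b) for a, b in product(range(PIE + 1), repeat=2) if is_nash(a, b)]
```

Under these assumptions, every split of the form (n, 100 − n) shows up in `equilibria`, including (50, 50) and (35, 65). A profile like (40, 50) does not, since either player could demand more without destroying the pie. The search also turns up the degenerate profile (100, 100), where the pie is destroyed and neither player can fix that alone.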
The secret is that there are no natural units of pie (or, equivalently, how much pie everyone “deserves”). Everyone thinks that they’re being “cooperative” and that their partners are “defecting”, because they’re counting the pie differently: Player 1 thinks their slice is 35%, but Player 2 thinks the same physical slice is 65%.
If you don’t think your partner is treating you fairly, your leverage is to threaten to destroy surplus unless they treat you better. That’s what Alexander is doing when he says, “I would like to support it with praxis, but right now I feel very conflicted about this”. He’s saying, “You’d better give me a bigger slice, Player 1, or I’ll destroy some of the pie.”
That’s also what your brain is doing when you say you don’t want to work on this anymore. Scott doesn’t want you to quit! (Partially because he values Lightcone’s work, and partially because it would look bad for him if you can publicly blame your burnout on him.) Crucially, your brain knows this. By threatening to quit in frustration, you can probably get Scott to apologize and give your arguments a fairer hearing, whereas in the absence of the threat, he has every incentive to keep being motivatedly dumb from your perspective.
You have a strong hand here! The only risk is if your counterparties don’t think you’d ever actually quit and start calling your bluff. In this case, we know Scott is a pushover and will almost certainly fold. But if you ever face stronger-willed counterparties, you might need to shore up the credibility of your threat: conspicuously going on vacation for a week to think it over will get taken more seriously than an “I don’t know if I want to do this anymore” comment.
(Sorry, maybe you already knew all that, but weren’t articulating it because it’s not part of the game? I don’t think I’m worsening your position that much by saying it out loud; we know that Scott knows this stuff.)
Man, I really wish this was the case, and it’s a non-zero part of what is going on, but the vast majority of what I am expressing with my (genuine) desire to quit is the stress and frustration associated with the gaslighting, which is one level more abstract than the issue you talk about.
Like yes, there is a threat here being like “for fuck’s sake, stop gaslighting or I am genuinely going to blow up my part of the pie”, but it’s not actually about the object level, and I don’t actually have much of any genuine hope of that working in the same way one might expect from a negotiation tactic.
I am just genuinely actually very tired, and Scott changing his mind on this and going “oh yeah, actually you are right” actually wouldn’t do much to make me want to not quit, because it wouldn’t address the continuous gaslighting where every time anyone tries to talk about any of the adversarial dynamics, they immediately get told this is all made up and get repeated “I haven’t seen EAs (other than SBF) do a lot of lying, equivocating, or even being particularly shy about their beliefs” and “everyone is being honest all the time and actually it’s just you who is lying right now and always”.
Yeah, the frustrating part is almost always on a meta level. I think Zack’s point about “No natural units of pie” applies to the gaslighting issue as well though. Asserting one’s viewpoint means asserting it as truth which invalidates differing perspectives. “I disagree, you contradict, he gaslights”.
It’s difficult because sometimes the gas lights really don’t seem to be dimming, and sometimes that perception is downstream of some motivated thinking because I really don’t want to believe we’re running out of oil already, dammit. And so the result is simultaneously kinda an honest statement of perspective (at least, as honest as these tend to get) while also being a (not-necessarily-consciously) motivated action pushing people to disregard their own senses. And then we have to decide how to judge this mess of bias and honesty, and if we don’t judge such that, after a round trip of perceiving C/D and responding accordingly, we get more C than last time… shit’s fucked. And without objective units of pie that people can agree on when judging who was in the wrong.
So like… am I trying to gaslight people into questioning their own sanity so they accept what I want them to accept, or am I just flinching away from what scares me, like we all do? Both, and the question of whether I deserve the leniency and empathy is a difficult one, because what are the units of this pie and where’s the objective cutoff? And because our tolerance for further bullshit tends to diminish after accumulating bullshit, so it gets even more difficult to get back to the other side of criticality.
“It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.”
Theodore Roosevelt, “Citizenship in a Republic,” speech at the Sorbonne, Paris, April 23, 1910
I wrote a reply to Scott on Twitter, before seeing the discussion here; I think it’s a lot clearer than my original (IMO sloppy) tweet.
I’ve copied the reply below; see also my reply to Buck.
_____________________________________________________
To clarify the claim I’m making: I’m not trying to throw EA under a bus. This thread spun off from a discussion where I said I thought EA’s net impact on AI x-risk was probably positive, but I was highly uncertain.
Somebody asked what the bad components of EA’s impact were, and I went off on Anthropic, and on EA’s (and especially OpenPhil’s) entanglement with the company and their support for Anthropic’s operations. (To the extent that a lot of x-risk-adjacent EA seems to function, in practice, as a talent pipeline for Anthropic.)
I also said that I think OpenPhil’s bet on OpenAI was a disaster. And I said that there’s a culture of caginess, soft-pedaling, and trying-to-sound-reassuringly-mundane that I think has damaged AI risk discourse a fair amount, and that various people in and around OpenPhil have contributed to.
I’m restating this partly to be clear about what my exact claims are. E.g., I’m not claiming that items 1+2+3 are things OpenPhil and Anthropic leadership would happily endorse as stated. I deliberately phrased them in ways that highlight what I see as the flaws in these views and memes, in the hope that this could help wake up some people in and around OpenPhil+Anthropic to the road they’re walking.
This may have been the wrong conversational tack, but my vague sense is that there have been a lot of milder conversations about these topics over the years, and they don’t seem to have produced a serious reckoning, retrospective, or course change of the kind I would have expected.
I hoped it was obvious from the phrasing that 1-3 were attempting to embed the obvious critiques into the view summary, rather than attempting to phrase things in a way that would make the proponent go “Hell yeah, I love that view, what a great view it is!” If this confused anyone, I apologize for that.
I wasn’t centrally thinking of Holden’s public communication in the OP, though I think if he were consistently solid at this, Aysja Johnson wouldn’t have needed to write this in response to Holden’s defense of Anthropic ditching its core safety commitments.
I feel like this is a case in point. Like, sure, counting up from 0 (“the average corporation building the average product doesn’t try to warn the public about their product, except in ways mandated by law!”), Anthropic’s doing great. Or if the baseline is “is Anthropic doing better than pathological liar Sam Altman?”, then sure, Anthropic is doing better than OpenAI on candor.
If we’re instead anchoring to “trying to build a product that massively endangers everyone in the world is an incredibly evil sort of thing to do by default, and to even begin to justify it you need to be doing a truly excellent job of raising the loudest possible alarm bells alongside dozens of other things”, then I don’t think Anthropic is coming close to clearing that bar.
“Things go really, really badly”? Nobody outside the x-risk ecosystem has any idea what that means. And this is not the kind of claim Anthropic or Dario has ever tried to spotlight. You won’t find a big urgent-looking banner on the front page of Anthropic loudly warning the public, in plain terms, about this technology, and asking them to write their congressman about it. You won’t even find it tucked away in a press release somewhere. Dario gave a number when explicitly asked, in an on-stage interview.
If we’re setting the bar at 0, then maybe we want to call this an amazing act of courage, when he could have ducked the question entirely. But why on earth would we set the bar at 0? Is the social embarrassment of talking about AI risk in 2025 so great that we should be amazed when Dario doesn’t totally dodge the topic, while running one of the main companies building the tech?
I think Dario has been more reasonable on this issue than Gary Marcus. I also don’t think “clearing Gary Marcus” is the criterion we should be using to judge the CEO of Anthropic.
Specifically, this debate (from my perspective) isn’t about whether Anthropic or others have ever said anything scary-sounding, if an x-risk person goes digging for cherry-picked quotes to signal-boost. The question is whether the average statement from Anthropic, weighted by how visible Anthropic tries to make that statement, is adequate for informing the uninformed about the insane situation we’re in.
Is the average statement from Dario or Anthropic communicating, “Holy shit, the technology we and our competitors are building has a high chance of killing us all or otherwise devastating the world, on a timescale of years, not decades. This is terrifying, and we urgently call on policymakers and researchers to help find a solution right now”? Or is it communicating, “Mythos is our most aligned model yet! ☺️ Powerful AI could have benefits, but it could have costs too. AI is a big deal, and it could have impacts and pose challenges! We are taking these very seriously! Also, unlike our competitors, Claude will always be ad-free! We’re a normal company talking about the importance of safety and responsibility in this transformative period. ☺️”
(Case in point: https://x.com/HumanHarlan/status/2031981447377273273)
If Anthropic’s messaging were awful, but Dario’s personal communications were reliably great, then I’d at least give partial credit. But Dario’s messaging is often even worse than that. Dario has been the AI CEO agitating the earliest and loudest for racing against China. He’s the one who’s been loudest about there being no point in trying to coordinate with China on this issue. “The Adolescence of Technology” opens with a tirade full of strawmen of what seems to be Yudkowsky/Soares’ position (https://x.com/robbensinger/status/2016607060591595924), and per Ryan Greenblatt, the essay sends a super misleading message about whether Anthropic “has things covered” on the technical alignment side (https://x.com/RyanPGreenblatt/status/2016553987861000238):
I also strongly agree with Ryan re:
“I think it’s important to emphasize the severity of outcomes and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.”
“I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).”
(And I claim that Anthropic leadership has been doing this for years; “The Adolescence of Technology” is not a one-off.)
On podcast interviews, Dario sometimes lets slip an unusually candid and striking statement about how insane and dangerous the situation is, without couching it in caveats about how Everything Is Uncertain and More Evidence Is Needed and It’s Premature For Governments To Do Much About This. Sometimes, he even says it in a way that non-insiders are likely to understand. But when he talks to lawmakers, he says things like:
Never mind the merits of “the policy world should totally ignore superintelligence”. Even if you agree with that (IMO extreme and false) claim, there is no justifying calling these risks “long-term”, “abstract”, and “distant” when you have timelines a fraction as aggressive as Dario’s!!
See also Jack Clark’s communication on this issue, and my criticism at the time (https://x.com/robbensinger/status/1834325868032012296). This was in 2024. I don’t think it’s great for Dario to be systematically making the same incredibly misleading elisions two years after this pretty major issue was pointed out to his co-founder.
I’m not criticizing Anthropic or Open Phil for being “careful how they phrase things”. I’m criticizing them for being careful in exactly the wrong direction. Any communication they send out that sends a “we have things covered, this is business-as-usual, no need to worry” signal is potentially not just factually misleading, but destructive of society’s ability to orient to what’s happening and course-correct. Anthropic is the “Machines of Loving Grace” company; it’s exactly the company that has put way more effort, early and often, into communicating how powerful and cool this technology is, while being consistently nervous and hedged about alerting others to the hazards.
This is exactly the opposite of what “being careful how you phrase things” should look like. Anthropic should have internal processes for catching any tweet that risks implicitly sending a “this is business-as-normal” or “we have everything handled” message, to either filter those out or flag them for evaluation. Sending that kind of message is much more dangerous than any ordinary reputational risk a company faces.
Re ‘MIRI is saying strategy is bad, but if MIRI had been strategic then they might not have started the deep learning revolution’: I think that this just didn’t happen. Per the https://x.com/allTheYud/status/2042362484976468053 thread, I think this is just a myth that propagates because it’s funny. (And because Sam Altman is good at spreading narratives that help him out.)
I don’t think MIRI accelerated timelines on net, and if it did, I don’t think the effect was large. I’d also say that if this happened, it was in spite of one of MIRI’s top obsessions for the last 20+ years being “be ultra cautious around messaging that could shorten AI timelines”.
(Like, as someone who’s been at MIRI for 13 years, this is literally one of the top annoying things constraining everything I’ve written and all the major projects I’ve seen my colleagues work on. Not because we think we’re geniuses sitting on a trove of capabilities insights, but just because we take the responsibility of not-accidentally-contributing-to-the-race extraordinarily seriously.)
But whatever, sure. If you want to accuse MIRI of hypocrisy and say that we’re just as culpable as the AI labs, go for it. You can think MIRI is terrible in every way and also think that the Anthropic cluster is not handling AI risk in a remotely responsible way.
Set aside the years of Anthropic poisoning the commons with its public messaging, poisoning efforts at international coordination by being the top lab preemptively shitting on the possibility of US-China coordination, and poisoning the US government’s ability to orient to what’s happening by selling half-truths and absurd frames to Senate committees.
Even without looking at their broad public communications, and without critiquing what passes for a superintelligence alignment or deployment plan in Anthropic’s public communications, Anthropic has behaved absurdly irresponsibly, lying to the public about their RSP being a binding commitment, lying to their investors re ‘we’re not going to accelerate capabilities progress’, and specifically targeting the most dangerous and difficult-to-control AI capabilities (recursive self-improvement) in a way that may burn years off of the remaining timeline.
Just to be clear: nowhere in this thread, or anywhere else, have I asked Anthropic to say something like that. Everything I’ve said above is compatible with thinking that Anthropic has a chance at solving superintelligence alignment. “I think I have a chance at solving superintelligence alignment!” is not an excuse for Anthropic or Dario’s behavior.
I agree it’s too glib as an argument for “international coordination to ban superintelligence is easy”. It isn’t easy. In the context of a conversation where most people are seriously underweighting the possibility, “governments have been known to ban scary or weird tech” and “governments have been known to enact policies that cost them money” are useful correctives, but they should be correctives pointing toward “this seems hard but maybe doable”, not “this seems easy”.
How are we doing that, exactly?
Like, this is one of the most foregrounded claims in Dario’s essay. He repeats a bunch of easily-checked falsehoods about the MIRI argument, at the very start of the essay, while warning that this view’s skepticism about alignment tractability is a “self-fulfilling belief”. He then proceeds to shit on the possibility of the US coordinating with China to avoid building superintelligence, which seems like a much more classic example of “belief that could easily be self-fulfilling”.
What is the mechanism whereby Dario criticizing MIRI is “cooperating” (is it that he didn’t mention us by name, preventing people from fact-checking any of his claims?), and MIRI staff criticizing Dario is “defecting”? What, specifically, is the wrench I’m throwing in Anthropic’s plans by tweeting about this? Is a key researcher on Chris Olah’s team going to get depressed and stop doing interpretability research unless I contribute to the “Anthropic is the Good Guys and OpenAI is the Bad Guys” narrative? Is Anthropic at risk of losing its lead in the race if MIRI people are open about their view that all the labs are behaving atrociously? Should I have dropped in a claim that everyone who disagrees with me is “quasi-religious”, the same way Dario’s cooperative essay begins?
If you think I’m factually mistaken, as you said at the start of your reply, then that makes sense. But surely that would be an equally valid criticism whether I were saying pro-Anthropic stuff or anti-Anthropic stuff. Why this separate “MIRI is defecting” idea?
Yeah. And when MIRI voiced early skepticism of OpenAI in private conversation, we were told that it was crucial to support Sam and Elon’s effort because Demis was untrustworthy. Counting up from zero, OpenAI could be framed as amazing progress: a nonprofit! Run by people vocally alarmed about x-risk! And they’re struggling for cash in the near term (in spite of verbal promises of funding from Musk), which gives us an opportunity to buy seats on the board!
Anthropic may or may not be slightly better than OpenAI. OpenAI may or may not be slightly better than DeepMind. I don’t think the lesson of history is that OpenPhil-cluster people are good at telling the difference between “this is marginally better than what the other guys are doing” and “this is good enough to actually succeed”.
But nothing I’ve said above depends on that claim. You can disagree with me about how likely Anthropic is to save the world, and still think there’s an egregious candor gap between the average Anthropic public statement and the scariest paragraphs buried in “The Adolescence of Technology”, and a further egregious candor gap between “The Adolescence of Technology” and e.g. Ryan Greenblatt’s post or https://x.com/MaskedTorah/status/2040270860846768203.
I don’t think the “circle-the-wagon” approach has served EA well throughout its history, and I don’t think people self-censoring to that degree is good for governments’ or labs’ ability to orient to reality.
Some helpful points, thanks. I responded in more depth on Twitter, but I don’t want to duplicate every conversation there here, so I’m just signposting that people should check the thread there for most of my opinions.
I used to support such a portfolio approach, but subsequently realized that it’s actually not safe (i.e., is potentially net-negative even aside from opportunity costs), or the portfolio has to be restricted a lot. This is because, due to the existence of illegible AI safety problems, solving some (i.e., more legible) AI safety problems can actually make the overall situation worse, by increasing the chances of an unsafe AI being developed or deployed.
According to this logic, safer strategies include:
Pausing AI, and other actions that help broadly with both legible and illegible problems, like improving societal epistemic health.
Making illegible problems more legible.
Working directly on illegible problems.
Another reason to think that many “AI safety strategies” are actually not safe is that even nominally altruistic humans are more power/status-seeking[1] than actually altruistic, and one way this manifests is that they tend to neglect risks more than they should (if they were actually altruistic). See my Managing risks while trying to do good. BTW these days I think not making this idea more prominent early in rationalism/EA/AI safety is a core failure that is upstream of many other errors.
I have an old post about power/status being fundamental to human motivation, which I remember @Scott Alexander liked.
For the purposes of this argument to work, it’s important that the legible problems are so legible that a lack of solutions would prevent deployment.
When previously asked which problems were in this category, you said:
Now, I would actually say that this list overestimates AI companies’ willingness to gate deployment on unsolved problems. There have been many woke versions of Grok, suggesting they weren’t gating deployments on that. I think most current models can be jailbroken into helping with terrorism (they’re just not smart enough to be very helpful yet). It remains to be seen whether companies will hold off on releasing models that could help a lot with terrorism. I’m not so sure they will.
But even if we took this at face value: It doesn’t seem like avoiding work on these mentioned problems would mean restricting the portfolio a lot. When referring to “playing a portfolio of all the different desperate hard strategies in the hopes that one of them works”, I think that’s mostly about solving problems that wouldn’t prevent deployment if they were unsolved, or gathering evidence for such illegible problems. (Centrally: The problem of scheming models taking over the world, which is not one that I expect companies to wait for a solution on absent further evidence that it’s a problem.)
Applying the idea is tricky and context-dependent. For example, gathering evidence for scheming seems unambiguously good, but actually solving scheming could be bad (unless you’re sure that such evidence can’t be gathered, or companies will not gate on this problem regardless), because some time in the future, it may well become legible enough to be gating deployment. (Also keep in mind that it’s not just legibility/gating by the companies, but also by other policymakers such as voters and politicians.)
Given the tradeoffs apparent to me (including that the benefits of solving scheming are limited by other safety problems), I think it may well be an example of a safety problem that is net negative to work on, and something I wouldn’t want to do myself. But I’m unsure how to argue for this convincingly (and also am just not certain enough to want to talk other people out of working on this specifically) which is why I’m only talking about it in response to your comment.
Gotcha.
FWIW, on my views, work to prevent scheming looks pretty clearly great. Pausing to wait for a solution to scheming doesn’t seem super likely, and going from [scheming models widely deployed] –> [non-scheming models widely deployed] seems significantly more valuable than going from [non-scheming models widely deployed] –> [temporary pause to solve scheming].
A lot of the listed topics here are problems that we could have plenty of time to work on after the singularity. I’m sympathetic to arguments that bad things might get locked-in, but I don’t really think the arguments for this have a disjunctive nature where we’re very likely to run into at least one type of bad lock-in. There’s just a decent chance that we do an ok job of developing AIs and handing over to a society that’s more capable than us at dealing with these issues (not a super high bar), in which case a pause wouldn’t add much. (The arguments that make me feel most pessimistic about the future are arguments that humans might just not be motivated to do good things — but it’s not clear why pauses would help much with that issue.)
The aim of a pause would be to plan out the transition better, or make humans smarter/wiser so they can navigate the transition better, so that we end up handing over remaining problems to a counterfactually more capable society. In other words, the bar shouldn’t be “more capable than us” but a society that could realistically be achieved with a pause.
One issue related to this is that humans today largely want to do good things as a side effect of virtue signaling / status games that they’re doing/playing. This is currently far from optimal, which makes me scared to undergo an AI transition that could potentially lock-in such highly suboptimal motivations/values, and also scared that the AI transition could just scramble or reset these status games and remove what good motivations/values we do have. A pause would preserve the status quo and give people more time to think about such issues (including time for the idea to spread), and potentially find ways to make the AI transition go better in these regards (compared to today when there has been almost no thought on these issues at all).
But see also this recent quick take where I expressed that my optimism about a pause is pretty limited.
If the society is “more capable than us” in some average sense, where we still have certain advantages over them, then I agree that we could still contribute things.
If the society is “more capable (and good) than us” in all the important ways, then they’d also be better at making themselves smarter/wiser than we would have been, and better at handling the transition, so further pauses really wouldn’t have contributed much.
Idk, I don’t particularly want to argue about definitions here. I just think there’s a decent chance that I’ll look back after the singularity and be like “yep, the sloppy transition sure meant that we took on a bunch of ex-ante risk, but since we got lucky, extra pause time wouldn’t have helped vis-a-vis the long-run lock-in issues. Anything they could have done to help is stuff we can do better now.” (And/or: Marginal pause time may have been good or bad via various values or power changes, but it wouldn’t have systematically led to improvements from everyone’s perspective by e.g. enabling additional intellectual work, because it turns out it was fine to defer the relevant intellectual work until later.)
Even for this society, if it’s in the future, part of the transition would have already occurred, so they won’t have the opportunity to make it go better. So by not pausing now we’d permanently give up this opportunity.
Take the issue in this recent comment, of building an initial AGI that reasons well or poorly about domains that lack fast/cheap feedback signals. It seems very plausible that our long-term civilizational trajectory is significantly affected by which type of AGI gets built first. Suppose we end up building one that reasons poorly about such domains, then:
The post-AGI civilization may end up being less capable (and good) than us on average, or in some important ways.
Even if they’re actually more capable (and good) than us in all the important ways, they could have been even better if only we had built an AGI that reasons well in such domains, but they can’t go back in time and change this.
I of course agree, but I’d think this would mostly be an issue of capabilities or goodness of our future society, since there’s not much external to our society that’s getting worse as a result of the transition. Anyway, that seems like maybe one of those definitional issues. I think you’re probably right that there are some possible changes that aren’t well characterized as being about the capabilities or goodness of our society, so an improvement in those dimensions isn’t strictly speaking sufficient for a pause to not have been valuable.
I care more about my claim that started with “I just think there’s a decent chance...”. (Which is importantly only asserting a decent chance, not saying that there aren’t plausible ways it could be false.)
Copying over my response to Scott from Twitter (with a few additions in square brackets):
I think my biggest disagreement here is about the concept of strategic communications.
In particular, you claim that MIRI should have been more PR-strategic to avoid hyping AI enough that DeepMind and OpenAI were founded.
Firstly, a lot of this was not-very-MIRI. E.g. contrast Bostrom’s NYT bestseller with Eliezer popularizing AI risk via fanfiction, which is certainly aimed much more at sincere nerds. And I don’t think MIRI planned (or maybe even endorsed?) the Puerto Rico conference.
But secondly, even insofar as MIRI was doing that, creating a lot of hype about AI is also what a bunch of the allegedly PR-strategic people are doing right now! Including stuff like Situational Awareness and AI 2027, as well as Anthropic. [So it’s very odd to explain previous hype as a result of not being strategic enough.]
You could claim that the situation is so different that the optimal strategy has flipped. That’s possible, although I think the current round of hype plausibly exacerbates a US-China race in the same way that the last round exacerbated the within-US race, which would be really bad.
But more plausible to me is the idea that being loud and hype-y is often a kind of self-interested PR strategy which gets you attention and proximity to power without actually making the situation much better, because power is typically going to do extremely dumb stuff in response. And so to me a much better distinction is something like “PR strategies driven by social cognition” (which includes both hyping stuff and also playing clever games about how you think people will interpret you) vs “honest discourse”.
To be clear I don’t have a strong opinion about how much IABIED fits into one category vs the other, seems like a mix. A more central example of the former is Situational Awareness. A more central example of the latter is the Racing to the Precipice paper, which lays out many of the same ideas without the social cognition.
My other big disagreement is about which alignment work will help, and how. Here I have a somewhat odd position of both being relatively optimistic about alignment in general, and also thinking that almost all work in the field is bad. This seems like too big a thing to debate here but maybe the core claim is that there’s some systematic bias which ends up with “alignment researchers” doing stuff that in hindsight was pretty clearly mainly pushing capabilities.
Probably the clearest example is how many alignment researchers worked on WebGPT, the precursor to ChatGPT. If your “alignment research” directly leads to the biggest boost for the AI field maybe ever, you should get suspicious! I have more detailed models of this which I’ll write up later but suffice to say that we should strongly expect Ilya to fall into similar traps (especially given the form factor of SSI) and probably Jan too. So without defusing this dynamic, a lot of your claimed wins don’t stand up.
More than any other group I’ve been a part of, rationalists love to develop extremely long and complicated social grievances with each other, taking pages and pages of text to articulate. Maybe I’m just too stupid to understand the high level strategic nuances of what’s going on—what are these people even arguing about? The exact flavor of comms presented over the last ten years?
As someone who spends a significant part of his time briefing policymakers in Europe, ministerial advisors, senior civil servants in AI governance, I want to point out something obvious from where I stand, but absent from this discussion.
The “radical transparency vs. strategic communication” debate presupposes that framing is the bottleneck. It isn’t. The bottleneck is volume. Most policymakers have never heard the argument, no matter how you frame it. Among the ones I interact with, maybe 2% have been exposed to the problem enough to have an opinion. Another 10% or so have heard something, but mostly through the Yann LeCun-adjacent dismissals, and formed their view from that. The remaining ~88%, including people in very important AI governance positions, have simply never had the conversation.
The question of which approach works better is real but secondary. What’s missing is more people doing this work at all. It’s a campaign, and the limiting factor is coverage, not the message.
To give a concrete data point: the only policymaker in my circles who has ever brought up “If Anyone Builds It, Everyone Dies” is Lord Tim Clement-Jones, chair of the All-Party Parliamentary Group on AI in the UK. And he was probably already sympathetic. That’s one person.
Um, I think that long, detailed, audited arguments are how we do a substantial amount of social capital and resource allocation around these parts.
And also, um, it is better than most alternative ways of doing it (e.g. networking, politicking).
Among other things, the fact that one of the leading ASI labs is substantially downstream of us. Separately, a lot of real actual politics that tends to happen in the community around prestige and money and talent allocation and respect, which needs to get litigated somehow (and abuse of power and legitimacy is common and if you can’t talk about it you can’t have norms about it).
I think if your main interaction with PauseAI is a certain Twitter account, as served to you by the algorithm in interactions with your AI safety friends, then you might think that they’re mostly going after other, more moderate safety advocates. But this just isn’t a good picture of the overall actions of the movement. At least in the case of PauseAI UK, of which I have a decent understanding of our inner workings, essentially zero time is spent thinking about other AI safety advocates. I expect that the same is true of Yudkowsky and MIRI.
Of course it is the case that being rude towards people working on safety teams at OpenAI on Twitter makes some things worse on some axes. And this is mostly bad and pointless and I don’t endorse it. But that’s not even really what that post from Rob was doing! Rob was writing an opinionated, but civil, criticism. In what way is this “knifing” the other AI safety advocates? It’s not like MIRI killed SB 1047.
Now if Scott means something like “Giving money to MIRI pushes the world in the MIRI-preferred direction, and this would have meant no Anthropic and no safety team at OpenAI” then I can kind of maybe see what he means here. This just isn’t “knifing” in the sense of the betrayal that most people mean by the word. It’s just opposing someone’s plan, in a way that they’ve been doing for years. It’s not like MIRI would have actually used marginal resources to stop Anthropic from being created by, like, sabotage or something.
MIRI don’t even say that working in safety is bad! They only say that they think their approach is better. IABIED specifically states that they think mech interp researchers are “heroes” (as part of an example of research they think won’t work in time without political action).
There is a phenomenon in which rationalists sometimes make predictions about the future, and they seem to completely forget their other belief that we’re heading toward a singularity (good or bad) relatively soon. It’s ubiquitous, and it kind of drives me insane. Consider these two tweets:
Timelines are really uncertain and you can always make predictions conditional on “no singularity”. Even if singularity happens you can always ask superintelligence “hey, what would be the consequences of this particular intervention in business-as-usual scenario” and be vindicated.
This is true, but then why not state “conditional on no singularity” if they intended that? I somehow don’t buy that that’s what they meant.
Why would they spend ~30 characters in a tweet to be slightly more precise while making their point more alienating to normal people who, by and large, do not believe in a singularity and think people who do are faintly ridiculous? The incentives simply are not there.
And that’s assuming they think the singularity is imminent enough that their tweets won’t be borne out even beforehand. And assuming that they aren’t mostly just playing signaling games. Both of these tweets read less as sober analysis to me, and more like in-group signaling.
Absolutely agreed. Wider public social norms are heavily against even mentioning any sort of major disruption due to AI in the near future (unless limited to specific jobs or copyright), and most people don’t even understand how to think about conditional predictions. Combining the two is just the sort of thing strange people like us do.
Because that’s a mouthful? And the default for an ordinary person (which is potentially most of their readers) is “no Singularity”, and the people expecting the Singularity can infer that it’s clearly about a no-Singularity branch.
I think the general population doesn’t know all that much about singularity, so adding that to the part would just unnecessarily dilute it.
This is definitely baked in for many people (e.g. me, but also see the discussion here for example).
See also: population decline discourse
I think Richard has one to two decade timelines?
Two decades don’t seem like enough to generate the effect he’s talking about. He might disagree though.
Conditional on being around to look back, it seems pretty plausible to me that lack of trust and competence within major powers will have made the outcome of AGI significantly worse than it could have been.
A (partial, not very good) analogy is that, at this point, the developed world is pretty altruistic towards the developing world (e.g. to the tune of many billions of dollars of aid per year). But the developing world might still really wish it’d had fewer internal ethno-religious fractures during the Industrial Revolution (or indeed at any time since then).
For a while now, some people have been saying they ‘kinda dislike LW culture,’ but for two opposite reasons, with each group assuming LW is dominated by the other—or at least it seems that way when they talk about it. Consider, for example, janus and TurnTrout who recently stopped posting here directly. They’re at opposite ends and with clashing epistemic norms, each complaining that LW is too much like the group the other represents. But in my mind, they’re both LW-members-extraordinaires. LW is clearly obviously both, and I think that’s great.
What are the two groups in question here?
I think it’s probably more of a spectrum than two distinct groups, and I tried to pick two extremes. On one end, there are the empirical alignment people, like Anthropic and Redwood; on the other, pure conceptual researchers and the LLM whisperers like Janus, and there are shades in between, like MIRI and Paul Christiano. I’m not even sure this fits neatly on one axis, but probably the biggest divide is empirical vs. conceptual. There are other splits too, like rigor vs. exploration or legibility vs. ‘lore,’ and the preferences kinda seem correlated.
Whenever I try to “learn what’s going on with AI alignment” I wind up on some article about whether dogs know enough words to have thoughts or something. I don’t really want to kill off the theoretical term (it can peek into the future a little later and function more independent of technology, basically) but it seems like kind of a poor way to answer stuff like: what’s going on now, or if all the AI companies allowed me to write their 6 month goals, what would I put on it.
I’m curious about what people disagree with regarding this comment. Also, I guess since people upvoted and agreed with the first one, they do have two groups in mind, but they’re not quite the same as the ones I was thinking about (which is interesting and mildly funny!). So, what was your slicing up of the alignment research x LW scene that’s consistent with my first comment but different from my description in the second comment?
To a first approximation, in a group, if people at both ends of a dimension are about equally unhappy with what the moderate middle does, assuming that is actually reasonable, but hard to know, then it’s probably balanced.
People are very worried about a future in which a lot of the Internet is AI-generated. I’m kinda not. So far, AIs are more truth-tracking and kinder than humans. I think the default (conditional on OK alignment) is that an Internet that includes a much higher population of AIs is a much better experience for humans than the current Internet, which is full of bullying and lies.
All such discussions hinge on AI being relatively aligned, though. Of course, an Internet full of misaligned AIs would be bad for humans, but the reason is human disempowerment, not any of the usual reasons people say such an Internet would be terrible.
I think the problem is that the competitive dynamics that make humans worse on the internet (eg short epistemically-ungrounded outrage bait gets more engagement than more careful and reasoned analysis) will apply to AIs as well as to humans.
Yup, but the AIs are massively less likely to help with creating cruel content. There will be a huge asymmetry in what they will be willing to generate.
Imagine an Internet where half the population is Grant Sanderson (the creator of 3Blue1Brown). That’d be awesome. Grant Sanderson has the same incentives as anyone else to create cruel and false content, but he just doesn’t.
That would be awesome! For me!
But I don’t think that the majority of people in the world would prefer that to the current internet, much less actually engage with it more than the current internet. Most people find math boring (even when it is explained as well as when Grant does the explaining). There would be an incentive to produce content that is more engaging for most of the population than linear algebra explanations.
One difference between the releases of previous GPT versions and the release of GPT-5 is that it was clear that the previous versions were much bigger models trained with more compute than their predecessors. With the release of GPT-5, it’s very unclear to me what OpenAI did exactly. If, instead of GPT-5, we had gotten a release that was simply an update of 4o + a new reasoning model (e.g., o4 or o5) + a router model, I wouldn’t have been surprised by their capabilities. If instead GPT-4 were called something like GPT-3.6, we would all have been more or less equally impressed, no matter the naming. The number after “GPT” used to track something pretty specific that had to do with some properties of the base model, and I’m not sure it’s still tracking the same thing now. Maybe it does, but it’s not super clear from reading OpenAI’s comms and from talking with the model itself. For example, it seems too fast to be larger than GPT-4.5.
A “GPT-5” named according to the previous convention in terms of pretraining compute would need at least 1e27 FLOPs (50x original GPT-4), which on H100/H200 can at best be done in FP8. Which could be done with 150K H100s for 3 months at 40% utilization. (GB200 NVL72 is too recent to use for this pretraining run, though there is a remote possibility of B200.) A compute optimal shape for this model would be something like 8T total params, 1T active[1].
The speed of GPT-5 could be explained by using GB200 NVL72 for inference, even if it’s an 8T total param model. GPT-4.5 was slow and expensive likely because it needed many older 8-chip servers (which have 0.64-1.44 TB of HBM) to keep in HBM with room for KV caches, but a single GB200 NVL72 has 14 TB of HBM. At the same time, it wouldn’t help as much with the speed of smaller models (but it would help with their output token cost because you can fit more KV cache in the same NVLink domain, which isn’t necessarily yet being reflected in prices, since GB200 NVL72 is still scarce). So it remains somewhat plausible that GPT-5 is essentially GPT-4.5-thinking running on better hardware.
Its performance (quality, not speed) though suggests that it might well be a smaller model with pretraining scale RLVR (possibly like Grok 4, just done better). Also, doing RLVR on an 8T total param model using the older 8-chip servers would be slow/inefficient, and GB200 NVL72 might’ve only started appearing in large numbers in late Apr 2025. The METR report on GPT-5 states they gained access “four weeks prior to its release”, which means it was already essentially done by end of Jun 2025. So RLVRing a very large model on GB200 NVL72 is in principle possible in this timeframe, but probably not what happened, and more to the point given its level of performance probably not what needed to happen. This way, they get a better within-model gross margin and can work on the actual very large model in peace; maybe they’ll call it “GPT-6”.
This is assuming 120 tokens/param as compute optimal, 40 tokens/param from Llama 3 405B as the dense anchor, and 3x that for 1:8 sparsity.
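As a rough check on the numbers in this comment, here is a minimal sketch of the arithmetic. It assumes a dense FP8 peak of about 2e15 FLOP/s per H100 and the common 6 × params × tokens approximation for training FLOPs; neither figure is stated above, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the pretraining-compute figures (all inputs are estimates).
SECONDS_PER_DAY = 86_400

# Assumption: dense FP8 peak throughput of one H100, in FLOP/s.
H100_FP8_PEAK = 2e15


def training_days(total_flops, n_chips, peak_flops_per_s, utilization):
    """Days needed to deliver total_flops on n_chips at the given utilization."""
    delivered_per_s = n_chips * peak_flops_per_s * utilization
    return total_flops / delivered_per_s / SECONDS_PER_DAY


def training_flops(active_params, tokens_per_param):
    """Assumed 6 * N * D approximation, with D = tokens_per_param * N."""
    tokens = tokens_per_param * active_params
    return 6 * active_params * tokens


# 1e27 FLOPs on 150K H100s at 40% utilization.
days = training_days(1e27, 150_000, H100_FP8_PEAK, 0.40)

# 1T active params at 120 tokens per active param.
flops = training_flops(1e12, 120)

print(f"{days:.0f} days to deliver 1e27 FLOPs")
print(f"{flops:.1e} training FLOPs for 1T active params")
```

Under these assumptions the run takes a little over three months, and the 1T-active shape lands within a small factor of 1e27 FLOPs, which is consistent with the estimates above.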
Ah, interesting! So the speed we see shouldn’t tell us much about GPT-5’s size.
I omitted one other factor from my shortform, namely cost. Do you think OpenAI would be willing to serve an 8T params (1T active) model for the price we’re seeing? I’m basically trying to understand whether GPT-5 being served for relatively cheap should be a large or small update.
Prefill (processing of input tokens) is efficient, something like 60% compute utilization might be possible, and that only depends on the number of active params. Generation of output tokens is HBM bandwidth bound, depends on the number of total params and the number of KV cache sequences for requests in a batch that fit on the same system (which share the cost of chip-time[1]). With GB200 NVL72, batches could be huge, dividing the cost of output tokens (still probably several times more expensive per token than prefill).
For prefill, we can directly estimate at-cost inference from the capital cost of compute hardware, assuming a need to pay it back in 3 years (it will likely serve longer but become increasingly obsolete). An H100 system costs about $50K per chip ($5bn for a 100K H100s system). This is all-in for compute equipment, so with networking but without buildings and cooling, since those serve longer and don’t need to be paid back in 3 years. Operational costs are maybe below 20%, which gives $20K per year per chip, or $2.3 per H100-hour. On gpulist, there are many listings at $1.80 per H100-hour, so my methodology might be somewhat overestimating the bare bones cost.
For GB200 NVL72, which are still too scarce to get a visible market price anywhere close to at-cost, the all-in cost together with external networking in a large system is plausibly around $5M per 72-chip rack ($7bn for a 100K chip GB200 NVL72 system, $30bn for Stargate Abilene’s 400K chips in GB200/GB300 NVL72 racks). This is $70K capital cost per chip, or $27.7K per year to pay it back in 3 years with 20% operational costs. This is just $3.2 per chip-hour.
A 1T active param model consumes 2e18 FLOPs for 1M tokens. GB200 chips can do 5e15 FP8 FLOP/s or 10e15 FP4 FLOP/s. At $3.2 per chip-hour and 60% utilization (for prefill), this translates to $0.6 per million input tokens at FP8, or $0.3 per million input tokens at FP4. The API price for the batch mode of GPT-5 is $0.62 input, $5 output. So it might even be possible with FP8. And the 8T total params wouldn’t matter with GB200 NVL72, they fit with space to spare in just one rack/domain.
This is an at-cost estimate, in contrast to the cloud provider prices. Oracle is currently selling 4-chip instances from GB200 at $16 per chip-hour. But it’s barely on the market for now, so the prices don’t yet reflect costs. And for example GCP is still selling an H100-hour for $8 (a3-megagpu-8g instances). So for the major clouds, the price of GB200 might end up only coming down to $11 per chip-hour in 2026-2027, even though the bare bones at-cost price is only $3.2 per chip-hour (or a bit lower).
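The at-cost arithmetic above can be sketched in a few lines. This is only a restatement of the estimates in this comment (3-year payback, 20% operational costs on top, 60% prefill utilization, 2 FLOPs per active param per token); the function names are hypothetical.

```python
# Sketch of the at-cost estimate: capital cost -> $/chip-hour -> $/M input tokens.
HOURS_PER_YEAR = 24 * 365


def chip_hour_cost(capital_per_chip, payback_years=3, opex_fraction=0.20):
    """At-cost $/chip-hour: capital repaid over payback_years, plus opex on top."""
    yearly = capital_per_chip / payback_years * (1 + opex_fraction)
    return yearly / HOURS_PER_YEAR


def prefill_cost_per_million_tokens(active_params, peak_flops_per_s,
                                    utilization, dollars_per_chip_hour):
    """Prefill cost for 1M input tokens, at 2 FLOPs per active param per token."""
    flops = 2 * active_params * 1e6
    chip_hours = flops / (peak_flops_per_s * utilization) / 3600
    return chip_hours * dollars_per_chip_hour


h100_hour = chip_hour_cost(50_000)           # $50K all-in per H100
gb200_hour = chip_hour_cost(5_000_000 / 72)  # $5M per 72-chip rack

# 1T active params on a GB200 chip at 60% utilization.
fp8_price = prefill_cost_per_million_tokens(1e12, 5e15, 0.60, gb200_hour)
fp4_price = prefill_cost_per_million_tokens(1e12, 10e15, 0.60, gb200_hour)

print(f"H100:  ${h100_hour:.2f} per chip-hour")
print(f"GB200: ${gb200_hour:.2f} per chip-hour")
print(f"Prefill: ${fp8_price:.2f}/M tokens at FP8, ${fp4_price:.2f}/M at FP4")
```

The payback period and opex fraction are the main levers here: stretching the payback to 4 or 5 years lowers every figure proportionally, so these numbers are better read as an upper-ish bound on bare-bones cost than as a precise price.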
I’m counting chips rather than GPUs to future-proof my terminology, since Huang recently proclaimed that starting with Rubin, compute dies will be considered GPUs (at March 2025 GTC, 1:28:04 into the keynote), so that a single chip will have 2 GPUs, and with Rubin Ultra a single chip will have 4 GPUs. It doesn’t help that Blackwell already has 2 compute dies per chip. This is sure to lead to confusion when counting things in GPUs, but counting in chips will remain less ambiguous.
Perhaps unlikely, but could it be that different versions of GPT-5 (i.e., normal model, thinking model, and thinking-pro model) are actually of different sizes? Or do we know for sure that they all share the same architecture?
Alignment seems quite similar to the problem of imbuing AIs with artistic taste. Morality and taste are both hard to verify and subjective (or inter-subjective). Alignment has in practice the further difficulty that deception may play a role. I.e., even after managing to train moral principles into an AI system, you have to make sure they actually act as a guide for action.
That said, my very subjective impression is that AI is far ahead in terms of ethical taste compared to artistic taste. Perhaps this is thanks to the fact that alignment has been considered a core AI problem for a much longer time.
Disagree for the meaning of “alignment” I most care about, where alignment is about trying to do what the operator wants.
“make the world better!” <-- doing what this operator wants requires some aesthetic sensibility.
You missed “trying.” Succeeding requires certain capabilities, but trying to do it does not. I believe there’s much more risk from AIs not trying to do what the operator wants than trying and failing.
i see your point, thanks.
If you assign a different meaning to the word, then you’re talking about a different thing, and the point changes accordingly.
I agree.
the observation—that aesthetics and morality are complementary—is very sharp. but i’m not sure what it has to do with alignment in particular.
it’s a bit like saying “the problem of getting ais to cultivate chickens is just a problem of getting them to cultivate eggs!”[1] true, of course, but more a fact about chickens and eggs than about ais.
(the chickens will be cultivated to live happy and fulfilling chicken lives, in this hypothetical.)
Previously, I said:
I feel good about this prediction so far. Instagram and TikTok now have a significant amount of AI-generated videos (though they haven’t overrun these platforms by any means). The categories I’ve seen so far are:
- Low-brow animated stories.
- Fantasy or sci-fi scenarios with music.
- Colorful AI-generated art.
- Cute meme animals.
The greatest sin of this content is that it’s often low quality. But it’s not really that great of a sin. I think, all things considered, AI slop is above average content. Other content often contains bullying, meanness, lies. AI-generated content rarely so.
Also, so far, this is mostly thanks to humans and to AI guardrails, not really due to the character of AIs as I expected in my initial quick take. It looks like humans are using this tech in mostly good-spirited ways so far.
There is also a significant category of “AI video passing itself off as a real video”, and many videos have people debating in the comments if it’s real or AI. This seems like it can erode trust and is generally negative.
This seems anecdotal.
So far, we have documented cases of Generative AI being used to subvert elections in Romania (actually causing an annulment), and to some extent in NYC. We also have this report by OpenAI from 2024, which details such influence operations facilitated using OAI’s API. Given the proliferation of significantly more capable open-source models in the 18 months since, we can be fairly confident that broader, more complex operations are taking place today.
I also tend to associate AI-slop with low quality content, but we know AI is more capable than that, which leads me to believe that significant amounts of content online are parts of influence operations by malicious actors.
AFAIK that was not because of Gen AI, though the broader point of your comment does stand.
Rationalists and Pause AI people on X are accusing Davidad of suffering from AI psychosis. I think it’s them who have lost the plot actually, not Davidad. The move here looks political, rather than truth-tracking. “Davidad is now my political opponent, so I’m accusing him of being crazy.” This happened to Emmett Shear too at some point.
I also strongly believe AI psychosis to be a far more limited phenomenon than people here seem to believe. I think you’re treating it as a good soldier in your army of arguments rather than investigating it truthfully for what it is.
I don’t think Davidad has AI psychosis (his views seem to be quite coherent and aligned with his long-standing views on moral realism). But I think he is quirky and maybe expressing his views in a deliberately provocative way. He implied in his thread on leaving ARIA that it was his equivalent of the Death with Dignity post, which was also on April Fool’s, and that thread definitely had some weird stuff (e.g. calling it “Alignment with Awakening”).
Which rationalists?
Most likely the OP means this quick take by Ivan Vendrov. Davidad ended up believing that the LLMs have been grokking the Natural Abstract Goodness which is unlikely to be capturable by existing benchmarks. While I do buy the idea that the NAG exists, I don’t think that I understand how one can check that the LLMs really understood it.
Davidad also announced the other day that he’s leaving ARIA to pursue a research agenda focused on working with AIs on moral philosophy.
https://x.com/davidad/status/2039390998694891816
If we’re in a history simulation, I don’t think it’s unlikely that the simulators will just set us free in their reality. I’m expecting a more enlightened humanity to consider simulated humans moral patients.
There’s no real difference between a simulated human and a human. Both are causally-interacting signals in a computer. If you inhabit a simulation, you automatically inhabit base reality just as much as native base-reality beings. They’re also just interacting signals. It’s just that you’re embedded in some other software and they’re not, but that’s not fundamental at all.
If you thought simulated humans were moral patients, you might decide not to run history simulations. That being said, I don’t think that is an absolute rule.
What are they simulating us for? The reasons are pretty important for whether they’d want to treat us ethically. I have encountered only one reason I found really convincing.
Honestly, I wasn’t picturing their reasons. Say more?
To be clear, I generally disagree with most varieties of the simulation hypothesis. But nevertheless I do think that this question in particular has some good answers—a lot of which more-or-less reduce to ‘forecasting outcomes via an agent-based simulation’. Unfortunately that family of explanations wouldn’t give you much in the way of understanding the simulators, as it could be anything from “simulate outcomes of X choice in a copy of [the simulator’s] world” (which would tell you that they’d be very similar to us) to “consider this strange hypothetical where apes evolved intelligence, for purely academic reasons” (which would tell you much less). Alternatively there are some which veer far away from that, to e.g. entertainment, or pleasure of some kind (e.g. an elaborate form of pornography?), or even some kind of long-form training algorithm, and I do think that in those cases you can infer more about the motivations of the simulator.
If I spend one century on Earth and one millennium in heaven, why am I on Earth?
(SIA can answer this, but only by giving an account of why there’s much more total experience in the world if sims are resurrected than not. The most plausible means here are that the latter is correlated with cooperative norms prevailing across the cosmos generally, such that a much larger surplus is available for conscious experience in general. But then taking SIA seriously means embracing infinite populations and I get very confused about how to reason then.)
One slightly trollish answer is “someone figures out how to merge minds and this turns out to be a highly desirable thing to do, so most independent observer-moments are before the point where mind-merging is common”.
People seemed confused by my take here, but it’s the same take Davidad expressed in this thread that has been making rounds: https://x.com/davidad/status/2011845180484133071
I’m pretty sure a human brain could, in principle, visualize a 4D space just as well as it visualizes a 3D space, and that there are ways to make that happen via neurotech (as an upper bound on difficulty).
Consider: we know a lot about how 4-dimensional spaces behave mathematically, probably no less than how 3-dimensional spaces work. Once we know exactly how the brain encodes and visualizes a 3D space in its neurons, we probably also understand how it would do it for a 4D space if it had sensory access to it. Given good enough neurotech, we could manually craft the circuits necessary to reason intuitively in 4D.
Also, another insight/observation: insofar as AIs can have imagination, an AI trained in a 4D environment should develop 4D imagination (i.e., the circuits necessary to navigate and imagine 4D intuitively). The same should be true about human-brain emulations in 4D simulations.
This argument seems to work for N-D space for any N which doesn’t seem right. I think we definitely do know less about 4D space than 3D, partly because we’re much more interested in 3D, partly because there’s just (a lot) more going on in 4D.
Intuitively it feels like current AI should be much better at learning navigation in 4D than human brains. Brains have real architecture-level, baked in task-specific circuits, which AI lack, and reconstructing a 3D world is arguably the most important of those. Sure, you could modify them with neurotech to change that, but you could do that for virtually any task so it doesn’t seem very meaningful.
There’s also the problem that human sensors are inherently 3D. It’s not clear how you would translate eyes into 4D. If you do pick a way to do this, and leave visual processing circuits the same, the circuits aren’t getting their expected data stream anymore. Brains are clearly pretty good at coping with this, like in blind people where visual processing circuits are (at least partially) co-opted for other things, but blind people are clearly worse at navigating the 3D world than sighted people, and it seems like the same would be true for humans vs 4D-native beings (like AI).
Ah, I get what you are saying, and I agree. It’s possible the human brain architecture, as-is, can’t process 4D, but I guess we’re mismatched in what we think is interesting. The thrust of my intuition here was more “wow, someone could understand N-D intuitively in a 3D universe, this doesn’t seem prohibited”, regardless of whether it’s the same architecture of a human brain exactly. Like, the human brain as it is right now might not permit that, and neurotech might involve doing a lot of architectural changes (the same applies to emulations). I suppose it’s a lot less interesting an insight if you already buy that imagining higher dimensions from a 3D universe is in principle possible. The human brain being able to do that is a stronger claim that would have been more interesting if I actually managed to defend it well.
I suppose I was kinda sloppy saying “the human brain can do that”—I should have said “the human brain arbitrarily modified” or something like that.
I definitely think it’s interesting that it’s possible for N-D-substrate-computations to imagine / intuit N+1-D, but yeah, I feel like that’s mostly a given because we have the concept of N+1-D in the first place.
There are different levels of “imagine / intuit” though. Some people have particularly good or bad intuition for the 3D space we live in. I took your claim to be something like “the average brain could intuit 4D just as well as 3D, maybe requiring slight modification”. I think the modifications to reach true parity would be pretty extensive, because of how much 3D-specific architecture (as opposed to weights) human brains have. I do agree the modifications are theoretically possible, but the modifications to give a fruit fly human-level cognition are also theoretically possible with arbitrary modification.
Thought experiment: If a mad scientist gave a newborn infant a third eye that was offset along a fourth spatial dimension from the baby’s other two eyes, the baby’s brain would naturally acquire the ability to visualize in four dimensions. Wiring up three eyes probably requires three visual cortices, which will have knock-on effects on the overall geometry of the brain. I doubt that it requires the brain itself to be a 4D structure though.
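To make the “offset eye” idea concrete, here is a minimal sketch of one possible convention for a 4D eye. It is an assumption of mine, not something anyone in this thread specified: a pinhole eye looking along +z, whose “retina” is 3D because the image keeps the x, y and w directions. The names `project_4d` and `depth_from_disparity` are hypothetical. Under that convention, an eye offset along w produces parallax in the w image coordinate, the same way two ordinary eyes offset along x produce parallax in x.

```python
def project_4d(point, eye, focal=1.0):
    """Pinhole projection of a 4D point (x, y, z, w) onto a 3D retina.

    Assumed convention (hypothetical): the eye sits at `eye` and looks
    along +z, so z is the depth axis and x, y, w are image directions.
    """
    x, y, z, w = (p - e for p, e in zip(point, eye))
    if z <= 0:
        raise ValueError("point is not in front of the eye")
    return (focal * x / z, focal * y / z, focal * w / z)


def depth_from_disparity(disparity, baseline, focal=1.0):
    """Recover depth z from the image shift between two offset eyes."""
    return focal * baseline / disparity


point = (2.0, 4.0, 2.0, 6.0)
normal_eye = (0.0, 0.0, 0.0, 0.0)
third_eye = (0.0, 0.0, 0.0, 1.0)  # offset by 1 along the fourth axis

image_a = project_4d(point, normal_eye)
image_b = project_4d(point, third_eye)

# Only the w image coordinate shifts between the two views.
w_disparity = image_a[2] - image_b[2]
depth = depth_from_disparity(w_disparity, baseline=1.0)
```

This says nothing about whether a brain could learn to use that signal. It only shows that the extra eye would receive depth information along a direction the other two eyes cannot provide.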
Has anyone proposed a solution to the hard problem of consciousness that goes:
Qualia don’t seem to be part of the world. We can’t see qualia anywhere, and we can’t tell how they arise from the physical world.
Therefore, maybe they aren’t actually part of this world.
But what does it mean they aren’t part of this world? Well, since maybe we’re in a simulation, perhaps they are part of the simulation. Basically, it could be that qualia : screen = simulation : video-game. Or, rephrasing: maybe qualia are part of base reality and not our simulated reality in the same way the computer screen we use to interact with a video game isn’t part of the video game itself.
Qualia are the only thing we[1] can see.
We don’t see objects “directly” in some sense, we experience qualia of seeing objects. Then we can interpret those via a world-model to deduce that the visual sensations we are experiencing are caused by some external objects reflecting light. The distinction is made clearer by the way that sometimes these visual experiences are not caused by external objects reflecting light, despite essentially identical qualia.
Nonetheless, it is true that we don’t know how qualia arise from the physical world. We can track back physical models of sensation until we get to stuff happening in brains, but that still doesn’t tell us why these physical processes in brains in particular matter, or whether it’s possible for an apparently fully conscious being to not have any subjective experience.
At least I presume that you and others have subjective experience of vision. I certainly can’t verify it for anyone else, just for myself. Since we’re talking about something intrinsically subjective, it’s best to be clear about this.
I don’t disagree with this at all, and it’s a pretty standard insight for anyone who has thought about this stuff at least a little. I think what you’re doing here is nitpicking the meaning of the word “see”, even if you’re not putting it like that.
A UBI of AI or compute kind of makes sense to me under safe superintelligence. I don’t know if AI companies will still be incentivized to sell AI if they can just use it themselves, but a state isn’t under the same incentives. Instead of centralizing the use of its AI systems, it could distribute access to all citizens so they can use it productively themselves.
Why not just give people money so they can use it to pay for compute? I guess it’s a matter of which option you expect to lead to the most productive and freedom-promoting outcome for everyone in the long term. A UBI of AI access helps ensure that people receive a directly useful productive resource, and could reduce their dependence on cash transfers once they become rich enough from using AI productively.
This scenario assumes it’s possible for superintelligence to remain an assistant, on guardrails, to each individual.
Slop is in the mind, not in the territory. I will not call slop something that I like, regardless of what other people call slop.
Reaction request: “bad source” and “good source” to use when people cite sources you deem unreliable vs. reliable.
I know I would have used the “bad source” reaction at least once.
Is anyone working on experiments that could disambiguate whether LLMs talk about consciousness because of introspection vs. “parroting of training data”? Maybe some scrubbing/ablation that would degrade performance or change the answer only if introspection was actually being used?
If LLM simulacra resemble humans but are misaligned, that doesn’t bode well for S-risk chances.
Waluigi effect also seems bad for s-risk. “Optimize for pleasure, …” → “Optimize for suffering, …”.
An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.
A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore S-risk could be large.
This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It’s a waste that they haven’t been systematically deployed for the MIRI conversations. Those conversations could have ended up being more productive, and we could’ve walked away with a succinct and precise understanding of where the disagreements are and why.
We should implement Paul Christiano’s debate game with alignment researchers instead of ML systems
If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.
But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that’s less hopeless.
The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
As a failure mode of specification gaming, agents might modify their own goals.
As a convergent instrumental goal, agents want to prevent their goals from being modified.
I think I know how to resolve this apparent contradiction, but I’d like to see other people’s opinions about it.
Why shouldn’t this work? What’s the epistemic failure mode being pointed at here?
I keep seeing absolutely terrible epistemics from like 50% of AI Safety. From people who previously seemed reasonable. This quick take was prompted by an example I just saw, from Connor Leahy: https://x.com/JoshWalkos/status/2021087240126976511