I don’t know what you’re trying to say here. Some set of important people write the spec. Then the alignment team RLHFs the models to follow the spec. If we imagine this process continuing, then either:
Sam has to put “make Sam god-emperor” in the spec, a public document.
Or Sam has to start a conspiracy with the alignment team and everyone else involved in the RLHF and testing process to publish one spec publicly, but secretly align the AI to another.
I’m claiming both of those options are hard.
(I do think in the future there may be some kind of automated pipeline, such that someone feeds the spec to the AIs and some other AIs take care of the process of aligning the new AIs to it, but that just regresses the problem.)
Very interesting, thank you.
Please excuse my technical ignorance, but is it possible to expand an existing AI model? That is, instead of training Opus 5 from scratch, could Anthropic use those same computational resources to gradually add more parameters to Opus 3, making it bigger and smarter while continuing to exploit its existing attractor basin?
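For what it’s worth, to make the question concrete: I believe there is published work in roughly this direction, e.g. Net2Net (Chen et al., 2015), which describes “function-preserving” transformations that widen or deepen a trained network so the bigger network initially computes exactly the same function, and training then resumes from there. Here’s a rough numpy sketch of the widening trick on a toy two-layer MLP — purely illustrative, with made-up names and shapes, not anyone’s actual training pipeline:

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, rng=None):
    """Widen a hidden layer without changing the network's function
    (Net2Net-style widening). Shapes:
      W1: (hidden, d_in)   weights into the hidden layer
      b1: (hidden,)        hidden-layer biases
      W2: (d_out, hidden)  weights out of the hidden layer
    """
    if rng is None:
        rng = np.random.default_rng(0)
    hidden = W1.shape[0]
    assert new_width > hidden, "can only grow the layer"
    # Each new unit is a copy of a randomly chosen existing unit.
    mapping = np.concatenate(
        [np.arange(hidden), rng.integers(0, hidden, new_width - hidden)]
    )
    counts = np.bincount(mapping, minlength=hidden)  # copies per original unit
    W1_new = W1[mapping]  # duplicate incoming weights and biases
    b1_new = b1[mapping]
    # Split each outgoing weight across its copies so their sum is unchanged.
    W2_new = W2[:, mapping] / counts[mapping]
    return W1_new, b1_new, W2_new

# Sanity check: outputs match exactly before and after widening.
rng = np.random.default_rng(42)
W1, b1, W2 = rng.normal(size=(8, 4)), rng.normal(size=8), rng.normal(size=(3, 8))
x = rng.normal(size=4)
relu = lambda z: np.maximum(z, 0.0)
before = W2 @ relu(W1 @ x + b1)
W1w, b1w, W2w = net2wider(W1, b1, W2, new_width=12, rng=rng)
after = W2w @ relu(W1w @ x + b1w)
assert np.allclose(before, after)
```

The point is that the widened network starts out computing exactly what the old one did, so further training continues from the old model’s basin rather than from scratch — whether anything like this scales to frontier-size models is exactly what I’m asking.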