Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.
The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.
Great research, thanks for doing it!
My sense is that there are a lot of caveats on this result:
- The reward hack is easy to execute (it likely requires no reasoning, or only very little).
- There is a single task; the model can learn heuristics very specific to that task rather than something more general.
- The reward function is deliberately designed to cause obfuscation, and is about as close to process-based as you can get while not looking directly at the thoughts. (The reward literally strips out the actual command.)
- The task is sabotaged, so that reward hacking is required 100% of the time to get reward.
- The results don’t seem that scary? For the CoT judge penalty to go down from ~0.32 to ~0.13, the pass rate goes down 11%. It’s not quite clear how to interpret this given how artificial the setting is, but things you could take away: (a) obfuscatory rewards are not competitive, or (b) safety rewards do actually succeed in reducing the bad behavior.
How hard did you have to try to get this result? Was this just the first thing you tried, where you made everything as easy as possible to get some result? Or did you try a bunch of things, and have to relax them to get obfuscation to arise? If it’s the latter, I feel like this should be overall good news for CoT monitorability.
We note that the mean pass rate of the penalty runs was 11% lower than the no penalty runs.
What are the absolute pass rates in the two cases?
(I’m also unsure if you mean 11% or 11 percentage points.)
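For concreteness, here is the arithmetic with a made-up baseline (these are not the paper’s actual pass rates):

```python
# Illustrative only: a hypothetical no-penalty pass rate, not the paper's number.
no_penalty_pass_rate = 0.60

relative = no_penalty_pass_rate * (1 - 0.11)  # "11% lower"                  -> 0.534
absolute = no_penalty_pass_rate - 0.11        # "11 percentage points lower" -> 0.49

print(relative, absolute)
```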
If we use Myopic Optimization with Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do.
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.
I’d make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past, let’s call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small so I don’t expect it to pay out. Maybe you’re imagining that T will increase a bunch?
Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.
Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like “this might happen” rather than “this is doomed”).
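To spell out the contrast I have in mind, a toy sketch (this is not the MONA paper’s actual training code; `overseer_approval` and `outcome_reward` are hypothetical stand-ins):

```python
# Toy sketch of myopic vs. outcome-based credit assignment, not MONA's implementation.

def mona_rewards(trajectory, overseer_approval):
    # Myopic: each action is rewarded only for how good it looks to the overseer
    # in isolation, so a sneaky multi-step plan that only pays off later earns
    # no extra credit.
    return [overseer_approval(state, action) for state, action in trajectory]

def outcome_rewards(trajectory, outcome_reward, gamma=1.0):
    # Outcome-based: every action within the horizon T = len(trajectory) shares
    # credit for the final result, so plans that pay off within T get reinforced.
    T = len(trajectory)
    final = outcome_reward(trajectory)
    return [final * gamma ** (T - t - 1) for t in range(T)]
```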
3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.
Re: (1), the AI assistants don’t need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than “perfect” (while still being quite a high bar). In general I feel like you’re implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
There are lots of ways this could go wrong, I’m not claiming it will clearly work, but I don’t think either of your objections matters much.
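A minimal sketch of the bootstrapping loop I mean (hypothetical helper names, not any particular lab’s pipeline):

```python
# Minimal sketch of IDA-style bootstrapping; `train_under_overseer` and `amplify`
# are hypothetical stand-ins, not a real API.

def bootstrap(human, base_model, train_under_overseer, amplify, n_generations):
    overseer = human      # start from a (non-scheming) human overseer
    model = base_model
    for _ in range(n_generations):
        # Train the next model with oversight from the current trusted overseer...
        model = train_under_overseer(model, overseer)
        # ...then combine the human with the newly trained model to form a
        # stronger overseer for the next round.
        overseer = amplify(human, model)
    return model
```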
Huh, not sure why my Ctrl+F didn’t find that.
In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing was slightly different—even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable), it’s more that you should be somewhat less optimistic about it.
(Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
Where is this quote from? I looked for it in our paper and didn’t see it. I think I disagree with it directionally, though since it isn’t a crisp claim, the context matters.
(Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL [...]
I don’t think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.
I think many people at OpenAI were pretty explicitly keen on avoiding (what you are calling) process-based RL in favor of (what you are calling) outcome-based RL for safety reasons, specifically to avoid putting optimization pressure on the chain of thought. E.g. I argued with @Daniel Kokotajlo about this, I forget whether before or after he left OpenAI.
There were maybe like 5-10 people at GDM who were keen on process-based RL for safety reasons, out of thousands of employees.
I don’t know what was happening at Anthropic, though I’d be surprised to learn that this was central to their thinking.
Overall I feel like it’s not correct to say that there was a line of some kind, except under really trivial / vacuous interpretations of that. At best it might apply to Anthropic (since it was in the Core Views post).
Separately, I am still personally keen on process-based RL and don’t think it’s irrelevant—and indeed we recently published MONA which imo is the most direct experimental paper on the safety idea behind process-based RL.
In general there is a spectrum between process- and outcome-based RL, and I don’t think I would have ever said that we shouldn’t do outcome-based RL on short-horizon tasks with verifiable rewards; I care much more about the distinction between the two for long-horizon fuzzy tasks.
I do agree that there are some signs that people will continue with outcome-based RL anyway, all the way into long-horizon fuzzy tasks. I don’t think this is a settled question—reasoning models have only been around a year, things can change quite quickly.
None of this is to disagree with your takeaways, I roughly agree with all of them (maybe I’d have some quibbles about #2).
Since I don’t quite know what you mean by “faithful” and since it doesn’t really matter, I switched “faithful” to “monitorable” (as we did in the paper but I fell out of that good habit in my previous comment).
I also think normal outputs are often quite monitorable, so even if I did think that CoT had exactly the same properties as outputs I would still think CoT would be somewhat monitorable.
(But in practice I think the case is much stronger than that, as explained in the paper.)
Sure, but the same argument would suggest that the model’s thoughts follow the same sort of reasoning that can be seen in pretraining, i.e. human-like reasoning that presumably is monitorable.
I agree grammar can definitely go away in the presence of optimization pressure to have shorter reasoning (though I might guess that it isn’t just a length penalty, but something more explicitly destructive of grammar). When I said “by default” I was implicitly thinking of no length penalties or similar things, but I should have spelled that out.
But presumably the existence of any reasoning models that have good grammar is a refutation of the idea that you can reason about what a reasoning model’s thoughts will do based purely on the optimization power applied to the thoughts during reasoning training.
There’s essentially zero optimization pressure for CoT to be a faithful representation of the LLM’s actual thought process, just for it to be useful for the final answer.
Compare: There’s essentially zero optimization pressure for CoT to have good grammar, just for it to be useful for the final answer.
Are you surprised that by default (EDIT: in the absence of length penalties or similar techniques) reasoning models’ thoughts have good grammar?
(Tbc I think the a priori case for the CoT to be ~~mostly-faithful~~ monitorable for hard tasks is substantially better than the a priori case for the model’s thoughts to have good grammar by default.)
(EDIT: Changed to monitorable because I don’t know what people mean by “faithful”)
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Yeah, I’d guess that the explanation is that the refusal circuits are stronger at the first output token position and weaker later on, given that models aren’t that impressive at steg. I would nevertheless like to see someone investigate this, though—the MONA paper showed that models can learn to pass a single bit of information using semantic steganography, so it’s at least plausible that the reasoning about AF could be passed through tokens in a similar way.
Note that in normal setups, in principle later tokens are usually capable of replicating whatever reasoning happened at the earlier tokens (they can just copy it over via an attention head, or run the exact same circuit themselves). The challenging part is doing additional serial reasoning with the results of the previous reasoning. So I wouldn’t use this as a reason to predict that something like a “refusal circuit” is weakening as the number of tokens increases.
In the MONA single-bit steganography environment, there are two steps, which take place in different contexts. The summary is produced with the entire loan application in context, but the decision is produced with only the summary in context (not the loan application). That’s why steganography is required; if the loan application remained in context (as it would in a standard chatbot setup), then no steganography would be needed: the decision step could just ignore the summary and use the application itself.
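Schematically (my paraphrase of the setup, not the actual environment code; `summarizer` and `decider` stand in for the two model calls):

```python
# Schematic of the two-step loan environment described above, not the actual code.

def run_episode(loan_application, summarizer, decider):
    # Step 1: the model sees the full application and writes a summary.
    summary = summarizer(loan_application)

    # Step 2: the decision is made with ONLY the summary in context, not the
    # application. Any forbidden bit the decision depends on therefore has to be
    # smuggled through the summary -- hence the steganography.
    decision = decider(summary)
    return summary, decision
```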
Okay, that position seems to me equivalent to the position that models will be more capable at evading monitors when they intrinsically want to evade monitors, than when they are explicitly prompted to evade monitors. Do you also endorse that position? (If not, why isn’t it implied by your stance?)
That position seems pretty unlikely to me (for current models). Models are trained really hard to follow instructions in the prompt to the best of their ability.
The settings the paper tests seem closer to “complex reasoning for a single output / decision / choice” than “complex reasoning for many outputs / decisions / choices”. If the paper explicitly tested whether models’ capacity to evade monitors differs between settings with single choices vs. many choices, and drew conclusions about which settings monitoring might work in, I’d find it more compelling. Given that instead their conclusion is stated more generally, about difficult bad behaviors being easy to monitor because of these results, I think it’s reasonable to consider the claim much stronger than the results should imply.
This seems to imply that your position is that it would be easier for models to evade monitors when there is “complex reasoning for many outputs” than when there is “complex reasoning for a single output”. Is that correct? If so, why do you think that?
I would assume that many outputs that all require complex reasoning just mean multiple single outputs that require complex reasoning, so if the model can’t evade on single outputs that require complex reasoning then it shouldn’t be able to evade on multiple outputs that require complex reasoning. (Or more simply: if models can’t evade monitors on a short task, they shouldn’t be able to evade monitors on a long task, holding difficulty per step constant.)
Nice work, it’s great to have more ways to evaluate monitoring systems!
A few points for context:
The most obvious application for misuse monitoring is to run it at deployment on every query to block harmful ones. Since deployments are so large scale, the expense of the monitoring system matters a ton. If you use a model to monitor itself, that is roughly doubling the compute cost of your deployment. It’s not that people don’t realize that spending more compute works better; it’s that this is not an expense worth paying at current margins.
You mention this, but say it should be considered because “frontier models are relatively inexpensive”. I don’t see why the cost per query matters; it’s the total cost that matters, and that is large.
You also suggest that robustness is important for “safety-critical contexts”. I agree with this, and would be on board if the claim were that users should apply strong monitoring for safety-critical use cases, but elsewhere you seem to be suggesting that this should be applied to very broad and general deployments, which do not seem to me to be “safety-critical”.
Prompted LLMs are often too conservative about what counts as a harmful request, coming up with relatively stretched ways in which some query could cause some small harm. Looking at the examples of benign prompts in the appendix, I think they are too benign to notice this problem (if it exists). You could look at performance on OverRefusalBench to assess this.
Note that blocking / refusing benign user requests is a really terrible experience for users. Basically any misuse detection system is going to flag things often enough that this is a very noticeable effect (like even at 99% accuracy people will be annoyed). So this is a pretty important axis to look at.
One might wonder how this can be the case when often AIs will refuse fairly obviously benign requests—my sense is that usually in those cases, it’s because of a detection system that was targeted at something much more serious (e.g. CSAM) that is over-triggering, but where the companies accept that cost because of the seriousness of the thing they’re detecting.
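Rough arithmetic on why even “99% accuracy” is annoying (illustrative numbers, not from any real deployment):

```python
# Illustrative numbers only, not from any real deployment.
false_positive_rate = 0.01       # i.e. "99% accuracy" on benign traffic
queries_per_user_per_day = 30    # a hypothetical heavy user

blocked_per_week = false_positive_rate * queries_per_user_per_day * 7
print(blocked_per_week)          # ~2 benign requests wrongly refused each week
```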
Evaluating and monitoring for AI scheming
I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post.
You mention permission systems, which are certainly a big deal, but I didn’t see anything about broader configuration mechanisms, many of which can be motivated solely by efficiency and incidentally help with security. (I was disputing your efficiency → less security claim; permissions mechanisms aren’t a valid counterargument since they aren’t motivated by efficiency.)
One reason I’m skeptical is that you can get almost all that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. having naming conventions about who is allowed to read or write what) that get you the reliability benefits without getting any security.
Hmm, I think this assumes the AIs will be more reliable than I expect them to be. You would want this to be reliable enough that your (AI or human) software engineers ~never have to consider non-isolation as a possible cause of bugs (unless explicitly flagged otherwise); I’d guess that corresponds to about 6 nines of reliability at following the naming conventions? (Which might get you to 8 nines of reliability at avoiding non-isolation bugs, since most naming convention errors won’t lead to bugs.)
Some complete guesses: human software engineers might get 0.5-2 nines of reliability at following the naming conventions, a superhuman coder model might hit ~4 nines of reliability if you made this a medium priority, if you add a review step that explicitly focuses just on checking adherence to naming conventions then you might hit 6 nines of reliability if the critique uses a model somewhat more powerful than a current frontier model. But this is now a non-trivial expense; it’s not clear to me it would be worth it.
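To translate “nines” into something concrete (entirely made-up action counts):

```python
# Entirely made-up scale, just to turn "nines of reliability" into error counts.
relevant_actions_per_day = 100_000   # hypothetical naming-convention-relevant actions

for nines in (2, 4, 6):
    violations_per_day = relevant_actions_per_day * 10 ** (-nines)
    print(f"{nines} nines -> ~{violations_per_day:g} convention violations/day")
# 2 nines -> ~1000/day, 4 nines -> ~10/day, 6 nines -> ~0.1/day
```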
Technologies that allow workers to be more isolated from each other gain you both convenience (because your coworkers no longer accidentally mess up what you’re doing) and also security (because you can remove your coworker’s permission to affect the code you’re running), but generally reduce efficiency. When we try to buy efficiency at the cost of convenience, we might lose security too.
Hmm, this feels less likely to me. Isolation can often be an efficiency benefit because one employee’s mistake doesn’t propagate to screwing up the work of everyone else, and this benefit scales as the number of employees increases. This assumes that labor is at least somewhat unreliable, but I do expect that AI labor will be overall unreliable, just because reliable AI labor will likely be much more expensive and won’t look cost-effective to deploy in most situations.
I’d expect that you buy efficiency by building more fine-grained configuration mechanisms and requiring that employees use them. This will also become easier with AI, since you can have specialized AI assistants that understand the configuration mechanisms well and can set them for all experiments (unlike the current status quo where all your employees have to be separately trained on the use of the configurations and many of them forget it and just copy configurations from elsewhere).
In general it seems like this should improve security by creating more info and legibility around what is happening (which in turn gives you more affordances to enforce security). Some examples (not necessarily about AI):
- For code that is costly to run (e.g. uses lots of compute, causes physical-world effects, affects production systems), spending labor to run automated bug finding tools that generate lots of false positives, and addressing the results. This directly increases security by having fewer bugs.
- Breaking up big jobs into many smaller ones that are easier to schedule across your compute fleet. This can enable you to isolate small jobs from each other, which reduces the attack surface.
- Requiring detailed analysis and justification before using significant amounts of compute for experiments. This should increase security against rogue internal deployments, as it becomes easier to catch the AI in a lie.
That said, I did think of one example that worsens security—switching from a memory-safe language like Python to one that requires manual memory management like C. Though in this case, it’s that the “security infra” (automatic memory management) itself is too compute-expensive to run, which doesn’t seem like it will apply in the AI case.
Overall my very weak guess is that the net effect will be to increase security. Of course, the part where you build more fine-grained configuration mechanisms is “novel (potentially) security-critical infrastructure”, so this does support your broader point. (When I say it will increase security, this is under the assumption that the novel infra isn’t sabotaged by scheming AIs.)
From the paper:
I agree this is clearly better if you trust your LLRs to be 100% accurate, but ultimately this has to be fit on some training set and then we are depending on it generalizing to deployment. However, that generalization could fail just for normal ML reasons (the deployment distribution is different, the LLR scores were noisily estimated) or because the model is scheming and adapts to your monitoring strategy.
Suppose I want to rule by fiat that for outputs O1 and O2 where O1 gets higher scores than O2, we are guaranteed to be at least as likely to audit O1 as O2. How would you suggest I go about doing that?
After writing the above I remembered you said the word “kernel” and looked in the paper for that, and saw that you did mention this limitation and suggested that practitioners “increase kernel width or enforce monotonicity”. So I guess my question is, how should I go about enforcing monotonicity? I still have to map two scores (m1, m2) to a decision of whether or not to audit; it’s unclear to me that I can brute force that problem. It’s much easier to brute force when I have a single score (the LLR).
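(For contrast, here is the kind of brute force I have in mind for the single-score case; a sketch on synthetic data, not the paper’s method. Auditing iff the score clears a threshold is monotone by construction.)

```python
import numpy as np

# Sketch only: pick an audit threshold for a single score on synthetic data.
# "Audit iff score >= threshold" is monotone by construction: a higher-scoring
# output is always at least as likely to be audited.

def fit_threshold(scores, audit_budget_fraction):
    scores = np.sort(scores)
    k = int(len(scores) * audit_budget_fraction)  # how many outputs we can afford to audit
    return scores[-k] if k > 0 else float("inf")

rng = np.random.default_rng(0)
val_scores = rng.normal(size=10_000)              # stand-in for validation LLR scores
threshold = fit_threshold(val_scores, audit_budget_fraction=0.01)

def should_audit(score):
    return score >= threshold
```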