The first two paragraphs of my original comment were trying to do this.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
Fwiw, I am actively surprised that you have a p(doom) < 50%; I can name several lines of evidence pointing in the opposite direction:
You’ve previously tried to define alignment based on a worst-case focus and a scientific approach. This suggests you believe that “marginalist” / “engineering” approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
I still find the conjunction of the two positions you hold pretty weird.
I’m a strong believer in logistic success curves for complex situations. If you’re in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like “engineering” approaches should work.
It’s certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on “near-guaranteed fine by default” and 30% on “near-guaranteed doom by default”. Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we’ll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone’s actual views. (Idk maybe you do believe the second one.)
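To make the logistic-curve intuition concrete, here is a minimal toy sketch (my own illustration, not anything from your writing; the variable x standing in for "aggregate quality of effort" is my assumption): if P(success) is a logistic function of aggregate effort, then the marginal value of extra effort is P(1-P), which peaks in the middle of the curve and is tiny at either tail.

```python
import math

def p_success(x: float) -> float:
    """Probability of success as a logistic function of aggregate effort x (toy model)."""
    return 1 / (1 + math.exp(-x))

# In the middle of the curve the derivative p*(1-p) is large, so marginal
# ("engineering") improvements buy a lot; near either tail it is tiny.
for x in (-4.0, -2.0, 0.0, 2.0, 4.0):
    p = p_success(x)
    print(f"effort={x:+.1f}  P(success)={p:.2f}  marginal gain per unit effort={p * (1 - p):.3f}")
```

Running this, the marginal gain is about 0.25 at P = 0.5 versus about 0.018 at P ≈ 0.98, which is the sense in which "engineering" effort should matter most when you're in the middle of the curve.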
Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks “engineering” approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) “that’s not what progress on a hard problem looks like”. (I failed to dig up the citation though.)
You’re generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it’s long enough ago that I could easily be forgetting things.)
However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I’d still be on board with the claim that there’s at least a 10% chance that it will make things worse, which I might summarize as “they won’t reliably improve things” -- so I still feel like this isn’t quite capturing the distinction. (I’d include communities focused on “science” in that, but I do agree that they are less likely to have a negative sign.) So I still feel confused about what exactly your position is.