Makes sense; I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say:
Imo the vast, vast majority of progress in the world happens via “engineering-type / category 2” approaches, so if you do think you can win via “engineering-type / category 2” approaches, you should generally bias towards them
while also noting that the way we are using the phrase “engineering-type” here includes a really large amount of what most people would call “science” (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words “engineering” and “science” in context rather than via their usual connotations.
Yepp, makes sense, and it’s a good reminder for me to be careful about how I use these terms.
One clarification I’d make to your original comment, though, is that I don’t endorse “you have to deeply understand intelligence from first principles or else everyone dies”. My position is closer to “you have to be trying to do something principled in order for your contribution to be robustly positive”. Relatedly, agent foundations and mech-interp are approximately the only two parts of AI safety that seem robustly good to me; with a bunch of other stuff, like RLHF, or evals, or (almost all) governance work, I feel pretty confused about whether they’re good, bad, or basically just a wash, even in expectation.
This is still consistent with risk potentially being reduced by what I call engineering-type work; it’s just that, IMO, that involves us “getting lucky” in an important way, which I’d prefer we not rely on. (And trying to get lucky isn’t a neutral action: engineering-type work can also easily have harmful effects.)
Fair; I’ve edited the comment with a pointer. That position still seems to me to be a pretty direct disagreement with “we can substantially reduce risk via [engineering-type / category 2] approaches”.
My claim is “while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction”.
Your claim in opposition seems to be “who knows what the sign is; we should treat it as zero expected risk reduction”.
Though possibly you are saying “it’s bad to take actions that have a chance of backfiring; we should focus much more on robustly positive things” (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.
I still want to claim that, in either case, my position is much more common among the readership here, except insofar as people disagree because they think alignment is very hard, and that this is why the expected risk reduction is zero (or negative). And so I wish you’d flag when your claims depend on those takes (though I realize it is often hard to notice when that is the case).
I expect it’s not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don’t reliably work. I do agree that my combination of “marginalist approaches don’t reliably improve things” and “P(doom) is <50%” is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.
(Interestingly, while the combination I describe above is a rare one, the converse is also rare—Daniel Kokotajlo is the only person who comes to mind who disagrees with me on both of these propositions simultaneously. Note that he doesn’t characterize his current work as marginalist, but even aside from that question I think this characterization of him is accurate—e.g. he has talked to me about how changing the CEO of a given AI lab could swing his P(doom) by double digit percentage points.)
On reflection, it’s not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux. Perhaps you just don’t care about that and are instead trying to influence readers without engaging with the OP’s point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective epistemics, but I admit it’s within LW norms.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux.
The first two paragraphs of my original comment were trying to do this. The rest wasn’t. I flagged this in the sentence “The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it.” However, I should have been clearer about the distinction. I’ve now added the following:
EDIT: to be more clear: the rest of this comment is not primarily about Neel or “pragmatic interpretability”, it’s about parts of the field that I consider to be significantly less relevant to “solving alignment” than that (though work that’s nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
Reflecting further, I think there are two parts of our earlier exchange that are a bit suspicious. The first is when I say that everyone seems to have “given up” (rather than something more nuanced like “given up on tackling the most fundamental aspects of the problem”). The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
So what’s going on here? It feels like we’re both being “anchored” by extreme positions. You were rounding me off to doomerism, and I was rounding the marginalists off to “giving up”. I’d guess both are artifacts of writing quickly and somewhat frustratedly. Probably I should write a full post or shortform that characterizes more precisely what “giving up” is trying to point to.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
My instinctive reaction is that this depends a lot on whether by “marginalist approaches” we mean something closer to “a single marginalist approach” or “the set of all people pursuing marginalist approaches”. I think we both agree that no single marginalist approach (e.g. investigating a given technique) makes reliable progress. However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
The first two paragraphs of my original comment were trying to do this.
(I have the same critique of the first two paragraphs, but thanks for the edit; it helps.)
The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
Fwiw, I am actively surprised that you have a p(doom) < 50%; I can name several lines of evidence in the opposite direction:
You’ve previously tried to define alignment in terms of a worst-case focus and a scientific approach. This suggests you believe that “marginalist” / “engineering” approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
I still find the conjunction of the two positions you hold pretty weird.
I’m a strong believer in logistic success curves for complex situations. If you’re in the middle part of such a curve, then there should be many things that can be done to improve the situation, and it seems like “engineering” approaches should work.
It’s certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on “near-guaranteed fine by default” and 30% on “near-guaranteed doom by default” (there’s a minimal numeric sketch of this below). Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we’ll do the necessary science in time. But these views seem kinda exotic and unlikely to be anyone’s actual views. (Idk, maybe you do believe the second one.)
Obviously I had not thought through this in detail when I originally wrote my comment, and my unarticulated inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks “engineering” approaches are near-useless will likely also have a high p(doom): not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take it as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) “that’s not what progress on a hard problem looks like”. (I failed to dig up the citation, though.)
You’re generally pursuing a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing has a similar flavor to that of other people who have high p(doom).
In terms of evidence that you have a p(doom) < 50%, the main thing that comes to mind is that you argued against Eliezer about this in late 2021. But that was quite a while ago (relative to the evidence above), and I thought you had changed your mind. (Also, iirc, the stuff you said then was consistent with p(doom) ~ 50%, but it’s long enough ago that I could easily be forgetting things.)
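Going back to the logistic-curve point: here’s a minimal numeric sketch of that intuition (the “position” variable and all numbers are purely illustrative assumptions, not a model of anyone’s actual estimates). On the steep middle of a logistic curve a small marginal improvement moves the success probability noticeably; on the flat ends, or under a bimodal mixture of near-certain outcomes, the same improvement barely moves it.

```python
# Minimal illustrative sketch of the logistic-success-curve intuition.
# "x" is an abstract "how well positioned we are" score; the numbers are
# purely illustrative, not anyone's actual estimates.
import math

def p_success(x: float) -> float:
    """Logistic success probability as a function of position x."""
    return 1.0 / (1.0 + math.exp(-x))

delta = 0.1  # a small marginal improvement in position

# The derivative of the logistic is p * (1 - p): steepest at p = 0.5,
# nearly flat when p is close to 0 or 1.
for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    p = p_success(x)
    gain = p_success(x + delta) - p
    print(f"p = {p:.3f}  ->  gain from marginal improvement = {gain:.4f}")

# Bimodal version: 70% of worlds are "near-guaranteed fine by default" and
# 30% are "near-guaranteed doom by default". Both modes sit on flat parts of
# the curve, so the same marginal improvement barely moves the overall
# success probability.
x_fine, x_doom = 4.0, -4.0
before = 0.7 * p_success(x_fine) + 0.3 * p_success(x_doom)
after = 0.7 * p_success(x_fine + delta) + 0.3 * p_success(x_doom + delta)
print(f"bimodal mixture: {before:.4f} -> {after:.4f} (gain {after - before:.4f})")
```

(In the middle-of-the-curve case the gain from the same marginal improvement is roughly an order of magnitude larger than at the extremes, which is the sense in which “there should be many things that can be done to improve the situation”.)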
However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I’d still be on board with the claim that there’s at least a 10% chance that it will make things worse, which I might summarize as “they won’t reliably improve things”, so I still feel like this isn’t quite capturing the distinction. (I’d include communities focused on “science” in that, but I do agree that they are less likely to end up net negative.) So I still feel confused about what exactly your position is.
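To make that distinction concrete, here’s a minimal numeric sketch (the numbers are purely illustrative, not actual estimates): an action can have a meaningful chance of backfiring, and so not “reliably improve things”, while still being positive in expectation.

```python
# Minimal illustrative sketch of "positive in expectation but not reliable".
# All numbers are purely illustrative, not anyone's actual estimates.
p_helps, p_backfires = 0.9, 0.1
risk_change_if_helps = -0.02      # reduces p(doom) by 2 percentage points
risk_change_if_backfires = +0.05  # increases p(doom) by 5 percentage points

expected_change = p_helps * risk_change_if_helps + p_backfires * risk_change_if_backfires
print(f"expected change in risk: {expected_change:+.3f}")
# -> -0.013: positive expected risk reduction overall, despite a 10% chance
#    of making things worse.
```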