Thank you to Neel for writing this. Most people pivot quietly.
I’ve been highly skeptical of mechanistic interpretability for years. I excluded interpretability from Unsolved Problems in ML Safety for this reason. Other fields, like d/acc (Systemic Safety), were included though, as far back as 2021.
Here are some earlier criticisms: https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Transparency
More recent commentary: https://ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability
I think the community should reflect on its genius worship culture (in the case of Olah, a close friend of the inner circle) and epistemics: the approach was so dominant for years, and I think this outcome was entirely foreseeable.
[ edit: I’m pretty surprised by how much people disagree with this comment. I’d appreciate it if those who disagree could use the ✓ or 🗶 reacts to indicate what they disagree with. ]
I think the community should reflect on its genius worship culture
I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid. However, I believe hero worship plays an important role in building and modelling one’s own personality and motivation. I think it is psychologically healthy to worship heroes, but it is probably better done by thinking of characters we aspire to be like, rather than by focusing too much on popular individuals and their ideas.
The critique also likely extends past people to ideas in general. I feel EA has done well by popularizing the concept of “neglectedness” for cause areas. The idea of trying to find and explore neglected areas seems worthwhile, though it must be balanced against the need for a sufficiently well-established shared context.
I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid.
I want to flag here that the version of great man theory that was debunked by modern sociology is the claim that big impacts on the world are always, or almost always, caused by great men, not the claim that great men can’t have big impacts on the world.
For what it’s worth, I actually disagree with this view, and think that one of the bigger things LW gets right is that people’s impact in a lot of domains is pretty heavy-tailed, and certain things matter way more than others under their utility function.
I do agree that people can round the impact of rare geniuses off to infinity, and there is a point to be made about LWers overvaluing theory- and curiosity-driven work compared to just using simple baselines and doing what works (a critique I agree with). But the appreciation of heavy-tailed impact is one of the things I most value about LW, and while problems do stem from it, I think it’s important not to damage that appreciation too much in the course of solving them (assuming the heavy-tailed hypothesis is true, which I largely believe).
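As an aside, to make “heavy-tailed impact” concrete, here is a minimal illustrative sketch. The Pareto tail index and the thin-tailed baseline below are assumptions chosen purely for illustration, not estimates of any real impact distribution; the point is only to show how much of the total the top 1% can account for under a heavy-tailed versus a thin-tailed assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative assumption: per-person "impact" drawn from a heavy-tailed
# Pareto distribution, contrasted with a thin-tailed normal baseline.
heavy = rng.pareto(a=1.2, size=n) + 1                   # heavy-tailed draw
thin = np.abs(rng.normal(loc=1.0, scale=0.3, size=n))   # thin-tailed baseline

def top_share(x, frac=0.01):
    """Fraction of total impact contributed by the top `frac` of people."""
    k = int(len(x) * frac)
    return np.sort(x)[-k:].sum() / x.sum()

print(f"Heavy-tailed: top 1% contribute {top_share(heavy):.0%} of the total")
print(f"Thin-tailed:  top 1% contribute {top_share(thin):.0%} of the total")
```

With the heavy-tailed draw, the top 1% typically account for a large fraction of the total (the exact figure depends entirely on the assumed tail index, which is the contested empirical question), whereas with the thin-tailed baseline they contribute only slightly more than 1%.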
Thanks! I think this is a valuable clarification. I should maybe have flagged that while I value sociology, I’m not deeply experienced in it.
From my own thinking, it seems there are two interesting properties here: (A) the counterfactual history supposing that some person did or did not exist, and the extent to which history is significantly changed, and (B) the amount that someone’s name gets attached to things and things get attributed to them.
So A is “great” and B is “well known”...
¬A¬B is just the normal case of a person living a modest life and not having a significant impact on history. Nothing to see here.
AB is the great man from great man theory. They do great things and everyone recognizes it!
A¬B is when someone has a large impact but does not become well known for it. I suspect this does happen: probably rarely with the person remaining completely unknown, but quite possibly with impact far larger than their renown.
¬AB is when someone becomes known for something without actually having impacted history much. I suspect this is common when paradigm shifts are inevitable but need something to precipitate around; this could be as much as needing someone to “champion” the paradigm shift, or as simple as needing a label to put on a commonly expressed idea and happening to use the name of someone who at some point expressed it.
If I had to guess, I would suspect that ¬AB is more common than AB, that A¬B is less common than either, and that ¬A¬B is massively more common than all the others, though of course these are not Boolean categories but gradients. If you (or anyone) know of this distribution having been studied, I would be interested to hear about it.
Another nearby idea is that A-type people can have large impacts because of where they were born in society. For example, Alexander the Great wouldn’t have been nearly so great had he not been the son of the king of Macedon. I think this is actually a pretty different question, though. “Did Alex have a large impact?” is a different question from “How was Alex able to have such a large impact?”. He couldn’t have commanded armies if he had had no armies to command, but if nobody else would have commanded them to do what he did, then he still had a large impact. So this idea seems more about whether societies can be seen as meritocratic versus nepotistic or cronyistic. Worthwhile to know, but very different.
I’m also interested in the motivations people have for wondering about this. I could see some using it to argue that status inequality is greater than is justified, and while I think this is probably true, I don’t really care that much. More interesting to me is the question “how should one attempt to influence the world for the better?”. In an A-heavy world, it makes sense to try to be great. In a ¬A-heavy world, it makes more sense to try to find the movements, projects, and communities that are going to have a positive impact and contribute to them.
It is probably important to have people trying both approaches (and combinations of them), but I feel I naturally lean heavily toward A-heavy individualism, thinking highly of my own ideas while undervaluing other people’s ideas, so for me, thinking of the world as ¬A-heavy feels like it helps me compensate and focus more on collaboration. However, having put this into words, I would probably be better off trying to believe whatever is true, and trying to promote collaboration within myself, if that is helpful, regardless. So again, if anyone can point me at actual studies of this, that would be pretty cool!