I’m considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?
What is the alignment macro-strategy into which pragmatic interpretability fits?
I can imagine macro-strategies where ambitious interpretability bears a heavy portion of the load, e.g. retargeting the search; developing a theory of intelligence; mapping out the states of the Garrabrant market for a transformer model (apologies for the idiosyncratic terminology in that last clause).
I can also imagine ambitious interp as producing actual guarantees about models, like “we can be sure that this AI is honestly reporting its beliefs”.
What’s the equivalent for pragmatic interpretability? Is it just a force multiplier to the existing strategies we have?
Ambitious interp has the potential to flip the alignment game-board; I don’t see how pragmatic interpretability does.
Should everyone do pragmatic interpretability, or are pragmatic interp and curiosity-driven basic science complementary? What should people do who are highly motivated by the curiosity frame and have found success using it?
I suspect there may be demand for a concrete open-problems post. These kinds of lists tend to be popular, and the examples could help people who are picking projects to work on.
How would you respond to Leo Gao’s recent post?