J Bostock comments on Neel Nanda’s Shortform

J Bostock 8 Dec 2025 16:50 UTC
10 points
3
What is the alignment macro-strategy into which pragmatic interpretability fits?
I can imagine macro-strategies where ambitious interpretability bears a heavy portion of the load, e.g. retargeting the search, developing a theory of intelligence, or mapping out the states of the Garrabrant market for a transformer model (that last item uses idiosyncratic terminology).
I can also imagine ambitious interp producing actual guarantees about models, like “we can be sure that this AI is honestly reporting its beliefs”.
What’s the equivalent for pragmatic interpretability? Is it just a force multiplier for the strategies we already have?
Ambitious interp has the capability to flip the alignment game-board; I don’t see how pragmatic interpretability does.