I’m considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?
What is the alignment macro-strategy into which pragmatic interpretability fits?
I can imagine macro-strategies where ambitious interpretability bears a heavy portion of the load, e.g. retargeting the search; developing a theory of intelligence; mapping out the states of the Garrabrant market for a transformer model (apologies for the idiosyncratic terminology in that last clause).
I can also imagine ambitious interp as producing actual guarantees about models, like “we can be sure that this AI is honestly reporting its beliefs”.
What’s the equivalent for pragmatic interpretability? Is it just a force multiplier to the existing strategies we have?
Ambitious interp has the potential to flip the alignment game-board; I don’t see how pragmatic interpretability does.
Should everyone do pragmatic interpretability, or are pragmatic interp and curiosity-driven basic science complementary? What should people do who are highly motivated by the curiosity frame and have found success using it?
I suspect there may be demand for a concrete open-problems post. These kinds of lists tend to be popular, and the examples could help people who are picking projects to work on.
How would you respond to Leo Gao’s recent post?