This also reminds me of the history of introspection in human psychology.
Human verbal explanations are often behaviorally predictive without accurately reflecting the underlying causal cognition. People frequently confabulate or compress complex unconscious processes into simple narratives.
NLAs may end up functioning as a kind of “folk psychology for transformers”: useful for prediction and coarse characterization, but not necessarily faithful to the actual mechanistic computation underneath.
That could still be extremely valuable, since human folk psychology is often quite predictive, but it suggests caution about treating natural-language explanations as transparent windows into model cognition.