One possible limitation of NLAs is that they are constrained by human motivational language and the conceptual assumptions embedded in their training data.
If the system is trained to explain activations in natural language, it may preferentially compress unfamiliar internal structures into familiar human-style motivational narratives such as “deception,” “wanting power,” or “fear of shutdown,” even when the underlying computation is something much stranger or more mechanistic.
In other words, the explanations may partly reflect the ontology of interpretability researchers and human discourse, not just the ontology of the model itself.
This doesn’t make NLAs useless. Explanations that merely correlate with the underlying computation can still be behaviorally predictive. But it may mean that language-based interpretability systematically anthropomorphizes internal cognition and could fail to surface genuinely alien optimization structures that don’t map cleanly onto human motivational concepts.
This also reminds me of the history of introspection in human psychology.
Human verbal explanations are often behaviorally predictive without accurately reflecting the underlying causal cognition. People frequently confabulate or compress complex unconscious processes into simple narratives.
NLAs may end up functioning as a kind of “folk psychology for transformers”: useful for prediction and coarse characterization, but not necessarily faithful to the actual mechanistic computation underneath.
That could still be extremely valuable (human folk psychology is often quite predictive), but it suggests caution about treating natural-language explanations as transparent windows into model cognition.