Haha, I was about to post a comment much like balioc’s when I first read you writing rather descriptively, and without much qualification, about how the LM models “speculative interior states” and “actions”. Then I thought through pretty much exactly what you wrote in reply and decided you probably meant it as a human mental model rather than a statement about interpretability.
Though I think point 2 (the intentional stance again – except this time applied to the language model) still understates how imperfect the mental model is. In chess, “‘Oh, they probably know I’m planning to do that,’ and such things” are rather amateur things to think about; better players actually use completely impersonal mental models that depend only on the game state, since there’s perfect information and you can’t rely on your opponent making mistakes. Even in an imperfect-information game like poker, experienced players model the game as an impersonal probabilistic system, with terms like “bluffing” just shorthand for deviations from a statistical baseline (like GTO play).
I suspect there will be analogous shifts in how we think about LLMs, and about other things we currently model from the intentional stance for lack of better alternatives. But as you say, an internalities-based model is probably close to the best we can do for now, and it’s quite possible that any alternative future mental models wouldn’t even be as intuitively usable as empathy is (at least without a ton of practice).