“The structure of synchronization is, in general, richer than the world model itself. In this sense, LLMs learn more than a world model” given that I expect this is the statement that will catch a lot of people’s attention.
Just in case this claim caught anyone else’s attention, what they mean by this is that it contains:
• A model of the world
• A model of the agent’s process for updating its belief about which state the world is in
I am trying to wrap my head around the high-level implications of this statement. I can come up with two interpretations:
1. What LLMs are doing is similar to what people do as they go about their day. When I walk down the street, I am simultaneously using visual and other input to assess the state of the world around me (“that looks like a car”), running a world model based on that assessment (“the car is coming this way”), and then using some other internal mechanism to decide what to do (“I’d better move to the sidewalk”).
2. What LLMs are doing is harder than what people do. When I converse with someone, I have some internal state, and I run some process in my head – based on that state – to generate my side of the conversation. When an LLM converses with someone, instead of maintaining a single internal state, it needs to maintain a probability distribution over possible states, make next-token predictions according to that distribution, and simultaneously update the distribution.
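To make the second interpretation concrete for myself, here is a minimal sketch of the loop it describes: keep a probability distribution over hidden states, predict the next token from it, then update it on the token actually seen. The two-state generator, its transition and emission numbers, and the function names are all made up for illustration. This is a toy hidden Markov model filter, not a claim about what actually happens inside a transformer.

```python
import math

# Hypothetical two-state generator; every number here is invented for illustration.
# TRANSITION[s][t] = probability of moving from hidden state s to hidden state t.
TRANSITION = [[0.9, 0.1], [0.1, 0.9]]
# EMISSION[s][token] = probability that hidden state s emits the given token.
EMISSION = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]


def predict_next(belief):
    """Next-token distribution implied by the current belief over hidden states."""
    return {
        tok: sum(b * EMISSION[s][tok] for s, b in enumerate(belief))
        for tok in EMISSION[0]
    }


def update(belief, token):
    """Bayesian update: condition on the observed token, then apply the transition."""
    weighted = [b * EMISSION[s][token] for s, b in enumerate(belief)]
    total = sum(weighted)
    posterior = [w / total for w in weighted]
    n = len(belief)
    return [sum(posterior[s] * TRANSITION[s][t] for s in range(n)) for t in range(n)]


def entropy_bits(belief):
    """Uncertainty about the hidden state, in bits; lower means more synchronized."""
    return -sum(b * math.log2(b) for b in belief if b > 0)


belief = [0.5, 0.5]  # start with no idea which state the generator is in
for token in "aaab":
    print(token, predict_next(belief), round(entropy_bits(belief), 3))
    belief = update(belief, token)
```

The point of the sketch is only that the observer carries `belief` (a distribution) plus an update rule, whereas the generator itself carries a single state and the transition table.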
(2) seems more technically correct, but my intuition dislikes the conclusion, for reasons I am struggling to articulate.
…aha, I think this may be what is bothering me: I have glossed over the distinction between input and output tokens. When an LLM is processing input tokens, it is working to synchronize its state to the state of the generator. Once it switches to output mode, there is no functional benefit to continuing to synchronize state (what is it synchronizing to?), so ideally we’d move to a simpler neural net that does not carry the weight of needing to maintain and update a probability distribution of possible states. (Glossing over the fact that LLMs as used in practice sometimes need to repeatedly transition between input and output modes.)
LLMs need the capability to ease themselves into any conversation without knowing the complete history of the participant they are emulating, while people have (in principle) access to their own complete history and so don’t need to be able to jump into a random point in their life and synchronize state on the fly.
So the implication is that the computational task faced by an LLM which can emulate Einstein is harder than the computational task of being Einstein… is that right? If so, that in turn leads to the question of whether there are alternative modalities for AI which have the advantages of LLMs (lots of high-quality training data) but don’t impose this extra burden. It also raises the question of how substantial this burden is in practice, in particular for leading-edge models.