I don’t think there’s actually an asterisk. My naive/uninformed opinion is that the idea that LLMs don’t actually learn a map of the world is very silly.
The algorithm might have a correct map of the world, but if its goals are phrased in terms of words, it will have a pressure to push those words away from their correct meanings. “Ensure human flourishing” is much easier if you can slide those words towards other meanings.
This is only the case if the system doing the optimisation controls the system that provides the world model/does the interpretation. Language models don’t seem to have an incentive to push words away from their correct meanings: they are not agents and have no goals beyond their simulation objective (insofar as they are “inner aligned”).
If the system that’s optimising for human goals doesn’t control the system that interprets said goals, I don’t think an issue like this will arise.
If the system that’s optimising is separate from the system that has the linguistic output, then there’s a huge issue with the optimising system manipulating or fooling the linguistic system—another kind of “symbol grounding failure”.