This is only the case if the system that is doing the optimisation is in control of the system that provides the world model/does the interpretation. Language models don't seem to have an incentive to push words away from their correct meanings. They are not agents and don't have goals beyond their simulation objective (insofar as they are "inner aligned").
If the system that’s optimising for human goals doesn’t control the system that interprets said goals, I don’t think an issue like this will arise.
If the system that's optimising is separate from the system that produces the linguistic output, then there's a huge issue with the optimising system manipulating or fooling the linguistic system, which is another kind of "symbol grounding failure".