A good deal of this post is correct. But the goals of language models are more complex than you admit, and not fully specified by natural language. LLMs do something that’s approximately a simulation of a human. Those simulated quasi-humans are likely to have quasi-human goals that are unstated and tricky to observe, for much the same reasons that humans have such goals.
LLMs also have goals that influence what kind of human they simulate. We'll know approximately what those goals are, since we know what process generated them. But how do we tell whether "approximately" is good enough?