It might also fail for more mundane reasons, such as imitating agentic behavior in the training distribution.
I think what you’re saying here is that you’d see it behaving non-myopically because it’s simulating a consequentialist agent, correct? This difficulty seems to me like a pretty central blocker to doing these kinds of tests on LLMs. It’s not clear to me that at this point we have any way of distinguishing ‘model behavior’ from simulation (one possibility might be looking for something like an attractor, where a surprisingly wide range of prompts result in a particular behavior).
(Of course I realize that there are plenty of cases where simulated behavior has real-world consequences. But for testing whether a model ‘is’ non-myopic, this seems like an important problem)
You could try to do tests on data that is far enough from the training distribution that it won’t generalize in a simple immitative way there, and you could do tests to try and confirm that you are far enough off distribution. For instance, perhaps using a carefully chosen invented language would work.
I think what you’re saying here is that you’d see it behaving non-myopically because it’s simulating a consequentialist agent, correct? This difficulty seems to me like a pretty central blocker to doing these kinds of tests on LLMs. It’s not clear to me that at this point we have any way of distinguishing ‘model behavior’ from simulation (one possibility might be looking for something like an attractor, where a surprisingly wide range of prompts result in a particular behavior).
(Of course I realize that there are plenty of cases where simulated behavior has real-world consequences. But for testing whether a model ‘is’ non-myopic, this seems like an important problem)
You could try to do tests on data that is far enough from the training distribution that it won’t generalize in a simple immitative way there, and you could do tests to try and confirm that you are far enough off distribution. For instance, perhaps using a carefully chosen invented language would work.