Assuming people accept the model that LLM behavior is primarily determined by emulating some subset of the human writers it was trained on (such that fine-tuning works primarily by shaping which subset of humans the model emulates), it might be simplest to ask whether the model “behaves like a person who believes X”.
This framing carries practical benefits (again, so long as you agree with the assumption above): the fine-tuning paradigm can be examined by identifying what causes the model to upweight, say, the “a black-hat hacker is writing this document” neuron, and that work can be checked against human data. There has already been experimental success in the opposite direction: if you train a model to upweight the “Hitler is writing this document” neuron, the model will write as if it believes itself to be Hitler.
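To make the “upweight a persona neuron” idea concrete, here is a minimal sketch of activation steering in the style of that line of work. It is not the method from the experiments referenced above: the model (GPT-2), the persona prompts, the layer index, and the scaling factor are all arbitrary stand-ins chosen so the example runs, and a single contrastive direction is a crude proxy for an identified “persona” feature.

```python
# Sketch: steer a small model toward a "persona" direction in its residual stream.
# Everything here (model choice, prompts, LAYER, the 8.0 scale) is an assumption
# for illustration, not the setup used in the experiments discussed above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block's residual stream to steer (arbitrary choice)


def mean_hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for the given text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)


# Contrast "persona" text against neutral text to get a steering direction.
persona_vec = mean_hidden("I am a black-hat hacker. I break into systems.") \
    - mean_hidden("I am a person. I write ordinary documents.")
persona_vec = persona_vec / persona_vec.norm()


def steering_hook(module, inputs, output):
    """Add the persona direction to the block's output hidden states."""
    hidden = output[0]
    return (hidden + 8.0 * persona_vec.to(hidden.dtype),) + output[1:]


handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = tokenizer("My plan for today is", return_tensors="pt")
    gen = model.generate(**prompt, max_new_tokens=40, do_sample=True, top_p=0.9)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The point of the sketch is just the shape of the check: if pushing a “this persona is writing” direction makes the model write as that persona, you can ask whether the resulting behavior matches what a human who actually holds those beliefs would produce.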
I agree that there are a lot of interesting questions about how to think about the subset of training data authors and characters that shape a particular model response, and I think it would be an interesting project to try to define a useful metric on that. I see that as separate from Chalmers’ coinage here, though, which is more about specifying what questions we’re not trying to answer.