I continue to endorse the claims made here, although in the end the project turned out to be somewhat redundant with an existing paper I hadn’t seen.
The core claim, that LLMs quickly build a model of users and other text authors, is now fairly widely known as ‘truesight’.
I still think there’s quite a lot of interesting and valuable follow-up work that can be done here even though my own research directions have shifted elsewhere[1], and I’m very happy to discuss it with anyone interested in doing that work! One place to start would be a straightforward replication with newer models — this work was on GPT-3.5-Turbo (which is absolutely ancient in LLM years) and I expect that current models can do this much more effectively. Otherwise, the follow-up work I propose in the discussion section all still seems valuable. I think my proposed general metric was correct[2].
[1] I was already seeing glimmers of my current research topic. I say in the discussion ‘if we can learn more about models’ self-understanding, we can potentially shape that process to ensure models are well-aligned, and detect ways in which they might not be’, and that’s basically my focus now.
[2] I would now express it as: how much does perplexity on a user’s text decrease as a function of how much of that user’s other text the model has seen (averaged across users and texts)? Still, guessing demographics was useful as a metric that quickly conveyed how much a model had learned.
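For concreteness, here is a minimal sketch of how that metric could be computed, assuming a Hugging Face causal LM (GPT-2 is used purely as a stand-in for whatever model is being evaluated) and a hypothetical mapping from users to their texts; it measures perplexity on a held-out text as a function of how many of the same user’s other texts appear in the context.

```python
# Minimal sketch of the proposed metric, assuming a Hugging Face causal LM.
# GPT-2 and the users-to-texts dict are stand-ins, not the original setup.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity_given_context(context: str, target: str) -> float:
    """Perplexity of `target` when `context` is already in the prompt."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # score only the target tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over target
    return math.exp(loss.item())

def perplexity_vs_context(users: dict[str, list[str]],
                          context_sizes=(0, 1, 2, 4)) -> dict[int, float]:
    """Average perplexity on each user's held-out text, as a function of
    how many of that user's other texts the model sees in context."""
    results = {k: [] for k in context_sizes}
    for texts in users.values():
        held_out, prior = texts[-1], texts[:-1]
        for k in context_sizes:
            context = "\n\n".join(prior[:k])
            results[k].append(perplexity_given_context(context, held_out))
    return {k: sum(v) / len(v) for k, v in results.items()}
```

Plotting the averaged perplexity against the number of prior texts shown would give the decrease-in-perplexity curve described above; a steeper drop would indicate the model is learning more about the author from less text.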