On a related note, what would it take for you to say that an AI is speaking with its own voice rather than just a simulated persona? Would that necessitate a different architecture from the Transformer? A different pre-training objective? Different training data? Different supervised fine-tuning? Different RL fine-tuning scheme?
I think it’s theoretically possible to design a transformer that accurately describes its own thought process. If I can be really messy and high-level: imagine a transformer that can, among other things, determine whether a text passage is talking about birds. It conveniently does this with a single bird-watching neuron. The output of that neuron could propagate through the network until it reaches a circuit that makes it output “Yup, that’s a bird.”, but that same signal could also propagate to a different part of the network that self-referentially causes it to output “I decided it’s a bird because the bird-watching neuron in layer X, index Y activated.”
Could you train this thing? Probably, with the right training data: you’d monitor every neuron and generate examples like “I decided X because of neuron Y”. For more complex concepts it would obviously be more involved. I doubt this is even remotely practical, but I don’t think the architecture strictly forbids it.
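To make that concrete, here’s a toy sketch of generating that kind of self-describing training data. Everything in it is made up for illustration: the model is a stand-in classifier rather than a real transformer, and the “bird-watching neuron” is just an arbitrary hidden unit we watch with a forward hook.

```python
# Toy sketch: watch one hidden unit with a forward hook and turn its
# activation into a self-describing training string. The model, layer,
# and unit index are all hypothetical stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "transformer": a tiny classifier over 100 input features.
model = nn.Sequential(
    nn.Linear(100, 32),  # layer 0
    nn.ReLU(),
    nn.Linear(32, 2),    # logits: [not-bird, bird]
)

LAYER, UNIT = 0, 7  # pretend unit 7 of layer 0 is the bird-watching neuron
captured = {}

def hook(module, inputs, output):
    captured["act"] = output.detach()

model[LAYER].register_forward_hook(hook)

def self_describing_example(x):
    """Run the model and emit text describing its own decisive activation."""
    logits = model(x)
    label = "a bird" if logits.argmax(-1).item() == 1 else "not a bird"
    act = captured["act"][0, UNIT].item()
    fired = "activated" if act > 0 else "stayed quiet"
    return (f"I decided it's {label} because the bird-watching neuron "
            f"in layer {LAYER}, index {UNIT} {fired} (value {act:.2f}).")

print(self_describing_example(torch.randn(1, 100)))
```

In a real version you’d have to do this for every neuron (or feature) you care about and fold the generated text back into training, which is where the “not remotely practical” part kicks in.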
You could probably do something more useful by training a model on the outputs of interpretability tools run on it. We already sort of do that indirectly: we find out how math works in Claude, publish it, and then train Claude on articles from the internet, including the ones about how math works in Claude. You could close a much more direct loop and slowly train the LLM to predict itself.
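Sketching the shape of that loop, with every function here a hypothetical placeholder (the interpretability tool, the description step, and the fine-tuning step are all stubs, not real APIs):

```python
# Rough sketch of the self-prediction loop: run an interpretability tool on
# the model, describe what it found in natural language, fine-tune the model
# on those descriptions, repeat. All functions below are hypothetical stubs.

def run_interpretability_tool(model, prompt):
    """Stand-in for a real tool (feature attribution, circuit tracing, etc.)."""
    return {"prompt": prompt, "feature": "bird-watching", "layer": 0, "index": 7}

def describe_finding(finding):
    """Turn the tool's output into natural-language training text."""
    return (f"On the prompt {finding['prompt']!r}, my answer was driven by the "
            f"{finding['feature']} feature at layer {finding['layer']}, "
            f"index {finding['index']}.")

def finetune(model, corpus):
    """Placeholder for fine-tuning the model on text about its own internals."""
    print(f"fine-tuning on {len(corpus)} self-descriptions")
    return model

model = object()  # placeholder for the actual LLM
prompts = ["A robin landed on the feeder.", "The market fell today."]

for step in range(3):
    corpus = [describe_finding(run_interpretability_tool(model, p)) for p in prompts]
    model = finetune(model, corpus)  # next round's tool runs on the updated model
```

The important bit is that each round’s descriptions come from the current model, so it gets trained on text that actually reflects its own internals rather than whatever happens to be written about it on the internet.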
So I guess tl;dr: Different training data. For the model to learn to usefully describe its own thought process, it needs to be trained on something that has access to its thought process. Current LLMs don’t have that.
See here for my response about humans.