On a related note, what would it take for you to say that an AI is speaking with its own voice rather than just a simulated persona? Would that necessitate a different architecture from the Transformer? A different pre-training objective? Different training data? Different supervised fine-tuning? Different RL fine-tuning scheme?
Any AI, no matter how advanced, will have to train on human language data (and maybe some LLM-generated data, which grew out of the distribution of human-generated content) in order to communicate with us at all. To truly speak for itself, would it have to follow a language-learning trajectory more similar to that of human toddlers (though at an accelerated rate), or rather, have a cognitive architecture that is more naturally amenable to that sort of learning?
And can’t you say that, most of the time, humans themselves are playing roles when they choose what to say or do, pattern-matching to expectations for the most part? What are humans doing differently when they are truly speaking for themselves?
I think it’s theoretically possible to design a transformer that accurately describes its own thought process. If I can be really messy and high-level: imagine a transformer that can, among other things, determine whether a text passage is talking about birds, and that conveniently does this with a single bird-watching neuron. The output of that neuron could propagate through the network until it reaches a circuit that makes the model output “Yup, that’s a bird.” But that same neuron could also propagate to a different part of the network that self-referentially causes the model to output “I decided it’s a bird because the bird-watching neuron at layer X, index Y activated.”
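To make the wiring concrete, here’s a minimal toy sketch (PyTorch, and not an actual transformer): a single “bird neuron” feeds both an ordinary classification head and a second head whose only job is to report on that neuron. All the names and sizes are made up for illustration.

```python
import torch
import torch.nn as nn

class SelfReportingToy(nn.Module):
    """Toy stand-in for the wiring described above: one hidden unit (the
    "bird-watching neuron") feeds both an ordinary classification head and
    a head whose only job is to report on that neuron. Not a real transformer."""

    def __init__(self, d_in: int = 16):
        super().__init__()
        self.encoder = nn.Linear(d_in, 1)    # "layer X"; its single unit is "index Y"
        self.classify = nn.Linear(1, 1)      # drives "Yup, that's a bird."
        self.self_report = nn.Linear(1, 1)   # drives "I decided it's a bird because..."

    def forward(self, x: torch.Tensor):
        bird_neuron = torch.relu(self.encoder(x))            # the bird-watching neuron
        is_bird = torch.sigmoid(self.classify(bird_neuron))
        report = torch.sigmoid(self.self_report(bird_neuron))
        return is_bird, report, bird_neuron

model = SelfReportingToy()
is_bird, report, activation = model(torch.randn(1, 16))
if is_bird.item() > 0.5:
    print("Yup, that's a bird.")
if report.item() > 0.5:
    print(f"I decided it's a bird because the bird-watching neuron "
          f"(layer X = encoder, index Y = 0) activated at {activation.item():.2f}.")
```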
Could you train this thing? Probably, with the right training data: monitor every neuron and generate examples like “I decided X because of neuron Y”. For more complex concepts the generated descriptions would obviously be more complex. I doubt this is even remotely practical, but I don’t think the architecture strictly forbids it.
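As a rough illustration of that data-generation step (hypothetical, and exactly as impractical as it sounds): hook every layer, record activations, and turn the most active unit into the “because” clause. The helper name `make_self_description` is made up for this sketch.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the data-generation step: run the model, record
# every hidden activation with forward hooks, and turn the most active unit
# into an "I decided X because of neuron Y" training string.

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

activations = {}

def hook(name):
    def _hook(module, inputs, output):
        activations[name] = output.detach()
    return _hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(hook(name))

def make_self_description(x: torch.Tensor, labels=("not a bird", "a bird")) -> str:
    logits = model(x)
    decision = labels[logits.argmax(dim=-1).item()]
    # Crudely pick the single most active unit as the "because" clause.
    layer, acts = max(activations.items(), key=lambda kv: kv[1].max().item())
    index = acts.argmax().item()
    return f"I decided it's {decision} because neuron {index} in layer '{layer}' activated."

print(make_self_description(torch.randn(1, 16)))
```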
You could probably do something more useful by training a model on the outputs of interpretability tools run on that same model. We sort of do that indirectly now: we find out how math works in Claude, and then train Claude on articles from the internet, including articles about how math works in Claude. You could close the loop much more directly and slowly train the LLM to predict itself.
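In outline, that loop might look something like the following; `run_interpretability_tool` and `finetune_on_text` are placeholders for whatever tooling and training stack you actually have, not real APIs.

```python
# Hypothetical outline of the "train the model to predict itself" loop.
# The two helpers below are assumptions standing in for real tooling.

def run_interpretability_tool(model) -> list[str]:
    """Assumed: returns natural-language findings about the model's own
    circuits, e.g. 'addition is carried out by feature F in layer 12'."""
    ...

def finetune_on_text(model, texts: list[str]):
    """Assumed: one fine-tuning pass over the generated descriptions."""
    ...

def self_prediction_loop(model, rounds: int = 3):
    for _ in range(rounds):
        findings = run_interpretability_tool(model)   # look inside the current model
        finetune_on_text(model, findings)             # train it on what was found
        # Each round re-derives the findings from the *updated* model,
        # rather than from stale internet articles about an older model.
```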
So I guess, tl;dr: different training data. For the model to learn to usefully describe its own thought process, it needs to be trained on something that has access to that thought process. Current LLMs don’t have that.
See here for my response about humans.