Is AI self-aware? Mechanistic Evidence from Activation Steering
It’s not exactly the hard question.
But are models self-aware? And how would you measure that in a transformer?
My paper shows that, in some measurable ways, models can actually see themselves:
[2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing