Is AI Self-Aware? Mechanistic Evidence from Activation Steering

It’s not exactly the hard question.

But are they self-aware? And how would you measure that in a transformer model?

My paper shows that, in some ways, models can actually see themselves:

[2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
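For readers unfamiliar with activation steering, here is a minimal toy sketch of the general technique, not the paper's specific method: build a "steering vector" as the difference of mean activations between two sets of prompts, then add it to a layer's hidden states at inference time. The data here is synthetic (random vectors standing in for real residual-stream activations), and all names (`steer`, `alpha`, etc.) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden activations: in a real model these would be
# residual-stream vectors captured from a transformer layer (e.g. via a
# forward hook), for self-referential vs. neutral prompts.
d_model = 8
acts_self = rng.normal(loc=1.0, size=(16, d_model))     # "self-referential" prompts (synthetic)
acts_neutral = rng.normal(loc=0.0, size=(16, d_model))  # neutral prompts (synthetic)

# Classic difference-of-means steering vector.
steering_vec = acts_self.mean(axis=0) - acts_neutral.mean(axis=0)

def steer(hidden, alpha=2.0):
    """Add the steering vector (scaled by alpha) to every token position."""
    return hidden + alpha * steering_vec

h = rng.normal(size=(4, d_model))  # hidden states for 4 token positions
h_steered = steer(h)

# Steering moves activations toward the "self-referential" direction,
# as measured by their projection onto the steering vector.
print(np.dot(h_steered.mean(axis=0), steering_vec)
      > np.dot(h.mean(axis=0), steering_vec))
```

In a real setup the addition happens inside the forward pass (a hook on a chosen layer), and `alpha` is tuned so the intervention shifts behavior without degrading fluency.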