Is AI self-aware? Mechanistic Evidence from Activation Steering
It’s not exactly the hard question.
But are models self-aware? And how would you measure that in a transformer?
My paper shows that, in some measurable ways, models can actually see themselves:
[2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing