Is AI self-aware? Mechanistic Evidence from Activation Steering
It’s not exactly the hard question.
But are AI models self-aware? And how do you measure that in a transformer model?
My paper shows that in some ways, models can actually see themselves:
Long-form post coming soon, with a technical breakdown of what we did.