I was asking more “how does the AI get a good model of itself”, but your answer was still interesting, thanks. Still not sure whether you think there are some straightforward ways future AI will get such a model that all come out more or less at the starting point of your proposal. (Or if not.)
Here’s another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking “how do you communicate with humans to make them good at RL feedback,” you’re asking “how do you communicate with humans to make them good at participating in verbal chain of thought?”