We hope to borrow many ideas from the cogsci work, where mental models between people (e.g., co-working, teacher/student situations) are well studied. The work we cite may give a good idea of the flavor: https://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf or https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0048. In other words, cogsci folks have been studying how humans come to understand each other in order to work better together or to enable better education, and agentic interpretability advocates doing something similar (though it may look very different) with machines.
I was asking more “how does the AI get a good model of itself”, but your answer was still interesting, thanks. I’m still not sure whether you think there are straightforward ways future AI will get such a model, all of which come out more or less at the starting point of your proposal. (Or if not.)
Here’s another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except that where I take ELK to be asking “how do you communicate with humans to make them good at RL feedback,” you’re asking “how do you communicate with humans to make them good at participating in verbal chain of thought?”