I don’t understand how it is an answer to gradual disempowerment.
In the GD scenario, both individual humans and AI understand that:
AI-powered governance/economy leads to a better “utilitarian outcome” (more wealth, better governance)
This leads to disempowerment
Moral/economic pressure leads to that disempowerment anyway
The problem is not in the understanding. It is in the incentives. The threat model was never “oops, we did not understand these dynamics, and now we are disempowered… if only we knew…”. We already know before it happens; that is the point of the original GD paper. Having an LLM understand and explain those dynamics does not help avert that scenario, does it?
I wouldn’t go as far as calling it an answer, but I think it helps. The mechanism is lowering the gap between agent-powered systems and systems where a human understands what is being done or is otherwise in the loop.