AI interpretability can assign meaning to states of an AI, but what about process? Are there principled ways of concluding that an AI is thinking, deciding, trying, and so on?
If it can assign meaning to states, then sure, why not? Currently this comes with plenty of caveats, so it somewhat depends on how much of a stickler you want to be about principledness and effectiveness.
Sometimes “deciding” etc. is represented directly in the activations, in which case a probe on states already answers the question, which is kind of trivial. So you may really be asking about interpreting the parameters of the AI, i.e. the computation that transforms one state into the next. Keywords to look up: circuit interpretability, automated circuit discovery, and parameter decomposition.
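To make the state-vs-process distinction concrete, here is a minimal sketch of activation patching, one common building block of circuit interpretability. Everything in it is made up for illustration: a toy MLP stands in for a real model, and the "clean"/"corrupt" inputs are just random tensors rather than a real contrastive pair.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network; the point is to test the *computation*
# between states, not just to label the states themselves.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

clean, corrupt = torch.randn(1, 8), torch.randn(1, 8)

# 1. Run the "clean" input and cache the hidden activation at one site.
cache = {}
hook = model[1].register_forward_hook(lambda m, i, o: cache.update(h=o.detach()))
_ = model(clean)
hook.remove()

# 2. Run the "corrupt" input, patching in the cached clean activation.
#    Returning a tensor from a forward hook replaces the module's output.
hook = model[1].register_forward_hook(lambda m, i, o: cache["h"])
patched_out = model(corrupt)
hook.remove()

# 3. Compare against the unpatched corrupt run: a large difference means
#    this site carries causally relevant information for the output.
corrupt_out = model(corrupt)
print("patching effect:", (patched_out - corrupt_out).norm().item())
```

In practice you would patch at many sites (attention heads, MLP layers, token positions) in a real model and rank them by effect size; automated circuit discovery is, roughly, automating that search.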