AI interpretability can assign meaning to states of an AI, but what about process? Are there principled ways of concluding that an AI is thinking, deciding, trying, and so on?
If it can assign meaning to states, then sure, why not? Currently this comes with plenty of caveats, so it somewhat depends on how much of a stickler you want to be about principledness and effectiveness.
Sometimes “deciding” etc. is represented directly in the activations, in which case a probe on states already answers the question, which is kind of trivial. So you may really be asking about interpreting the parameters of the AI, i.e. the computation that transforms one state into the next. Keywords to look up: circuit interpretability, automated circuit discovery, and parameter decomposition.
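To make the state-vs-process distinction concrete, here is a minimal sketch of activation patching, one common building block of circuit interpretability. Everything in it is made up for illustration: a toy MLP stands in for a real model, and the "clean"/"corrupt" inputs are just random tensors rather than a real contrastive pair.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network; the point is to test the *computation*
# between states, not just to label the states themselves.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

clean, corrupt = torch.randn(1, 8), torch.randn(1, 8)

# 1. Run the "clean" input and cache the hidden activation at one site.
cache = {}
hook = model[1].register_forward_hook(lambda m, i, o: cache.update(h=o.detach()))
_ = model(clean)
hook.remove()

# 2. Run the "corrupt" input, patching in the cached clean activation.
#    Returning a tensor from a forward hook replaces the module's output.
hook = model[1].register_forward_hook(lambda m, i, o: cache["h"])
patched_out = model(corrupt)
hook.remove()

# 3. Compare against the unpatched corrupt run: a large difference means
#    this site carries causally relevant information for the output.
corrupt_out = model(corrupt)
print("patching effect:", (patched_out - corrupt_out).norm().item())
```

In practice you would patch at many sites (attention heads, MLP layers, token positions) in a real model and rank them by effect size; automated circuit discovery is, roughly, automating that search.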