[Question] Why do we care about agency for alignment?

Many people believe that understanding “agency” is crucial for alignment, but as far as I know, there isn’t a canonical list of reasons why we care about agency. Please describe any reasons why we might care about the concept of agency for understanding alignment below. If you have multiple reasons, please list them in separate answers below.

Please also try to be specific as possible about what our goal is in the scenario. For example:

We want to know what an agent is so that we can determine whether or not a given AI is a dangerous agent

Whilst useful isn’t quite as good as:

We have an AI which may or may not have goals aligned with us and we want to know what will happen if these goals aren’t aligned. In particular, we are worried that the AI may develop instrumental incentives to seek power and we want to use interpretability tools to help us figure out how worried we should be for a particular system.

We can imagine a scale of such systems according to how it behaves in novel situations:

  • On the low end of this scale, we have a system that only follows extremely broad heuristics that it learned during training

  • On the high end of this scale, we have a system that uses general reasoning capabilities to discover new and innovative strategies even if it has never used anything like this strategy before, or seen it in the training data

We can think of such a scale as a concretisation of agency. This scale provides an indication of both:

  • How dangerous a system is likely to be: though a system with excellent heuristics and very little agency could be dangerous as well

  • What kind of precautions we might want to take: for example, how much can we rely on our off-switch to disable the agent

It’s plausible that interpretability tools might be able to give us some idea of where an agent is on this scale just by looking at the weights. So having a clear definition of this scale could help by clarifying what we are looking for.

In a few days, I’ll add any use cases I’m aware of myself that either haven’t been covered or that I don’t think have been adequately explained by different answers.