However, I also think that open agency approaches to transparency face two key difficulties: competitiveness and safety-of-the-components.[18]
I think a third key difficulty with this class of approaches is something like “emergent agency”, i.e. that each of the individual components seem to be doing something safe, but when you combine several of the agents, you get a scary agent. Intuition pump: each of the weights in a NN is very understandable (it’s just a number) and not doing dangerous scheming, but if you compose them it might be scary. Analagously, each of the subagents in the open agency AI might not be scheming, but a collection of these agents might be scheming.
Understanding the communications between the components seems like it may or may not be sufficient to mitigate this failure mode. If the understanding is “local”, i.e. looking at a particular chain of reasoning and verifying that it is valid, this is probably not sufficient to mitigate the problem, as scary reasoning might be made up of a bunch of small chains of local valid reasoning that looks safe. So I think you want something like a reasonable global picture of the reasoning that the open agent is doing in order to mitigate “emergent agency”.
I think this is kind of related to types of the “safety of the components” failure mode you talk about, particularly in the analogue to the corporation passing memos around, but the memos not corresponding to the “real reasoning” going on. However, it could be that the “real reasoning” emerges on a higher level of abstraction than the individual agents.
This sort of threat model leads me to think that if we’re aiming for this sort of open agency, we shouldn’t do end-to-end training of the whole system, lest we incentivize “emergent agency”, even if we don’t make the individual components less safe.
I think a third key difficulty with this class of approaches is something like “emergent agency”, i.e. that each of the individual components seem to be doing something safe, but when you combine several of the agents, you get a scary agent. Intuition pump: each of the weights in a NN is very understandable (it’s just a number) and not doing dangerous scheming, but if you compose them it might be scary. Analagously, each of the subagents in the open agency AI might not be scheming, but a collection of these agents might be scheming.
Understanding the communications between the components seems like it may or may not be sufficient to mitigate this failure mode. If the understanding is “local”, i.e. looking at a particular chain of reasoning and verifying that it is valid, this is probably not sufficient to mitigate the problem, as scary reasoning might be made up of a bunch of small chains of local valid reasoning that looks safe. So I think you want something like a reasonable global picture of the reasoning that the open agent is doing in order to mitigate “emergent agency”.
I think this is kind of related to types of the “safety of the components” failure mode you talk about, particularly in the analogue to the corporation passing memos around, but the memos not corresponding to the “real reasoning” going on. However, it could be that the “real reasoning” emerges on a higher level of abstraction than the individual agents.
This sort of threat model leads me to think that if we’re aiming for this sort of open agency, we shouldn’t do end-to-end training of the whole system, lest we incentivize “emergent agency”, even if we don’t make the individual components less safe.