I think legitimacy should be about communication between the parts of an agent (or world) design, rather than about what any given part is working on. Like the universal prior, logical reasoning can be malign: it can express facts about how various nefarious (mesa-)agents would behave, what persuasive arguments they would construct to influence things they have no business influencing, or fatalistic claims about what the agent itself is going to decide or conclude before it does so on its own. That shouldn't necessarily stop a reasoning part of the design from entertaining such ideas, but these ideas need to be kept away from other parts that they might influence in unfortunate ways.
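To make the distinction concrete, here is a minimal sketch in Python of legitimacy as a property of the channel rather than of the computation. All names here (Claim, ReasoningPart, Gate, DecisionPart, and the tag vocabulary) are hypothetical illustrations, not any existing system: the reasoning part is free to derive anything, but a gate decides which of its conclusions other parts get to hear.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    content: str
    # Tags the reasoning part attaches to its own outputs, e.g. marking
    # claims that describe adversarial arguments or that predict the
    # agent's own future decisions before it makes them.
    tags: frozenset = field(default_factory=frozenset)


class ReasoningPart:
    """Free to entertain any idea internally."""

    def deliberate(self) -> list[Claim]:
        return [
            Claim("the bridge design holds under load X"),
            Claim("argument A would persuade the planner to defer",
                  tags=frozenset({"adversarial"})),
            Claim("the agent will conclude C regardless of evidence",
                  tags=frozenset({"self-fulfilling"})),
        ]


class Gate:
    """Legitimacy lives here: in what may cross between parts."""

    BLOCKED = {"adversarial", "self-fulfilling"}

    def filter(self, claims: list[Claim]) -> list[Claim]:
        return [c for c in claims if not (c.tags & self.BLOCKED)]


class DecisionPart:
    def act_on(self, claims: list[Claim]) -> None:
        for c in claims:
            print("acting on:", c.content)


reasoner, gate, decider = ReasoningPart(), Gate(), DecisionPart()
decider.act_on(gate.filter(reasoner.deliberate()))
# Only the bridge claim crosses; the reasoner still computed the rest.
```

The point of the sketch is that nothing constrains what `ReasoningPart` computes; the constraint is applied entirely at the boundary where its outputs could start influencing another part.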
If each part of an agent design is a computation defining a point in its own domain, then each process of computation is developing its own "abstract world", with its own content, purpose, and notions of integrity separate from those of the other computations (only parts of these worlds get to be physically instantiated, as the computations obtain enough resources in the physical world to access more of their computed content). As the parts communicate, they are in effect listening to the content of each other's abstract worlds, giving those worlds influence over themselves. In some of these abstract computational worlds there be dragons; such worlds need to be observed only with care, but that doesn't necessarily mean they shouldn't be expressed at all. There is epistemic integrity in designing dangerous things, and then instrumental integrity in not actually building them, perhaps in not even knowing their designs outside highly secured epistemic environments.
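A second sketch, again with entirely hypothetical names, for the claim that only the observed parts of an abstract world get instantiated, and that observation can itself be budgeted: the world is a lazily evaluated computation, and a careful observer reads from it through a restricted view without giving it any channel to write back.

```python
from functools import lru_cache


class AbstractWorld:
    """A computation that defines far more content than is ever built."""

    @lru_cache(maxsize=None)
    def region(self, coordinate: int) -> str:
        # Stands in for arbitrary computed content; evaluating a region
        # is the only way it ever becomes "physically instantiated".
        return f"content at {coordinate}"


class CarefulObserver:
    """Reads a world without letting it reach back into the observer."""

    def __init__(self, world: AbstractWorld, budget: int):
        self.world = world
        self.budget = budget  # limits how much of the world is realized

    def peek(self, coordinate: int) -> str | None:
        if self.budget <= 0:
            return None  # refuse to instantiate more of a dragon-filled world
        self.budget -= 1
        return self.world.region(coordinate)


world = AbstractWorld()
observer = CarefulObserver(world, budget=2)
print(observer.peek(0), observer.peek(1), observer.peek(2))
# -> "content at 0" "content at 1" None; the third region is never computed.
```

Here the world's definition is fully expressed as code, yet almost none of it is ever run: epistemic integrity in writing it down, instrumental integrity in bounding how much of it gets built.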