Our hypothetical agent knows that it is an agent. I can’t yet formalize what I mean by this, but I think that it requires probability distributions corresponding to a certain causal structure, which would allow us to distinguish it from other causal structures.
How about: an agent, relative to a given situation described by a causal graph G, is an entity that can perform do-actions on G.
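To make the "do-actions on G" proposal concrete, here is a minimal sketch of an intervention on a structural causal model. The names (`CausalModel`, `do`) and the toy rain/sprinkler example are illustrative, not from any standard library:

```python
# Minimal sketch of a do-intervention on a structural causal model.
# CausalModel and do are illustrative names, not a standard library API.

class CausalModel:
    def __init__(self, mechanisms):
        # mechanisms: node -> (tuple of parents, function of parent values)
        self.mechanisms = dict(mechanisms)

    def evaluate(self):
        # Compute every node's value by recursing through its parents.
        values = {}
        def val(node):
            if node not in values:
                parents, f = self.mechanisms[node]
                values[node] = f(*(val(p) for p in parents))
            return values[node]
        for node in self.mechanisms:
            val(node)
        return values

    def do(self, node, value):
        # do(node := value): sever the node from its parents
        # and pin it to a constant; all other mechanisms are untouched.
        intervened = dict(self.mechanisms)
        intervened[node] = ((), lambda: value)
        return CausalModel(intervened)

# Toy graph: rain suppresses the sprinkler; either one wets the grass.
g = CausalModel({
    "rain": ((), lambda: True),
    "sprinkler": (("rain",), lambda r: not r),
    "wet": (("rain", "sprinkler"), lambda r, s: r or s),
})

print(g.evaluate())                       # observed world: it rained
print(g.do("rain", False).evaluate())     # intervened world: do(rain = False)
```

Note that `do` is invoked from outside the model, by whoever holds the `CausalModel` object; the graph itself contains no node for the intervener. That is exactly the feature at issue in the reply below.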
No, that’s not what I meant at all. In what you said, the agent needs to be separate from the system in order to perform do-actions. I want an agent that knows it’s an agent, so it has to have a self-model and, in particular, has to be inside the system that is modelled by our causal graph.
One of the guiding heuristics in FAI theory is that an agent should model itself the same way it models other things. Roughly, the agent isn’t actually tagged as different from non-agent things in reality, so any desired behaviour that depends on correctly making this distinction cannot be checked against evidence about whether the agent is actually drawing the distinction the way we want it to. A common example is the distinction between self-modification and creating a successor AI; an FAI should not need to distinguish these, since they’re functionally the same. These sorts of ideas are why I want the agent to be modelled within its own causal graph.