Physicalist agents see themselves as occupying an unprivileged position within the universe. However, it’s unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, human civilization historically struggled to accept the Copernican principle (and is still confused about issues such as free will, anthropics, and quantum physics, which physicalist agents shouldn’t be confused about). This presents a problem for superimitation.
What if humans are actually Cartesian agents? In that case, it makes sense to consider a variant of physicalist superimitation in which, instead of merely seeing itself as unprivileged, the AI treats the user as a privileged agent. We call such agents “transcartesian”. Here is how this can be formalized as a modification of IBP.
In IBP, a hypothesis is specified by choosing the state space Φ and the belief Θ∈□(Γ×Φ). In the transcartesian framework, a hypothesis is additionally equipped with a mapping τ:Φ→(A0×O0)≤ω, where A0 is the action set of the reference agent (the user) and O0 is its observation set. Given the source code G0 of the reference agent, we require that Θ is supported on the set
{(y, x) ∈ Γ×Φ ∣ ha ⊑ τ(x) ⟹ a = G0^y(h)}
That is, every action along the reference agent’s history is indeed the one computed by its source code on the preceding history.
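As an illustrative toy version of this support condition (all names are hypothetical; finite lists of action–observation pairs stand in for (A0×O0)≤ω, and a plain Python function stands in for the source code G0 evaluated relative to a computational universe y), one can check directly that every action along τ(x) matches what the reference agent’s source code outputs on the preceding history:

```python
# Toy finite sketch of the transcartesian support condition.
# A "history" is a list of (action, observation) pairs; tau maps a
# physical state x to such a history; g0 plays the role of G0^y: it
# takes the computational universe y and a history prefix h, and
# returns the reference agent's next action.

def consistent(y, x, tau, g0):
    """Check: ha ⊑ τ(x) implies a = G0^y(h), for every prefix of τ(x)."""
    history = tau(x)
    for i, (action, _obs) in enumerate(history):
        prefix = history[:i]          # h: the history before this action
        if g0(y, prefix) != action:   # the action must be G0^y(h)
            return False
    return True

# Example reference agent: repeat the last observation, defaulting to
# "a" on the empty history.
def g0(y, h):
    return h[-1][1] if h else "a"

ok_state  = [("a", "b"), ("b", "a")]   # actions follow g0 at every step
bad_state = [("a", "b"), ("a", "a")]   # second action should be "b"

print(consistent(None, ok_state, lambda x: x, g0))   # True
print(consistent(None, bad_state, lambda x: x, g0))  # False
```

Only states x whose τ-image passes this check (for the given y) can carry probability mass under Θ; the sketch uses the identity for τ purely for brevity.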
Now, instead of using a loss function of the form L:elΓ→R, we can use a loss function of the form L:(A0×O0)≤ω→R, which need not satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form L:(A0×O0)≤ω×elΓ→R, monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.
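To make the contrast between the three loss shapes concrete, here is a hedged type-level sketch (all names hypothetical; finite histories stand in for (A0×O0)≤ω and a set of facts stands in for elΓ, with “monotone” rendered as: a superset of facts never increases the loss — direction conventions vary, so this is purely illustrative):

```python
# Sketch of the three loss-function shapes from the text.
# Histories are lists of (action, observation) pairs; "facts" is a
# frozenset standing in for an element of elΓ.

def cartesian_loss(history):
    """L : (A0×O0)^≤ω → R. No monotonicity constraint: further
    interaction can raise or lower the loss freely."""
    return sum(1.0 for _act, obs in history if obs == "bad")

def physicalist_loss(facts):
    """L : elΓ → R. Monotone in the illustrated sense: any superset
    of facts yields a loss that is no larger."""
    return 0.0 if "goal_achieved" in facts else 1.0

def hybrid_loss(history, facts):
    """L : (A0×O0)^≤ω × elΓ → R. Monotone in the second argument only,
    since only the physicalist summand depends on the facts."""
    return cartesian_loss(history) + physicalist_loss(facts)
```

The hybrid form inherits monotonicity in its second argument from the physicalist summand, while the Cartesian summand remains unconstrained, mirroring the requirement stated above.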
Unlike physicalist agents, transcartesian agents do incur penalties associated with the description complexity of the bridge rules (for the reference agent). Such an agent can, for example, come to believe a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment is maintained).