Physicalist agents see themselves as occupying an unprivileged position within the universe. However, it’s unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, human civilization historically struggled to accept the Copernican principle (and is still confused about issues such as free will, anthropics, and quantum physics, which physicalist agents shouldn’t be confused about). This presents a problem for superimitation.
What if humans are actually Cartesian agents? In that case, it makes sense to consider a variant of physicalist superimitation in which, instead of merely seeing itself as unprivileged, the AI treats the user as a privileged agent. We call such agents “transcartesian”. Here is how this can be formalized as a modification of IBP.
In IBP, a hypothesis is specified by choosing the state space Φ and the belief Θ∈□(Γ×Φ). In the transcartesian framework, a hypothesis is additionally equipped with a mapping τ:Φ→(A0×O0)≤ω, where A0 is the action set of the reference agent (the user) and O0 is its observation set. Given the source code G0 of the reference agent, we require that Θ is supported on the set
{(y, x) ∈ Γ×Φ ∣ ha ⊑ τ(x) ⟹ a = G0^y(h)}
That is, every action along the reference agent’s history is indeed the one computed by its source code on the preceding history.
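As an illustrative toy version of this support condition (all names are hypothetical; finite lists of action–observation pairs stand in for (A0×O0)≤ω, and a plain Python function stands in for the source code G0 evaluated relative to a computational universe y), one can check directly that every action along τ(x) matches what the reference agent’s source code outputs on the preceding history:

```python
# Toy finite sketch of the transcartesian support condition.
# A "history" is a list of (action, observation) pairs; tau maps a
# physical state x to such a history; g0 plays the role of G0^y: it
# takes the computational universe y and a history prefix h, and
# returns the reference agent's next action.

def consistent(y, x, tau, g0):
    """Check: ha ⊑ τ(x) implies a = G0^y(h), for every prefix of τ(x)."""
    history = tau(x)
    for i, (action, _obs) in enumerate(history):
        prefix = history[:i]          # h: the history before this action
        if g0(y, prefix) != action:   # the action must be G0^y(h)
            return False
    return True

# Example reference agent: repeat the last observation, defaulting to
# "a" on the empty history.
def g0(y, h):
    return h[-1][1] if h else "a"

ok_state  = [("a", "b"), ("b", "a")]   # actions follow g0 at every step
bad_state = [("a", "b"), ("a", "a")]   # second action should be "b"

print(consistent(None, ok_state, lambda x: x, g0))   # True
print(consistent(None, bad_state, lambda x: x, g0))  # False
```

Only states x whose τ-image passes this check (for the given y) can carry probability mass under Θ; the sketch uses the identity for τ purely for brevity.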
Now, instead of using a loss function of the form L:elΓ→R, we can use a loss function of the form L:(A0×O0)≤ω→R, which need not satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form L:(A0×O0)≤ω×elΓ→R, monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.
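To make the contrast between the three loss shapes concrete, here is a hedged type-level sketch (all names hypothetical; finite histories stand in for (A0×O0)≤ω and a set of facts stands in for elΓ, with “monotone” rendered as: a superset of facts never increases the loss — direction conventions vary, so this is purely illustrative):

```python
# Sketch of the three loss-function shapes from the text.
# Histories are lists of (action, observation) pairs; "facts" is a
# frozenset standing in for an element of elΓ.

def cartesian_loss(history):
    """L : (A0×O0)^≤ω → R. No monotonicity constraint: further
    interaction can raise or lower the loss freely."""
    return sum(1.0 for _act, obs in history if obs == "bad")

def physicalist_loss(facts):
    """L : elΓ → R. Monotone in the illustrated sense: any superset
    of facts yields a loss that is no larger."""
    return 0.0 if "goal_achieved" in facts else 1.0

def hybrid_loss(history, facts):
    """L : (A0×O0)^≤ω × elΓ → R. Monotone in the second argument only,
    since only the physicalist summand depends on the facts."""
    return cartesian_loss(history) + physicalist_loss(facts)
```

The hybrid form inherits monotonicity in its second argument from the physicalist summand, while the Cartesian summand remains unconstrained, mirroring the requirement stated above.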
Unlike physicalist agents, transcartesian agents do incur penalties associated with the description complexity of the bridge rules (for the reference agent). Such an agent can, for example, come to believe a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment is maintained).