Beyond algorithmic equivalence: self-modelling

In the previous post, I discussed ways that the internal structure of an algorithm might, given the right normative assumption, allow us to distinguish bias from reward.
Here I’ll be pushing the modelling a bit further.
Self-modelling
Consider the same toy anchoring-bias problem as in the previous post, with the human algorithm H, some object X, a random integer 0≤n≤99, and an anchoring bias given by
H(X,n) = (3/4)V(X) + (1/4)n,
for V some valuation function that is independent of n.
On these inputs, the internal structure of H is simple: compute V(X), mix in the anchor n, and output the weighted sum (3/4)V(X) + (1/4)n.
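A minimal Python sketch of that structure, with V as a hypothetical placeholder valuation (the constant is illustrative only):

```python
def V(X):
    """Hypothetical valuation of object X, assumed independent of n."""
    return 40.0  # placeholder value, purely for illustration

def H(X, n):
    """H's actual behaviour: the valuation V(X), anchored on the random n."""
    assert 0 <= n <= 99
    return 0.75 * V(X) + 0.25 * n
```

For instance, H("car", 0) returns 30.0 while H("car", 99) returns 54.75, even though V("car") is unchanged: the anchor alone moves the output.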
However, H is capable of self-modelling, to allow it to make long-term decisions. At time t, H models itself at time t+1 as computing V(X) alone.
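A sketch of that self-model, reusing the hypothetical V from above:

```python
def H_self_model(X, n):
    """What H, at time t, predicts its own time t+1 output will be.
    The prediction is just V(X); n plays no role in it."""
    return V(X)
```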
Note that H is in error here: it doesn’t take into account the influence of n on its own behaviour.
In this situation, it could be justifiable to say that H's self-model is the correct model of its own values. And, in that case, the anchoring bias can safely be dismissed as a bias.
Self-model and preparation
Let's make the previous setup a bit more complicated, and consider that sometimes the agent H is aware of the effect of n, and sometimes they aren't.
At time t, H also has an extra choice of action: either n, which will block their future self from seeing n, or ∅, which will let things proceed as normal. Suppose further that whenever H is aware of the effect of n, they take action n, and that whenever they aren't aware of it, they take action ∅, as sketched below.
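A minimal Python sketch of this setup, again reusing the hypothetical V from above; the labels "block_n" and "null" are my own stand-ins for the actions n and ∅:

```python
def choose_action(aware_of_n_effect):
    """At time t: block the anchor if aware of its effect, else do nothing."""
    return "block_n" if aware_of_n_effect else "null"

def H_next(X, n, action):
    """H's behaviour at time t+1, given the action taken at time t."""
    if action == "block_n":
        return V(X)                     # the future self never sees n
    return 0.75 * V(X) + 0.25 * n       # anchoring proceeds as normal
```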
Then it seems very justifiable to see H as opposing the anchoring effect in themselves, and thus to classify it as a bias rather than a value/preference/reward.
The philosophical position
The examples in this post seem stronger than in the previous one, in terms of justifying “the anchoring bias is actually a bias”.
More importantly, there is a philosophical justification, not just an ad hoc one. We are assuming that H has a self-model of their own values: a model of what is a value and what is a bias in their own behaviour.
Then we can define the reward of H as the reward that H models itself as having.
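In the toy example's notation, one way to write this definition (my formalisation, not spelled out above) is:

R_H(X) := V(X),

so that the residual H(X,n) − R_H(X) = (1/4)(n − V(X)) is exactly the part classified as bias.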
In subsequent posts, I’ll explore whether this definition is justified, how to access these self-models, and what can be done about errors and contradictions in self-models.