Beyond algorithmic equivalence: self-modelling

In the previous post, I discussed ways that the internal structure of an algorithm might, given the right normative assumption, allow us to distinguish bias from reward.
Here I’ll be pushing the modelling a bit further.
Self-modelling
Consider the same toy anchoring-bias problem as in the previous post, with the human algorithm H, some object X, a random integer 0≤n≤99, and an anchoring bias given by
H(X,n) = (3/4)V(X) + (1/4)n,
for V some valuation function that is independent of n.
On these inputs, the internal structure of H is simple: compute V(X), mix in the anchor n, and output the weighted sum (3/4)V(X) + (1/4)n.
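A minimal Python sketch of that structure, with V as a hypothetical placeholder valuation (the constant is illustrative only):

```python
def V(X):
    """Hypothetical valuation of object X, assumed independent of n."""
    return 40.0  # placeholder value, purely for illustration

def H(X, n):
    """H's actual behaviour: the valuation V(X), anchored on the random n."""
    assert 0 <= n <= 99
    return 0.75 * V(X) + 0.25 * n
```

For instance, H("car", 0) returns 30.0 while H("car", 99) returns 54.75, even though V("car") is unchanged: the anchor alone moves the output.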
However, H is capable of self-modelling, to allow it to make long-term decisions. At time t, H models itself at time t+1 as computing V(X) alone.
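A sketch of that self-model, reusing the hypothetical V from above:

```python
def H_self_model(X, n):
    """What H, at time t, predicts its own time t+1 output will be.
    The prediction is just V(X); n plays no role in it."""
    return V(X)
```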
Note that H is in error here: it doesn’t take into account the influence of n on its own behaviour.
In this situation, it could be justifiable to say that H's self-model is the correct model of its own values. And, in that case, the anchoring bias can safely be dismissed as a bias.
Self-model and preparation
Let's make the previous setup a bit more complicated, and consider that sometimes the agent H is aware of the effect of n, and sometimes they aren't.
At time t, H also has an extra choice of action: either n, which will block their future self from seeing n, or ∅, which will let things proceed as normal. Suppose further that whenever H is aware of the effect of n, they take action n, and that whenever they aren't aware of it, they take action ∅, as sketched below.
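A minimal Python sketch of this setup, again reusing the hypothetical V from above; the labels "block_n" and "null" are my own stand-ins for the actions n and ∅:

```python
def choose_action(aware_of_n_effect):
    """At time t: block the anchor if aware of its effect, else do nothing."""
    return "block_n" if aware_of_n_effect else "null"

def H_next(X, n, action):
    """H's behaviour at time t+1, given the action taken at time t."""
    if action == "block_n":
        return V(X)                     # the future self never sees n
    return 0.75 * V(X) + 0.25 * n       # anchoring proceeds as normal
```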
Then it seems very justifiable to see H as opposing the anchoring effect in themselves, and thus to classify it as a bias rather than a value/preference/reward.
The philosophical position
The examples in this post seem stronger than in the previous one, in terms of justifying “the anchoring bias is actually a bias”.
More importantly, there is a philosophical justification, not just an ad hoc one. We are assuming that H has a self-model of their own values: a model of what is a value and what is a bias in their own behaviour.
Then we can define the reward of H as the reward that H models itself as having.
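In the toy example's notation, one way to write this definition (my formalisation, not spelled out above) is:

R_H(X) := V(X),

so that the residual H(X,n) − R_H(X) = (1/4)(n − V(X)) is exactly the part classified as bias.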
In subsequent posts, I’ll explore whether this definition is justified, how to access these self-models, and what can be done about errors and contradictions in self-models.