Beyond algorithmic equivalence: self-modelling

In the previous post, I discussed ways that the internal structure of an algorithm might, given the right normative assumptions, allow us to distinguish bias from reward.

Here I'll be pushing the modelling a bit further.


Consider the same toy anchoring-bias problem as the previous post, with the human algorithm $H$, some object $a$, a random integer $z$, and an anchoring bias given by

$$H(a, z) = v(a) + \delta z,$$

for some valuation function $v$ that is independent of $z$, and some constant $\delta > 0$.
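As a minimal sketch of this setup (the objects, the valuation function, and the anchoring weight of 0.1 are all illustrative assumptions, not taken from the post):

```python
# Toy model of an anchored human valuation.
# Everything concrete here is an assumption for illustration:
# the objects, the valuation function v, and the weight DELTA.

DELTA = 0.1  # assumed strength of the anchoring effect

def v(a):
    """True valuation of object a, independent of the anchor z."""
    return {"mug": 5.0, "lamp": 20.0}[a]

def H(a, z):
    """The human's stated valuation: true value plus an anchoring term."""
    return v(a) + DELTA * z

print(H("mug", 0))    # no anchor: prints 5.0
print(H("mug", 100))  # a high anchor inflates the stated value
```

The anchoring shows up as a dependence of the output on $z$, even though the underlying valuation has none.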

On these in­puts, the in­ter­nal struc­ture of is:

However, $H$ is capable of self-modelling, which allows it to make long-term decisions. At time $t$, $H$ models itself at time $t+1$ as:

$$H_{t+1}(a, z) = v(a).$$

Note that $H$ is in error here: its self-model doesn't take into account the influence of $z$ on its own behaviour.
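This mismatch between self-model and actual behaviour can be made concrete in a short sketch (the valuation function and anchoring weight are illustrative assumptions):

```python
DELTA = 0.1  # assumed anchoring weight (illustrative)

def v(a):
    """True valuation, independent of the anchor z."""
    return {"mug": 5.0}[a]

def H_actual(a, z):
    """What the human will actually output at the later time."""
    return v(a) + DELTA * z

def H_self_model(a, z):
    """What the human predicts it will output: the anchor z is ignored."""
    return v(a)

# The self-model's error is exactly the unmodelled anchoring term.
a, z = "mug", 100
error = H_actual(a, z) - H_self_model(a, z)
print(error)
```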

In this situation, it could be justifiable to say that $H$'s self-model is the correct model of its own values. And, in that case, the anchoring effect can safely be dismissed as a bias.

Self-model and preparation

Let's make the previous setup a bit more complicated, and consider that, sometimes, the agent is aware of the effect of $z$, and sometimes they aren't.

At time $t$, they also have an extra choice of action: either $B$, which will block their future self from seeing $z$, or $N$, which will let things proceed as normal. Suppose further that whenever $H$ is aware of the effect of $z$, they take action $B$:

$$H_t \to B, \qquad H_{t+1}(a, z) = v(a).$$

And when $H$ isn't aware of the effect of $z$, they take no action/take $N$:

$$H_t \to N, \qquad H_{t+1}(a, z) = v(a) + \delta z.$$
Then it seems very justifiable to see $H$ as opposing the anchoring effect in themselves, and thus to classify it as a bias rather than a value/preference/reward.

The philosophical position

The examples in this post seem stronger than those in the previous one, in terms of justifying the claim that "the anchoring bias is actually a bias".

More importantly, there is a philosophical justification, not just an ad hoc one. We are assuming that $H$ has a self-model of their own values: they have a model of what is a value and what is a bias in their own behaviour.

Then we can define the reward of $H$ as the reward that $H$ models itself as having.
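In the toy sketch, this definition amounts to reading the reward off the self-model rather than off the behaviour (names and numbers are illustrative assumptions):

```python
def v(a):
    """Valuation the agent's self-model assigns, independent of any anchor."""
    return {"mug": 5.0}[a]

def H_self_model(a, z):
    """The agent models itself as outputting v(a), ignoring z."""
    return v(a)

def inferred_reward(a):
    """The reward we attribute to the agent: whatever its self-model assigns."""
    return H_self_model(a, z=0)  # independent of z by construction

print(inferred_reward("mug"))  # prints 5.0
```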

In subsequent posts, I'll explore whether this definition is justified, how to access these self-models, and what can be done about errors and contradictions in self-models.