# Beyond algorithmic equivalence: self-modelling

In the previous post, I discussed ways that the internal structure of an algorithm might, given the right normative assumption, allow us to distinguish bias from reward.

Here I’ll be pushing the modelling a bit further.

## Self-modelling

Consider the same toy anchoring-bias problem as the previous post, with the human algorithm $H$, some object $o$, a random integer $r$, and an anchoring bias given by

$$H(o, r) = V(o) + r,$$

for some valuation function $V$ that is independent of $r$.

On these inputs, the internal structure of $H$ is: first compute the valuation $V(o)$, then read the anchor $r$, and output the sum $V(o) + r$.

However, $H$ is capable of self-modelling, to allow it to make long-term decisions. At time $t$, $H$ models itself at time $t+1$ as computing

$$H(o, r) = V(o).$$

Note that $H$ is in error here: it doesn’t take into account the influence of $r$ on its own behaviour.
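
To make this concrete, here is a minimal Python sketch of the toy model (the function names and example values are my own, purely illustrative): `human_choice` is what $H$ actually computes on these inputs, while `self_model` is what $H$ believes it will compute at the next time step.

```python
import random

def true_valuation(obj):
    """V(o): the anchor-independent valuation function (illustrative stand-in)."""
    return {"mug": 5.0, "book": 12.0}[obj]

def human_choice(obj, anchor):
    """H(o, r): what H actually outputs -- the valuation shifted by the anchor."""
    return true_valuation(obj) + anchor

def self_model(obj, anchor):
    """H's own model of its future output: the anchor's influence is missing."""
    return true_valuation(obj)

r = random.randint(0, 10)                         # the random anchor r
print("actual output:", human_choice("mug", r))   # V(o) + r
print("self-modelled:", self_model("mug", r))     # V(o)
```

The discrepancy between the two functions is exactly the anchoring term, which is what the normative assumption here would label a bias.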

In this situation, it could be justifiable to say that $H$’s self-model is the correct model of its own values. And, in that case, the anchoring effect can safely be dismissed as a bias.

## Self-model and preparation

Let’s make the previous setup a bit more complicated, and consider that, sometimes, the agent is aware of the effect of $r$, and sometimes they aren’t.

At time $0$, they also have an extra action choice: either $B$, which will block their future self from seeing $r$, or $N$, which will proceed as normal. Suppose further that whenever $H$ is aware of the effect of $r$, they take action $B$: their future self then never sees $r$, and outputs $V(o)$.

And when $H$ isn’t aware of the effect of $r$, they don’t take any action/take $N$: their future self sees $r$ as usual, and outputs $V(o) + r$.

Then it seems very justifiable to see $H$ as opposing the anchoring effect in themselves, and thus to classify it as a bias rather than a value/preference/reward.
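
Continuing the same illustrative sketch (again, the names `time0_action`, `aware`, and the action labels are assumptions of mine, not notation from the post), the two-step setup looks roughly like this: at time $0$ the agent chooses whether to block the anchor, and at time $1$ it values the object accordingly.

```python
import random

def true_valuation(obj):
    """V(o): the anchor-independent valuation (illustrative stand-in)."""
    return {"mug": 5.0, "book": 12.0}[obj]

def time0_action(aware):
    """At time 0: an agent aware of the anchoring effect takes B (block); otherwise N."""
    return "B" if aware else "N"

def time1_valuation(obj, anchor, action):
    """At time 1: if the anchor was blocked, the future self never sees it."""
    if action == "B":
        return true_valuation(obj)           # anchor blocked: outputs V(o)
    return true_valuation(obj) + anchor      # proceeds as normal: V(o) + r

r = random.randint(0, 10)
for aware in (True, False):
    action = time0_action(aware)
    print(f"aware={aware}, action={action}, valuation={time1_valuation('mug', r, action)}")
```

Only the time-0 choice differs between the two runs; the time-1 valuation function is the same, so the difference in output isolates the anchoring term.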

## The philosophical position

The examples in this post seem stronger than those in the previous one, in terms of justifying “the anchoring bias is actually a bias”.

More importantly, there is a philosophical justification, not just an ad hoc one. We are assuming that $H$ has a self-model of their own values: they have a model of what is a value and what is a bias in their own behaviour.

Then we can define the reward of $H$ as the reward that $H$ models itself as having.
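
In the notation of the toy example above (this formalisation is mine, not a formula from the post), the definition reads

$$R(H) := V,$$

the valuation function appearing in $H$’s self-model; the anchor term $+r$ in $H$’s actual behaviour is then classified as bias rather than reward.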

In subsequent posts, I’ll explore whether this definition is justified, how to access these self-models, and what can be done about errors and contradictions in them.

• I agree that the agent should be able to make a decent effort at telling us which of its drives are biases (/addictions) versus values. One complicating factor is that agents change their opinions about these matters over time. Imagine a philosopher who uses the drug heroin. They may very well vacillate on whether heroin satisfies their full-preferences, even if the experience of taking heroin is not changing. This could happen via introspection, via philosophical investigation, via examining fMRI scans, et cetera. It’s tricky for the human to state their biases with confidence because they may never know when they are done updating on the matter.

Intuitively, an agent might want the AI system to do this examination and then to maximize whatever turns out to be valuable. That is, you might want the bias-model to be the one that you would settle on if you thought for a long time, similarly to enlightened self-interest / extrapolated volition models. Similar problems ensue: e.g., this process may diverge. Or it may be fundamentally indeterminate whether some drives are values or biases.

• >One complicating factor is that agents change their opinions about these matters over time.

Yep! This is one of the major issues, and one that I’ll try to model in a soon-to-come post. The whole issue of rigged and influenceable learning processes is connected with trying to learn the preferences of such an agent.

>Or it may be fundamentally indeterminate whether some drives are values or biases.

I think it’s fundamentally indeterminate in principle, but we can make some good judgements in practice.

• Ooooh, I like where this is going. I realize you still have more to develop on this idea, but is your thought that this could replace the use of objective reward functions that exist outside the agent?