# Figuring out what Alice wants, part II

This post continues the analysis started in the previous post. Here I will present some examples of algorithms operating in certain environments, seeing how they model certain things, and using that to conclude facts about their preferences/goals/reward functions.

I’ll be looking first at the Poker problem with unknown motivations presented here, and secondly at a variant of the Codenames game.

## Different algorithms, same outputs, different goals

In the Poker example, we are unsure whether Alice wants to win the hand against Bob, for money, or lose the hand to get into Bob’s good graces. She herself is unsure what Bob’s cards are; Bob has been playing confidently, but there is only one card combination that would allow him to win.

The inputs are Alice’s own cards, the five cards on the board, and Bob’s behaviour this hand. There are two heuristics called by the algorithm: the first computes the probability of Alice winning by assuming Bob has a random hand, and the second assesses the likelihood of Alice winning by looking at Bob’s behaviour (at this point you should start worrying about the suggestive names and descriptions I’m giving to all these elements).

Now, in the situ­a­tion we find our­selves, what we want to say is that if is close to , then dom­i­nates, and will be high. If is close to , then it will be low, as dom­i­nates the ex­pres­sion. Since there is a in line , we want to say that this Alice Poker al­gorithm is try­ing to win, but, de­pend­ing on the value of , has differ­ent be­liefs about what ac­tion is likely to max­imise ex­pected money.

Similarly, if it were a in line , we’d like to say that the Alice Poker al­gorithm wants to lose. And this, even though the and al­gorithms would both fold (while and would both call).
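To make the "same outputs, different goals" point concrete, here is a minimal Python sketch. It is not the algorithm from this post: the blending rule, the names `p_random`, `p_behaviour`, `x` and `choose`, and the pot and call sizes are all assumptions made up for illustration.

```python
def alice_poker(p_random, p_behaviour, x, choose):
    """Hypothetical sketch: blend two win estimates, then pick an action.

    p_random:    assumed estimate of Alice winning against a random hand
    p_behaviour: assumed estimate of Alice winning given Bob's behaviour
    x:           weight in [0, 1]; near 1 trusts p_random, near 0 trusts p_behaviour
    choose:      max (seeks money) or min (avoids money)
    """
    p_win = x * p_random + (1 - x) * p_behaviour
    pot, cost = 100.0, 50.0  # made-up stakes
    expected_money = {
        "call": p_win * pot - (1 - p_win) * cost,
        "fold": 0.0,
    }
    # Pick the action with the highest (max) or lowest (min) expected money.
    return choose(expected_money, key=expected_money.get)


# Alice's hand looks strong against a random hand, but Bob is acting confident.
for choose in (max, min):
    for x in (0.9, 0.1):
        print(choose.__name__, x, alice_poker(0.9, 0.1, x, choose))
```

Under these assumed numbers, the `max` variant with a low `x` and the `min` variant with a high `x` both fold, while the other two both call. Looking only at the action taken does not separate "wants to win, believes Bob is strong" from "wants to lose, believes Bob is weak".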

But this relies too much on the interpretation of what the terms mean and what the heuristics are meant to do. Now, there isn’t much flexibility in interpreting what the first heuristic is or does: as the quality of Alice’s hand increases, relative to a random hand and the given board, its output increases. Thus it is ‘clearly’ measuring relative hand quality.

But what of the second heuristic? All that we know is that it outputs a number that increases when Bob appears confident, and decreases when he appears worried. I’ve said that this is supposed to measure how good Bob’s hand is, based on how he behaved. But how do we know that? Maybe Alice views Bob as an effective bluffer (or a level 2n+1 meta-bluffer), so that a high output actually means that she expects Bob to have a poor hand. In that case the algorithm would still fold, but it would be folding to lose against Bob, not folding to win.

This brings us back to some of the oldest problems in AI (and some of the newest). Namely, what are the semantics and interpretation of what an algorithm is doing? It’s been argued that you cannot derive the semantics of an algorithm, only the syntax. I’ve disagreed with that, arguing that when the internal syntax is sufficiently rich and detailed, and the agent relates well to the real world, then there can be only a few semantic interpretations of the symbols that make any sense. In more modern terms, this can be seen as the problem of algorithm interpretability, especially when it is applied to out-of-training-set distributions. If a certain neuron triggers when seeing photos of dogs, including photos far from its training distribution, then that neuron is at least superficially connected to the concept of “dog” (or at least “dog photo”).

So, what would cause us to believe that the second heuristic is actually trying to predict Bob’s cards from his reactions? Well, suppose that it was a learning algorithm, and it updated the correct way: when Bob is revealed to have had a good hand, it updates towards a higher value on that input set, and vice versa. Or suppose that it also took the algorithm’s other inputs, Alice’s cards and the board, as inputs, and was higher if Alice’s cards were better. Then it would seem more justified to see that heuristic as actually trying to estimate the probability of Alice’s cards beating Bob’s.
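The first of these conditions, a heuristic that learns in the right direction, might look something like the sketch below. Everything in it is an assumption for illustration: the `update` function, the behaviour labels, the 0.5 starting value and the learning rate are all hypothetical, and the stored number is read as "how good Bob's hand is given this behaviour".

```python
def update(estimates, behaviour, bob_had_good_hand, rate=0.2):
    """Hypothetical update rule: nudge the estimate towards what was revealed."""
    target = 1.0 if bob_had_good_hand else 0.0
    old = estimates.get(behaviour, 0.5)  # assumed neutral starting point
    estimates[behaviour] = old + rate * (target - old)
    return estimates


estimates = {}
update(estimates, "confident", True)   # Bob acted confident and had a good hand
update(estimates, "confident", True)
update(estimates, "worried", False)    # Bob acted worried and had a poor hand
print(estimates)
```

If the heuristic moves this way in response to revealed hands, reading it as an estimate of Bob's cards becomes much harder to dispute. Flipping the sign of the update would fit the "Bob is a bluffer" reading instead.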

## Codenames and semantics

In Codenames, one player on a team (Spymaster) tries to signal a card (or a collection of cards) to their teammates. These cards have words on them, and the Spymaster names another word related to the targets. The Spymaster also gives a number, to say how many cards this word refers to.

Suppose there are four remaining codewords, “Dolphin”, “New York”, “Comet”, and “Lice”. The Spymaster is Bob, and has only one card remaining to signal; the other player is Alice, who has to interpret Bob’s signal.

Note that be­fore re­ceiv­ing Bob’s word, the Alice al­gorithm prob­a­bly doesn’t have time to run through all the pos­si­ble words he could give. There­fore, in terms of the first post, in the model frag­ment Alice has of Bob, she ex­pects to be sur­prised by his ac­tion (it’s mainly for this rea­son that I’m in­tro­duc­ing this ex­am­ple).

Any­way, af­ter re­ceiv­ing Bob’s mes­sage—which we will as­sume is “Aquatic, 1 card”—Alice runs the fol­low­ing al­gorithm:

The algorithm is very simple: it models Bob as having a wordmap in his head that measures the distance between words and concepts, and assumes that the word he speaks is closest to the answer among the remaining codewords.

Let’s assume that, through saying “Aquatic”, Bob is trying to signal “Dolphin”. But let’s assume that Alice’s model of Bob’s wordmap is wrong: it computes that “New York” is the closest word to “Aquatic” (New York is a coastal city, after all).
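As a rough illustration of this setup, here is a minimal Python sketch of a "pick the closest remaining codeword" rule. It is not the algorithm from this post: the function name `alice_guess`, the dictionary representation of a wordmap, and all the distance values are invented for the example.

```python
def alice_guess(clue, remaining, wordmap):
    """Return the remaining codeword closest to the clue under a given wordmap."""
    return min(remaining, key=lambda word: wordmap[(clue, word)])


remaining = ["Dolphin", "New York", "Comet", "Lice"]

# Made-up distances (smaller means closer). The first stands in for Bob's
# actual wordmap, the second for Alice's flawed model of it.
bob_wordmap = {
    ("Aquatic", "Dolphin"): 0.1,
    ("Aquatic", "New York"): 0.6,
    ("Aquatic", "Comet"): 0.9,
    ("Aquatic", "Lice"): 0.8,
}
alice_model_of_bob = {
    ("Aquatic", "Dolphin"): 0.5,
    ("Aquatic", "New York"): 0.3,
    ("Aquatic", "Comet"): 0.9,
    ("Aquatic", "Lice"): 0.8,
}

print(alice_guess("Aquatic", remaining, bob_wordmap))         # Dolphin
print(alice_guess("Aquatic", remaining, alice_model_of_bob))  # New York
```

The decision rule is identical in both calls; only the wordmap differs. That is why a wrong guess on its own cannot tell us whether Alice is trying to lose or simply has a bad model of Bob.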

Let’s further assume that Alice has a learning component to her algorithm, so her model of Bob’s wordmap gets updated. But the updating is terrible: it actually moves further away from the true wordmap with every data point.

In view of this, can we still say that Alice is trying to win the game, but just has a terrible model of her teammate?

Per­haps. The data above is not enough to say that—if we re­named as it would seem to be more fit­ting. But sup­pose that was not a sin­gle al­gorithm, but a sub-com­po­nent of a much more com­pli­cated rou­tine called . And sup­pose that this com­po­nent was de­cently con­nected to Bob—Alice used it when­ever she thinks about what Bob does, it is ac­ti­vated when­ever she sees Bob, and it is closely se­man­ti­cally con­nected to her ex­pe­rience of in­ter­act­ing with Bob.

Then we might be able to say that, though is ac­tu­ally a ter­rible ver­sion of Bob’s men­tal wordmap, the whole is suffi­ciently clearly a model of Bob, and , in other con­texts, is suffi­ciently clearly a wordmap, that we can say that Alice is try­ing and failing to model Bob’s think­ing (rather than de­liber­ately failing the game).

These are the sort of “interpretability” analyses that we will have to do to figure out what the algorithms in our brains actually want.

## fMRI semantics

A final, minor point: things like fMRIs can help bridge the semantic-syntax gap in humans, as they can, to some extent, literally see how certain concepts, ideas, or images are handled by the brain. This could be the human equivalent of having the algorithm laid out as above.
