Karma: 21,591

# All I know is Goodhart

21 Oct 2019 12:12 UTC
11 points

# Full toy model for preference learning

16 Oct 2019 11:06 UTC
11 points
• You are entirely correct; I don’t know why I was confused.

However, looking at the proof again, it seems there might be a potential hole. You use Löb’s theorem within an assumption sub-loop. This seems to assume that from “”, we can deduce “”.

But this cannot be true in general! To see this, set . Then , trivially; if, from that, we could deduce , we would have  for any . But this statement, though it looks like Löb’s theorem, is one that we cannot deduce in general (see Eliezer’s “medium-hard problem” here).

Can this hole be patched?

(note that if , where  is a PA proof system that adds A as an extra axiom, then we can deduce ).

• If the agent’s reasoning is sensible only under certain settings of the default action clause

That was my first rewriting; the second is an example of a more general algorithm, which would go something like this. If we assume that both probabilities and utilities are discrete, all of the form q/n for some integer q, and bounded above and below by N, then the algorithm is (for EU the expected utility, Actions the set of actions, and b some default action):

for q integer from N*n^2 down to -N*n^2 (ordered from highest to lowest):
    for a in Actions:
        if A()=a ⊢ EU=q/n^2 then output a (and halt)
output b (only reached if no implication was ever proved)


Then the Löbian proof fails. The agent will fail to prove any of those “if” implications, until it proves “A()=‘not cross’ ⊢ EU=0”. Then it outputs “not cross”; the default action b is not relevant. Also not relevant, here, is the order in which a is sampled from “Actions”.
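A minimal executable sketch of this search, with a stub `provable` oracle standing in for actual proof search; the function names and the toy oracle are hypothetical illustrations, not from the post:

```python
def search_agent(actions, default, provable, N=10, n=1):
    """Scan candidate expected utilities q/n^2 from highest to lowest;
    output the first action a for which "A()=a implies EU=q/n^2" is
    provable. The default b is used only if nothing is ever proved."""
    for q in range(N * n**2, -N * n**2 - 1, -1):
        for a in actions:
            if provable(a, q / n**2):
                return a
    return default

# Toy oracle: pretend only "A()='not cross' implies EU=0" is provable.
def toy_provable(action, eu):
    return action == "not cross" and eu == 0

print(search_agent(["cross", "not cross"], "cross", toy_provable))
# → not cross
```

As in the comment, the default is never reached here: the search halts at the first provable implication, and the order of actions within a utility level doesn’t matter in this example.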

# Toy model #6: Rationality and partial preferences

2 Oct 2019 12:04 UTC
11 points
• Interesting.

I have two issues with the reasoning as presented; the second one is more important.

First of all, I’m unsure about “Rather, the point is that the agent’s ‘counterfactual’ reasoning looks crazy.” I think we don’t know the agent’s counterfactual reasoning. We know, by Löb’s theorem, that “there exists a proof that (proof of L implies L)” implies “there exists a proof of L”. It doesn’t tell us what structure this proof of L has to take, right? Who knows what counterfactuals are being considered to make that proof? (I may be misunderstanding this.)

Second of all, it seems that if we change the last line of the agent to [else, “cross”], the argument fails. Same if we insert [else if A()=”cross” ⊢ U=-10, then output “cross”; else if A()=”not cross” ⊢ U=-10, then output “not cross”] above the last line. In both cases, this is because U=-10 is now possible, given crossing. I’m suspicious when the argument seems to depend so much on the structure of the agent.

To develop that a bit, it seems the agent’s algorithm as written implies “If I cross the bridge, I am consistent” (because U=-10 is not an option). If we modify the algorithm as I just suggested, that’s no longer the case; it can consider counterfactuals where it crosses the bridge and is inconsistent (or, at least, of unknown consistency). So, given that, the agent’s counterfactual reasoning no longer seems so crazy, even if it is as claimed. That’s because the agent’s reasoning needs to deduce something from “If I cross the bridge, I am consistent” that it can’t deduce without that statement. Given that statement, being Löbian or similar seems quite natural, as those are some of the few ways of dealing with statements of that type.

• Bayesian agents that knowingly disagree

A minor stub, caveating Aumann’s agreement theorem; put here to reference in future posts, if needed.

Aumann’s agreement theorem states that rational agents with common knowledge of each other’s beliefs cannot agree to disagree. If they exchange their estimates, they will swiftly come to an agreement.

However, that doesn’t mean that agents cannot disagree; indeed they can disagree, and know that they disagree. For example, suppose that there are a thousand doors; behind 999 of these there are goats, and behind one there is a flying aircraft carrier. The two agents are in separate rooms, and a host will go into each room and execute the following algorithm: they will choose a door at random among the 999 that contain a goat. And, with probability p, they will tell that door number to the agent; with probability 1−p, they will tell the door number with the aircraft carrier.

Then each agent will have probability 1−p of the named door being the aircraft carrier door, and probability p/999 on each of the other doors; so, as long as p < 999/1000, the most likely door is the one named by the host.

We can modify the protocol so that the host will never name the same door to each agent (roll a D100; if it comes up 1, tell the truth to the first agent and lie to the second; if it comes up 2, do the opposite; anything else means telling a different lie to either agent). In that case, each agent will have a best guess for the aircraft carrier, and the certainty that the other agent’s best guess is different.

If the agents exchanged information, they would swiftly converge on the same distribution; but until that happens, they disagree, and know that they disagree.
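The posteriors can be checked directly. The specific truth-telling chance of 1/2 below is an assumed value for illustration (the stub doesn’t fix it); the Bayes step uses the fact that every door is named with the same prior probability 1/1000, so the likelihoods pass straight through:

```python
from fractions import Fraction

DOORS = 1000              # 999 goats, one flying aircraft carrier
p = Fraction(1, 2)        # assumed chance the host names a random goat door

# P(any particular door is named) = 1/1000 regardless of where the
# carrier is, so posteriors equal the conditional naming probabilities.
p_named = 1 - p           # posterior that the named door hides the carrier
p_other = p / (DOORS - 1) # posterior for each of the 999 unnamed doors

assert p_named + (DOORS - 1) * p_other == 1  # posteriors sum to 1
print(p_named, p_other)   # → 1/2 1/1998
```

With p = 1/2, the named door carries probability 1/2 against 1/1998 for every other door, so each agent’s best guess is the door its own host named, and (under the modified protocol) each knows the other’s best guess differs.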

# Stuart_Armstrong’s Shortform

30 Sep 2019 12:08 UTC
9 points

# Toy model piece #5: combining partial preferences

12 Sep 2019 3:31 UTC
12 points

# Toy model piece #4: partial preferences, re-re-visited

12 Sep 2019 3:31 UTC
9 points
• I like this analogy. Probably not best to put too much weight on it, but it has some insights.

• And whether those programs could then perform well if their opponent forces them into a very unusual situation, such as would never have appeared in a chessmaster game.

If I sacrifice a knight for no advantage whatsoever, will the opponent be able to deal with that? What if I set up a trap to capture a piece, relying on my opponent not seeing the trap? A chessmaster playing another chessmaster would never play a simple trap, as it would never succeed; so would the ML be able to deal with it?

• PS: the other title I considered was “Why do people feel my result is wrong”, which felt too condescending.

• I agree we’re not as good as we think we are. But there are a lot of things we do agree on that seem trivial: e.g. “this person is red in the face, shouting at me, and punching me; I deduce that they are angry and wish to do me harm”. We have far, far more agreement than random agents would.

• Your title seems clickbaity

Hehe—I don’t normally do this, but I feel I can indulge once ^_^

having implicit access to categorisation modules that themselves are valid only in typical situations… is not a way to generalise well

How do you know this?

Moravec’s paradox again. Chessmasters didn’t easily program chess programs; and those chess programs didn’t generalise to games in general.

Should we turn this into one of those concrete ML experiments?

That would be good. I’m aiming to have a lot more practical experiments from my research project, and this could be one of them.

# Is my result wrong? Maths vs intuition vs evolution in learning human preferences

10 Sep 2019 0:46 UTC
19 points

# Simple and composite partial preferences

9 Sep 2019 23:07 UTC
11 points
• Hum… It seems that we can stratify here. Let  represent the values of a collection of variables that we are uncertain about, and that we are stratifying on.

When we compute the normalising factor for utility under two policies and , we normally do it as:

• , with .

And then we replace  with .

Instead we might normalise the utility separately for each value of :

• Conditional on , then , with .

The problem is that, since we’re dividing by the , the expectation of  is not the same .

Is there an obvious improvement on this?

Note that here, total utilitarianism gets less weight in large universes, and more in small ones.

I’ll think more...

• How about a third AI that gives a (hidden) probability about which one you’ll be convinced by, conditional on which argument you see first? That hidden probability is passed to someone else, then the debate is run, and the result recorded. If that third AI gives good calibration and good discrimination over multiple experiments, then we can consider its predictions accurate in the future.
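One way the check could be run, assuming the hidden forecasts and debate outcomes are logged as (probability, 0/1) pairs; the Brier score and the binning scheme below are my choice of concrete scoring method, not something specified in the comment:

```python
def evaluate_forecasts(pairs, bins=10):
    """Return the Brier score (lower is better) and a calibration table
    mapping each probability bin to (mean forecast, empirical frequency);
    a well-calibrated forecaster has those two roughly equal per bin."""
    n = len(pairs)
    brier = sum((p - o) ** 2 for p, o in pairs) / n
    buckets = {}
    for p, o in pairs:
        buckets.setdefault(min(int(p * bins), bins - 1), []).append((p, o))
    calib = {b: (sum(p for p, _ in v) / len(v),
                 sum(o for _, o in v) / len(v))
             for b, v in buckets.items()}
    return brier, calib

# Toy log where stated probabilities match observed frequencies.
brier, calib = evaluate_forecasts([(0.9, 1), (0.9, 1), (0.1, 0), (0.1, 0)])
print(round(brier, 2))  # → 0.01
```

Discrimination could be read off the same table: a discriminating forecaster spreads its probabilities away from the base rate, rather than always predicting the average.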

• Er, this normalisation system may well solve that problem entirely. If  prefers option  (utility ), with second choice  (utility ), and all the other options as third choice (utility ), then the expected utility of the random dictator is  for all  (as  gives utility , and  gives utility  for all ), so the normalised weighted utility to maximise is:

• .

Using  (because scaling doesn’t change expected utility decisions), the utility of any , , is , while the utility of  is . So if , the compromise option will get chosen.

Don’t confuse the problems of the random dictator with the problems of maximising the weighted sum of the normalisations that used the random dictator (and don’t confuse the other way either; the random dictator is immune to players’ lying, this normalisation is not).