# LawrenceC

Karma: 216
• Hyperbolic discounting leads to preference reversals over time: the classic example is always preferring a certain $1 now to $2 tomorrow, but preferring a certain $2 in a week to $1 in 6 days. This is a pretty clear sign that it never "should" be done: an agent with these preferences might find themselves paying a cent to switch from $1 in 6 days to $2 in 7, then, 6 days later, paying another cent to switch it back and get the $1 immediately. However, in practice, even rational agents might exhibit hyperbolic-discounting-like preferences (though no preference reversals): for example, right now I might not believe you're very trustworthy and worry you might forget to give me money tomorrow, so I prefer $1 now to $2 tomorrow. But if you actually are going to give me $1 in 6 days, I might update towards thinking you're quite trustworthy and then be willing to wait another day to get $2 instead. (See this paper for a more thorough discussion of this possibility: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1689473/pdf/T9KA20YDP8PB1QP4_265_2015.pdf)
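To make the reversal concrete, here's a minimal numeric sketch using the standard hyperbolic discount function V(A, t) = A / (1 + kt); the discount rate k = 1.5 per day is a hypothetical value chosen just to make the numbers work out:

```python
def value(amount, delay_days, k=1.5):
    """Hyperbolically discounted value: amount / (1 + k * delay)."""
    return amount / (1 + k * delay_days)

# Viewed from today, $1 now beats $2 tomorrow:
print(value(1, 0), value(2, 1))   # 1.0 vs 0.8   -> take the $1 now
# But $2 on day 7 beats $1 on day 6:
print(value(1, 6), value(2, 7))   # 0.1 vs ~0.17 -> wait the extra day
```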

• I believe your definition of accuracy differs from the ISO definition (which is the usage I learned in undergrad statistics classes, and also the usage most online sources seem to agree with): a measurement is accurate insofar as it is close to the true value. By this definition, the second graph is accurate but not precise because all the points are close to the true value. I'll be using that definition in the remainder of my post. That being said, Wikipedia does claim your usage is the more common one.

I don't have a clear sense of how to answer your question empirically, so I'll give a theoretical answer.

Suppose our goal is to predict some value $\theta$. Let $\hat{\theta}$ be our predictor for $\theta$ (for example, we could ask a subject to predict $\theta$). A natural way to measure accuracy for prediction tasks is the mean squared error $\mathbb{E}[(\hat{\theta} - \theta)^2]$, where a lower mean squared error means higher accuracy. The bias-variance decomposition of mean squared error gives us:

$$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 + \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]$$
The first term on the right is the (squared) bias of your estimator: how far the expected value of your estimator is from the true value. An unbiased estimator is one that, in expectation, gives you the right value (what you mean by "accuracy" in your post, and what ISO calls "trueness"). The second term is the variance of your estimator: the expected squared deviation of your estimator from its average value. Rephrasing a bit, this measures how imprecise your estimator is, on average.

As both of the terms on the right are always non-negative, the bias and variance of your estimator both lower-bound your mean squared error.

However, it turns out that there's often a trade-off between having an unbiased estimator and having a more precise estimator, known appropriately as the bias-variance trade-off. In fact, there are many classic examples in statistics of estimators that are biased but have lower MSE than any unbiased estimator. (Here's the first one I found while Googling.)
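As a concrete illustration (a minimal simulation; the true variance and sample size below are arbitrary choices of mine): for normally distributed data, the maximum-likelihood variance estimator, which divides by n, is biased downward, yet has lower MSE than the unbiased estimator, which divides by n-1:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n = 4.0, 5
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))

for ddof, label in [(1, "unbiased (divide by n-1)"),
                    (0, "biased MLE (divide by n)")]:
    est = samples.var(axis=1, ddof=ddof)     # one variance estimate per row
    bias = est.mean() - true_var             # estimated bias
    mse = ((est - true_var) ** 2).mean()     # estimated MSE
    print(f"{label}: bias = {bias:+.2f}, MSE = {mse:.2f}")
```

With these numbers, the biased estimator's MSE is roughly 5.8 against 8 for the unbiased one (and for normal data, dividing by n+1 does better still).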

• For what it's worth, though, as far as I can tell we don't have the ability to create an AI that will reliably maximize the number of paperclips in the real world, even with infinite computing power. As Manfred said, model-based goals seem to be a promising research direction for getting AIs to care about the real world, but we don't currently have the ability to get such an AI to reliably "value paperclips". There are a lot of problems with model-based goals that occur even in the POMDP setting, let alone when the agent's model of the world or observation space can change. So I wouldn't expect anyone to be able to propose a fully coherent, complete answer to your question in the near term.

It might be useful to think about how humans "solve" this problem, and whether or not you can port this behavior over to an AI.

If you're interested in this topic, I would recommend MIRI's paper on value learning as well as the relevant Arbital Technical Tutorial.

• The reason for this is the 5% chance of mistakes. Copycat does worse against both Simpleton and Copycat than Simpleton does against itself.
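If you want to check this yourself, here's a rough simulation sketch, assuming the game's payoff convention (cooperating costs you 1 coin and gives the other player 3) and a 5% chance that each move gets flipped by mistake:

```python
import random

# Payoffs: cooperating costs 1, the other player receives 3.
PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (-1, 3),
          ('D', 'C'): (3, -1), ('D', 'D'): (0, 0)}

def copycat(my_last, opp_last):        # tit-for-tat
    return opp_last

def simpleton(my_last, opp_last):      # win-stay, lose-shift
    return 'C' if my_last == opp_last else 'D'

def play(strat_a, strat_b, rounds=100_000, noise=0.05, seed=0):
    rng = random.Random(seed)
    a_last = b_last = 'C'
    total_a = total_b = 0
    for _ in range(rounds):
        a = strat_a(a_last, b_last)
        b = strat_b(b_last, a_last)
        if rng.random() < noise:       # each move misfires 5% of the time
            a = 'D' if a == 'C' else 'C'
        if rng.random() < noise:
            b = 'D' if b == 'C' else 'C'
        pa, pb = PAYOFF[(a, b)]
        total_a, total_b = total_a + pa, total_b + pb
        a_last, b_last = a, b
    return total_a / rounds, total_b / rounds

print("Copycat vs Copycat:    ", play(copycat, copycat))
print("Copycat vs Simpleton:  ", play(copycat, simpleton))
print("Simpleton vs Simpleton:", play(simpleton, simpleton))
```

The intuition: after a single mistake, two Copycats fall into a long echo of alternating retaliation, while two Simpletons re-establish mutual cooperation within two rounds.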

• I'm really confused by this.

• I think the term "Dark Arts" is used by many in the community to refer to generic, truth-agnostic ways of getting people to change their mind. I agree that Scott Adams demonstrates mastery of persuasion techniques, and that this is indeed not necessarily evidence that he is not a "rationalist".

However, the specific claim made by James_Miller is that it is a "model rationalist disagreement". I think that since Adams used the persuasion techniques that Stabilizer mentioned above, it's pretty clear that it isn't a model rationalist disagreement.

• Awesome! I heard a rumor that David Krueger (one of Bengio's grad students) is one of the main people pushing the safety initiative there. Can anyone confirm?

• Thanks for the review! I definitely had the sense that Rosen was doing a lot of hand-holding and handwaving; it's certainly a very introductory text. I've read both Rosen and Eppstein and actually found Rosen better. The discrete math class I took in college used Scheinerman's Mathematics: A Discrete Introduction, which I also found to be worse than Rosen.

At the time I actually really enjoyed the fact that Rosen went on tangents and helped me learn how to write a proof, since I was relatively lacking in mathematical maturity. I'd add that Rosen does cover proof writing earlier in the book, but I suspect that MCS might do this job better. Given the target audience of the MIRI research guide, I think it makes sense to switch over to MCS from Rosen.

• Thanks Søren! Could I ask what you're planning on covering in the future? Is this mainly going to be a technical or non-technical reading group?

I noticed that your group seems to have covered a lot of the basic readings on AI safety, but I'm curious what your future plans are.

• I haven't heard much about machine learning being used for forecast aggregation. It would seem to me like many, many factors could be useful in aggregating forecasts. For instance, some elements of one's social media profile may be indicative of their forecasting ability. Perhaps information about the educational differences between multiple individuals could provide insight into how correlated their knowledge is.

I think people are looking into it: the Good Judgment Project team used simple machine learning algorithms as part of their submission to IARPA during the ACE tournament. One of the PhD students involved in the project wrote his dissertation on a framework for aggregating probability judgments. In the Good Judgment team at least, people are also interested in using ML for other aspects of prediction (for example, predicting whether a given comment will change another person's forecasts), but I don't think there's been much success.

I think a real problem is the paucity of data for ML-based prediction aggregation compared to most machine learning projects: a good prediction tournament gets a couple hundred forecasts resolving in a year, at most.
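For concreteness, here's a minimal sketch of the kind of simple aggregation approach the GJP work describes: average the individual forecasts in log-odds space, then extremize the pooled result. The extremizing exponent alpha = 2.5 below is a hypothetical value; the right amount of extremizing depends on how much the forecasters' information overlaps.

```python
import numpy as np

def aggregate(probs, alpha=2.5):
    """Pool probability forecasts: mean in log-odds space, then extremize.

    alpha > 1 pushes the pooled forecast away from 50-50, correcting for
    the underconfidence of simple averages of partially independent
    forecasters."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    log_odds = np.log(p / (1 - p))
    return 1 / (1 + np.exp(-alpha * log_odds.mean()))

print(aggregate([0.6, 0.7, 0.65]))   # ~0.83, more extreme than the raw mean
```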

> Probability density inputs would also require additional understanding from users. While this could definitely be a challenge, many prediction markets already are quite complicated, and existing users of these tools are quite sophisticated.

I think this is a bigger hurdle than you'd expect if you're implementing these for prediction tournaments, though it might be possible to do for prediction markets. (However, I'm curious how you're going to implement the market mechanism in this case.) Anecdotally speaking, many of the people involved in GJ Open are not particularly math or tech savvy, even amongst the people who are good at prediction.

• Fair point.

• > I'm just saying that you have an infinite sequence of spheres with the property X. You're saying that because the sequence is infinite I can't point to the last sphere and therefore can't say anything about it. I'm saying that because all spheres in this sequence have the property X, it doesn't matter that the sequence is infinite.

This isn't true in general. Each natural number is finite, but the limit of the natural numbers is infinite. Just because each of the intermediate shapes has property X doesn't mean the limiting shape has property X. Notably, in this case each of the intermediate shapes has a non-zero amount of empty space, but the limiting shape has no empty space.

• Maybe think about the problem this way:

Suppose there was some small ball inside of your super-packed structure that isn't filled. Then we can fill this ball, and so the structure isn't super-packed. It follows that the volume of the empty space inside of your structure has to be 0.

Now, what does your super-packed structure look like, given that it's an empty cube that's been filled?

EDIT: Never mind, just saw that Villiam gave a similar answer.

• I think they're equivalent in a sense, but that bucket diagrams are still useful. A bucket can also occur when you conflate multiple causal nodes. So in the first example, the kid might not even have a conscious idea that there are three distinct causal nodes ("spelled oshun wrong", "I can't write", "I can't be a writer"), but instead treats them as a single node. If you're able to catch the flinch, introspect, and notice that there are actually three nodes, you're already a big part of the way there.

• Thanks for posting this! I have a longer reply to Taleb's post that I'll post soon. But first:

> When you read Silver (or your preferred reputable election forecaster, I like Andrew Gelman) post their forecasts prior to the election, do you accept them as equal or better than any estimate you could come up with? Or do you do a mental adjustment or discounting based on some factor you think they've left out?

I think it depends on the model. First, note that all forecasting models only take into account a specific set of signals. If there are factors influencing the vote that you're aware of and don't think are reflected in the signals, then you should update their forecast to reflect this. For example, because Nate Silver's model was based on polls, which lag behind current events, if you had some evidence that a given event was really bad or really good for one of the two candidates (such as the Comey letter or the Trump video), you should have updated in favor of or against a Trump presidency before the event became reflected in the polls.

> The math is based on assumptions though that with high uncertainty, far out from the election, the best forecast is 50-50.

Not really. The key assumption is that your forecast is a Wiener process: a continuous-time martingale with normally-distributed increments. (I find this funny because Taleb spends multiple books railing against normality assumptions.) This is kind of a troubling assumption, as Lumifer points out below. If your forecast is continuous (though it need not be), then it can be thought of as a time-transformed Wiener process, but as far as I can tell he doesn't account for the time transformation.

Everyone agrees that as uncertainty becomes really high, the best forecast is 50-50. Conversely, if you make a confident forecast (say 90-10) and you're properly calibrated, you're also implying that you're unlikely to change your forecast by very much in the future (with high probability, you won't forecast 1-99).

I think the question to ask is: how much volatility should make you doubt a forecast? If someone's forecast varied daily between 1-99 and 99-1, you might learn to just ignore them, for example. Taleb tries to offer one answer to this, but makes some questionable assumptions along the way, and I don't really agree with his result.
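One way to see the calibration point above is with a quick simulation. The model below is my own sketch, not Taleb's: the underlying signal is a Wiener process X_t, the outcome is whether X_T > 0, and the calibrated forecast is p_t = Φ(X_t/√(T−t)), which is a martingale. Starting from a 90% forecast, only a small fraction of paths ever swing below 1% (optional stopping bounds this fraction by 0.1/0.99 ≈ 10%):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T, n_steps, n_paths = 1.0, 1000, 10_000
dt = T / n_steps

# Start every forecast at 90%: pick X_0 so that P(X_T > 0 | X_0) = 0.9.
x = np.full(n_paths, norm.ppf(0.9) * np.sqrt(T))
ever_below_1pct = np.zeros(n_paths, dtype=bool)

for i in range(n_steps - 1):
    x += rng.normal(0.0, np.sqrt(dt), n_paths)   # evolve the signal
    t = (i + 1) * dt
    p = norm.cdf(x / np.sqrt(T - t))             # calibrated P(X_T > 0 | X_t)
    ever_below_1pct |= (p < 0.01)

print(f"fraction of paths ever forecasting below 1%: {ever_below_1pct.mean():.3f}")
```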

• > We have the election estimate F a function of a state variable W, a Wiener process WLOG

> That doesn't look like a reasonable starting point to me.

That's fine, actually: if you assume your forecasts are continuous in time, then they're continuous martingales and thus equivalent to some time-changed Wiener process. (EDIT: your forecasts need not be continuous, my bad.) The problem is that he doesn't take into account the time transformation when he claims that you need to weight your signal by 1/sqrt(t).

He also has a typo in his statement of Ito's Lemma, which might affect his derivation. I'll check his math later.
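For reference, the usual statement of Ito's Lemma for a function F(W_t, t) of a standard Wiener process (the form relevant to his setup) is:

$$dF = \left(\frac{\partial F}{\partial t} + \frac{1}{2}\frac{\partial^2 F}{\partial W^2}\right)dt + \frac{\partial F}{\partial W}\,dW_t$$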