[Question] What are good ML/AI-related prediction/calibration questions for 2019?

I’m trying to come up with a set of questions for self-calibration related to AI and ML.

I’ve written down what I’ve come up with so far below. But I am principally interested in what other people come up with (hence posting this as a question), both for new questions and for predictions on the questions below.

So far I have an insufficient number of questions to produce anything like a nice calibration curve. I’ve also struggled with coming up with meaningful questions.
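For concreteness, here is a minimal sketch of how such a calibration curve could be computed from resolved predictions. The decile binning is one reasonable choice, not the only one:

```python
from collections import defaultdict

def calibration_curve(predictions, n_bins=10):
    """predictions: list of (stated_probability, outcome) pairs,
    where outcome is True if the event happened."""
    bins = defaultdict(list)
    for p, outcome in predictions:
        # Clamp so p == 1.0 falls into the top bin.
        b = min(int(p * n_bins), n_bins - 1)
        bins[b].append((p, outcome))
    curve = []
    for b in sorted(bins):
        ps, outcomes = zip(*bins[b])
        mean_pred = sum(ps) / len(ps)
        freq = sum(outcomes) / len(outcomes)
        curve.append((mean_pred, freq, len(outcomes)))
    # Perfect calibration: mean_pred ~ freq in every bin.
    return curve

# Toy usage with three resolved predictions.
print(calibration_curve([(0.75, True), (0.10, False), (0.80, True)]))
```

With only a handful of questions, most bins hold zero or one prediction, which is exactly why the curve isn’t meaningful yet.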

I’ve rot13’ed my predictions to avoid anchoring anyone. I’m pretty uncertain about most of these as point estimates, however.
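If you want to decode them (or encode your own answers before commenting), any rot13 tool works; for example, in Python:

```python
import codecs

# rot13 is its own inverse, so the same call encodes and decodes.
secret = codecs.encode("My prediction", "rot13")   # -> 'Zl cerqvpgvba'
print(codecs.encode(secret, "rot13"))              # -> 'My prediction'
```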

On Explicitly Stated ML / Systems Goals

It is (relatively) easy to determine whether these are fulfilled. The trade-off is that they likely have little relation to AGI.

1. OpenAI succeeds in defeating top pro teams at unrestricted Dota 2

OpenAI has explicitly said that they wish to beat top human teams at the MOBA Dota 2. Their latest attempt used self-play and familiar policy-gradient methods at incredibly massive scale, but still lost to top teams, who won (relatively?) easily. (A toy sketch of the policy-gradient idea appears at the end of this question.)

I’m also interested in people’s probabilities on whether OpenAI succeeds, conditional on OpenAI not including genuine algorithmic novelty in their learning methods, although that’s a harder question to define because of cloudiness around “algorithmic novelty.”

My prediction: Friragl-svir creprag.
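The “familiar policy-gradient” methods above are, at their core, the REINFORCE update. Here is a minimal, purely illustrative sketch on a toy two-armed bandit; it assumes nothing about OpenAI Five’s actual architecture and is only meant to show what the “familiar” baseline looks like:

```python
import math
import random

# Toy two-armed bandit: arm 1 pays off more often than arm 0.
PAYOFF = [0.2, 0.8]

theta = 0.0  # single logit: P(arm 1) = sigmoid(theta)
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for step in range(2000):
    p1 = sigmoid(theta)
    action = 1 if random.random() < p1 else 0
    reward = 1.0 if random.random() < PAYOFF[action] else 0.0
    # REINFORCE: step theta along grad log pi(action), scaled by reward.
    grad_logp = (1 - p1) if action == 1 else -p1
    theta += lr * reward * grad_logp

print(f"P(arm 1) after training: {sigmoid(theta):.2f}")  # approaches 1.0
```

The scale question is whether this family of updates, run on vastly more compute, is enough, or whether beating top teams requires something genuinely new.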

2. Tesla succeeds in having a self-driving car drive coast-to-coast without intervention.

Tesla sells cars with (ostensibly) all the hardware necessary for full self-driving, and has an in-house self-driving research program that uses a mix of ML and hard-coded rules. They have a goal of giving a demonstration autonomous coast-to-coast drive, although this goal has been repeatedly delayed. There is widespread skepticism both of the sensor suite in Tesla cars and of the maturity of their software.

My prediction: Gra creprag.

3. DeepMind reveals a skilled RL-trained agent for StarCraft II.

After AlphaGo, DeepMind announced that they would try to create an expert-level agent for SCII. They’ve released preliminary research related to this topic, although what they’ve revealed is far from such an agent.

My prediction: Nobhg svir creprag.

On Goals Not So Clearly Marked As Targets

1. Someone gets a score above 80% on the Winograd Schema tests.

The Winograd Schemas are a series of tests designed to probe the common-sense reasoning of a system. (A classic example: “The trophy doesn’t fit in the brown suitcase because it is too big.” Does “it” refer to the trophy or to the suitcase?) Modern ML struggles here: 50% is approximately what random guessing achieves, and the modern state of the art gets less than 70%. (That’s the best score I could find; there are several papers which claim better scores, but as far as I can tell these deal with subsets of the Winograd Schemas. [I.e., the classic “Hey, we got a better score… on a more constrained dataset.”] I might be wrong about this; if so, please enlighten me.)

My prediction: Nobhg svir creprag.

2. Reinforcement Learning Starts Working

This is a bad, unclear goal. I’m not sure how to make it clearer and could use help.

There are a lot of articles about how reinforcement learning doesn’t work: it generalizes incredibly poorly, and it only succeeds on complex tasks like Dota 2 by playing literal centuries’ worth of games. If some algorithm were discovered that let RL work with roughly the same regularity that supervised learning works, that would be amazing. I’m still struggling with a way to state this rigorously. A 10x improvement in sample efficiency on the Atari suite would (probably?) fulfill it, but I’m not sure what else would, and it’s quite a pain to keep track of the current state of the art on Atari anyhow. (One straw operationalization is sketched after my prediction below.)

My prediction: Need to define this more precisely first.
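As a straw proposal for making the 10x criterion concrete: compare the number of environment frames an agent needs to reach a fixed score threshold, year over year, and require the median per-game speedup to exceed 10x. All game names and numbers below are hypothetical placeholders, not real benchmark results:

```python
# Hypothetical frames-to-threshold numbers; NOT real benchmark data.
frames_to_threshold_2018 = {
    "Breakout": 40_000_000,
    "Montezuma": 200_000_000,
    "Pong": 1_000_000,
}
frames_to_threshold_2019 = {
    "Breakout": 3_500_000,
    "Montezuma": 25_000_000,
    "Pong": 120_000,
}

def sample_efficiency_gain(old, new):
    """Per-game speedup; the criterion could require the median >= 10x."""
    return {g: old[g] / new[g] for g in old if g in new}

gains = sample_efficiency_gain(frames_to_threshold_2018,
                               frames_to_threshold_2019)
median = sorted(gains.values())[len(gains) // 2]
print(gains, "median speedup:", median)
```

Even this leaves open questions (which score threshold? which game subset? does the comparison fix the algorithm class?), which is part of why I haven’t pinned the question down.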
