When to assume neural networks can solve a problem


Note: the original article has been split into two, since I think the two points were only vaguely related. I will leave it as is here, since I'd rather not re-post stuff, and I think the audience on LW might see the "link" between the two separate ideas presented here.

A pragmatic guide

Let's begin with a gentle introduction into the field of AI risk - possibly unrelated to the broader topic, but it's what motivated me to write about the matter, and it's also a worthwhile perspective to start the discussion from. I hope for this article to be part musing on what we should assume machine learning can do and why we'd make those assumptions, part reference guide for "when not to be amazed that a neural network can do something".

The various hues of AI risk

I've often had a bone to pick with "AI risk" or, as I've referred to it, "AI alarmism". When evaluating AI risk, there are multiple views on where the threat lies and on the perceived warning signs.

1. The Bostromian position

I would call one of these viewpoints the "Bostromian position"; it seems to be mainly promoted by MIRI, by philosophers like Nick Bostrom, and on forums such as the AI Alignment Forum.

It's hard to summarize without producing apparent straw-man arguments, e.g. "AIXI plus Moore's law means that an all-powerful superhuman intelligence is dangerous, inevitable, and close." That's partly because I've never seen a consistent top-to-bottom reasoning for it. Its proponents always seem to start by assuming things I wouldn't take as given about the ease of data collection, the cost of computing power, and the usefulness of intelligence.

I've tried to argue against this position; the summary of my view can probably be found in "Artificial general intelligence is here, and it's useless". Whilst, for the reasons mentioned there, I don't see it as particularly stable, I don't think it's fundamentally flawed; I could see myself arguing either pro or con.

2. The standard position

This position is advocated by people ranging from my friends to politicians, respectable academics, and CEOs of large tech companies. It is perhaps best summarized in Stuart Russell's book Human Compatible: Artificial Intelligence and the Problem of Control.

This viewpoint is mainly based around real-world use cases for AI (where "AI" can be understood as "machine learning"). People adopting this perspective are not wrong in being worried, but rather in being worried about the wrong things.

It's wrong to be upset by Facebook or YouTube using an algorithm to control and understand user preferences and to blame it on "AI", rather than on people not being educated enough to use Tor, install a tracking blocker such as uBlock Origin, and not center their entire life around conspiracy videos in their YouTube feed or anti-vaccination Facebook groups.

It's wrong to be alarmed by Amazon making people impulse-buy via a better understanding of their preferences, and thus getting them into inescapable debt, rather than by the legality of providing unethically leveraged debt so easily.

It's wrong to fuss about automated trading being able to cause sudden large dips in the market, rather than about having markets so unstable and so focused on short-term trading that such dips are possible in the first place.

It's wrong to worry about NLP technology being used to implement preventive policing measures, rather than about governments being allowed to steal their citizens' data, to request backdoors into devices, and to use preventive policing to begin with.

It's wrong to worry about the Chinese Communist Party using facial recognition and tracking technology to limit civil rights; worry instead about the CCP ruling via a brutal dictatorship that implements such measures without anybody doing anything about it.

But I digress, though I ought to give a full rebuttal of this position at some point.

3. The misinformed position

A viewpoint distinct from the previous two, it stems from misunderstanding what machine learning systems can already do. It basically consists in panicking over "new developments" which have actually existed for decades.

This view is especially worth fighting against, since it's based on misinformation. Whereas with categories 1 and 2 I can see valid arguments arising for regulating or better understanding machine learning systems (or AI systems in general), people in the third category just don't understand what's going on, so they are prone to adopt any view out of sheer fear or a need to belong, without truly understanding the matter.

Until recently I thought this kind of task was better left to PBS. In hindsight, I've seen otherwise smart individuals amazed that "AI" can solve a problem which anyone who has actually worked with machine learning could have told you is obviously solvable, and has been forever.

Furthermore, I think addressing this viewpoint is relevant, as it's actually challenging and interesting. The question "What problems should we assume can be solved with machine learning?", or, narrower and more focused on current developments, "What problems should we assume a neural network is able to solve?", is one I haven't seen addressed much.

There are theories like PAC learning and AIXI which at a glance seem to revolve around this question, as it pertains to machine learning in general, but which, if actually tried in practice, won't yield any meaningful answer.

How people misunderstand what neural networks can do

Let's look at the general pattern of fear generated by misunderstanding machine learning capabilities we've had for decades.

  • Show a smart but relatively uninformed person (a philosophy PhD or an older street-smart businessman) a deep learning party trick.

  • Give them the most convoluted and scary explanation of why it works. E.g. explain DeepDream using incomplete neurological data about the visual cortex and human image processing, rather than just saying it's the output of a complex edge detector overfit to recognize dog faces.

  • Wait for them to write an article about it in Vox & co.

An example that originally motivated this article is Scott Alexander's post about being amazed that GPT-2 is able to learn how to play chess, poorly.

It seems to imply that GPT-2 playing chess well enough not to lose very badly against a mediocre opponent (the author) is impressive and surprising.

Actually, the fact that a 1,500,000,000-parameter model designed for sequential inputs can be trained to kind-of play chess is rather unimpressive, to say the least. I would have been astonished if GPT-2 were unable to play chess. Fully connected models a hundred times smaller (https://github.com/pbaer/neural-chess) could do that more than 2 years ago.

The successful training of GPT-2 is not a feat, because if a problem like chess has already been solved using various machine learning models, we can assume it can also be done with a generic neural network architecture (e.g. any given FC net, or an FC net with a few attention layers) hundreds or thousands of times larger in terms of parameters.
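To make the size comparison concrete, here's a minimal sketch (in PyTorch) of the kind of fully connected move predictor that can "kind of" play chess. It is not the linked repo's code; the board encoding (8x8 squares x 12 piece planes) and the from-square/to-square output are just illustrative assumptions. Even with two wide hidden layers it stays under 3 million parameters, hundreds of times smaller than GPT-2.

```python
# Sketch, not the linked repo's code: a fully connected chess move predictor.
# Input: one-hot board, 8x8 squares x 12 piece planes.
# Output: a score for every (from-square, to-square) pair.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8 * 8 * 12, 512),  # 768 input features
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 64 * 64),     # 4096 possible from->to moves
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # ~2.8M, i.e. hundreds of times smaller than GPT-2
```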

When to assume a neural network can solve a problem

In the GPT-2 example, transformers (i.e. the BERT-like models inspired by the design proposed in the "Attention Is All You Need" paper) are pretty generic as far as NN architectures go. Not as generic as a fully connected net, arguably, but they seem to perform more efficiently (in terms of training time and model size) on many tasks, and they are much better on most sequential-input tasks.

So when should we assume that such generic NN architectures can solve a problem?

The answer might ease uninformed awe, and it might be relevant to actual problems: the kind for which "machine learning" might have been considered, but with some doubt about whether it's worth bothering.

Playing chess decently is also an already-solved problem. It can be done using small (compared to GPT-2) decision trees and a few very simple heuristics (see for example: https://github.com/AdnanZahid/Chess-AI-TDD). If a much smaller model can learn to play "decently", we should assume that a fairly generic, exponentially larger neural network can do the same.

The rule of thumb is:

1. A neural network can almost certainly solve a problem if another ML algorithm has already succeeded

Given a problem that can be solved by an existing ML technique, we can assume that a somewhat generic neural network, if allowed to be significantly larger, can also solve it.

This assumption doesn't always hold, because:

  • a) Depending on the architecture, a neural network could easily be unable to optimize for a given problem. Playing chess might be impossible for a convolutional network with a large window and stride, even if it's very big.

  • b) Certain ML techniques have a lot of built-in heuristics that might be hard for a neural network to learn. The existing ML technique mustn't have any critical heuristics built into it, or at least you have to be able to include the same heuristics in your neural network model.

As we are focusing mainly on generalizable neural network architectures (e.g. a fully connected net, which is what most people think of initially when they hear "neural network"), point a) is pretty irrelevant.

Given that most heuristics apply equally well to any model, even for something like chess, and that size can sometimes be enough for the network to just learn the heuristic, this rule holds almost every time.

I can't really think of a counterexample here… maybe some specific types of numeric projections?

This is a rather boring first rule, yet worth stating as a starting point to build up from.
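As a toy illustration of the rule (a sketch on synthetic data, not a benchmark): if a gradient-boosted tree ensemble handles a tabular classification task, a generic, reasonably sized MLP trained on the same data will usually land in the same accuracy neighborhood.

```python
# Sketch on synthetic data: if boosted trees solve it, a generic (larger) MLP usually can too.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

trees = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=500, random_state=0).fit(X_train, y_train)

print("boosted trees:", trees.score(X_test, y_test))
print("generic MLP:  ", mlp.score(X_test, y_test))
```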

2. A neural network can almost certainly solve a problem very similar to ones already solved

Let's say you have a model for predicting the credit risk of a given customer based on a few parameters, e.g. current balance, previous credit record, age, driver's license status, criminal record, yearly income, length of employment, {various information about the current economic climate}, marital status, number of children, and porn websites visited in the last 60 days.

Let's say this model "solves" your problem, i.e. it predicts risk better than 80% of your human analysts.

But GDPR rolls along, and you can no longer legally spy on some of your customers' internet history by buying that data. You need to build a new model for those customers.

Your inputs are now truncated: the customers' online porn history is no longer available (or rather, no longer legally usable).

Is it safe to assume you can still build a reasonable model to solve this problem?

The answer is almost certainly "yes"; given our knowledge of the world, we can safely assume someone's porn browsing history is not as relevant to their credit rating as some of those other parameters.
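In practice, very little changes. A sketch of the retrain-and-compare step, with hypothetical file and column names, assuming the features are already numeric:

```python
# Sketch with hypothetical column names: drop the now-unavailable feature and retrain.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("credit_history.csv")  # hypothetical dataset
y = df["defaulted"]                     # hypothetical label column

X_full = df.drop(columns=["defaulted"])
X_trimmed = X_full.drop(columns=["porn_sites_last_60_days"])  # feature we can no longer use

model = RandomForestClassifier(n_estimators=300, random_state=0)
print("with feature:   ", cross_val_score(model, X_full, y, cv=5).mean())
print("without feature:", cross_val_score(model, X_trimmed, y, cv=5).mean())
# If the scores are close, rule 2 applies: the trimmed problem is "similar enough".
```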

Another example: assume you know someone else is using a model, but their data is slightly different from yours.

You know of a US-based, snake-focused pet shop that uses previous purchases to recommend products, and they've told you it's done quite well for their bottom line. You are a UK-based, parrot-focused pet shop. Can you trust their model, or a similar one, to solve your problem if trained on your data?

Again, the right answer is probably "yes", because the data is similar enough. That's why building a product recommendation algorithm was a hot topic 20 years ago, but nowadays everyone and their mom can just get a WordPress plugin for it and get close to Amazon's level.

Or, to get more serious, let's say you have a given algorithm for detecting breast cancer that, if trained on 100,000 images with follow-up checks to confirm the true diagnoses, performs better than an average radiologist.

Can you assume that, given the ability to make the model larger, you can build one to detect cancer in other types of soft tissue, also better than a radiologist?

Once again, the answer is yes. The argument here is longer, because we are less certain, mainly due to the lack of data; I've spent more or less a whole article arguing that the answer would still be yes.

In NLP, the exact same neural network architectures seem to be decently good at translation or text generation in any language, as long as it belongs to the Indo-European family and there is a significant corpus of data for it (i.e. equivalent to that used for training the existing models for English).

Modern NLP techniques seem to be able to tackle all language families, and they are doing so with less and less data. To some extent, however, the similarity of the data and the number of training examples are tightly linked to a model's ability to quickly generalize to many languages.

Or, looking at image recognition and object detection/boxing models, the main bottleneck is the availability of large amounts of well-labeled data, not the contents of the images. Edge cases exist, but generally all types of objects and images can be recognized and classified if enough examples are fed into an architecture originally designed for a different image task (e.g. a convolutional residual network designed for ImageNet).

Moreover, given a network trained on ImageNet, we can keep the initial weights and biases (essentially what the network "has learned") instead of starting from scratch, and it will be able to "learn" much faster on different datasets from that starting point.
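A minimal sketch of that transfer-learning recipe, assuming a reasonably recent torchvision and a hypothetical 10-class downstream task:

```python
# Sketch: reuse ImageNet weights instead of training from scratch (torchvision).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet weights

for param in model.parameters():  # optionally freeze the pretrained layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)  # new head for a hypothetical 10-class task
# ...then train as usual on the new dataset; convergence is typically much faster
# than starting from random weights.
```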

3. A neural network can solve problems that a human can solve with small-sized datapoints and little to no context

Let's say we have 20x20px black and white images of two objects never seen before; they are "obviously different", but not known to us. It's reasonable to assume that, given a bunch of training examples, humans would be quite good at distinguishing the two.

It is also reasonable to assume that, given a bunch of examples (let's say 100), almost any neural network with millions of parameters would ace this problem just like a human would.

You can visualize this in terms of the amount of information to learn. In this case we have 400 pixels of 255 values each, so it's reasonable to assume every possible pattern could be accounted for with a few million parameters in our equation.
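A quick sanity check of that intuition, on synthetic stand-ins for the two "obviously different" objects; the patterns and the model size below are just illustrative assumptions:

```python
# Sketch: two synthetic "obviously different" 20x20 patterns, ~100 training examples,
# and a generic MLP with a few million parameters.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_image(cls):
    img = rng.normal(0, 0.3, (20, 20))       # background noise
    if cls == 0:
        img[4:9, 4:9] += 1.0                  # bright square, top-left
    else:
        img[12:17, 12:17] += 1.0              # bright square, bottom-right
    return img.ravel()                        # 400 input features

labels = rng.integers(0, 2, 200)
X = np.stack([make_image(c) for c in labels])
X_train, y_train, X_test, y_test = X[:100], labels[:100], X[100:], labels[100:]

clf = MLPClassifier(hidden_layer_sizes=(2048, 1024), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # essentially 1.0

n_params = sum(w.size for w in clf.coefs_) + sum(b.size for b in clf.intercepts_)
print(f"{n_params:,} parameters")  # a few million
```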

But what "small datapoints" means here is the crux of this definition.

In short, "small" is a function of:

  • The size of your model. The bigger the model, the more complex the patterns it can learn, and the bigger your possible inputs/outputs.

  • The granularity of the answer (output). E.g. 1,000 classes vs 10 classes, or an integer range from 0 to 1,000 vs one from 0 to 100,000. In this case, 2 classes.

  • The size of the input. In this case 400, since we have a 20x20 image.

Take a classic image classification task like MNIST. Although a few minor improvements have been made, the state of the art for MNIST hasn't progressed much: the last 8 years have yielded an improvement from ~98.5% to ~99.4% accuracy, both of which are well within the usual "human error range".

Compare that to something much bigger in terms of input and output size, like ImageNet, where the last 8 years have seen a jump from ~50% to almost 90%.

Indeed, even with pre-CNN techniques, MNIST was basically solvable.

But even having defined "small" as a function of the above, we don't have a formula for the actual function. I think finding one is much harder, but we can come up with a "cheap" answer that works for most cases, and indeed it's all we need:

  • A given task can be considered small when other tasks of equal or larger input and output size have already been solved via machine learning, with more than one architecture, on a single GPU.

This might sound like a silly heuristic, but it holds surprisingly well for most "easy" machine learning problems. For instance, the reason many NLP tasks are now more advanced than most "video" tasks is size, despite the tremendous progress on images (a domain much closer to video) in terms of network architecture. The input and output sizes for meaningful tasks on videos are much larger; NLP, on the other hand, even though it's in a completely different domain, is much closer size-wise to image processing.

Then, what does "little to no context" mean?

This is a harder one, but we can rely on examples with "large" and "small" amounts of context.

  • Predicting the stock market likely requires a large amount of context. One has to be able to dig deeper into the companies one invests in: check market fundamentals, recent earnings calls, and the C-suite's history; understand the company's product; maybe get some information from its employees and customers; and, if possible, get insider info about upcoming sales and mergers, etc.

You can try to predict the stock market based purely on indicators about the stock market itself, but this is not the way most humans solve the problem.

  • On the other hand, predicting the yield of a given printing machine based on the temperature and humidity of the environment could be solved via context, at least to some extent. An engineer working on the machine might know that certain components behave differently in certain conditions. In practice, however, an engineer would basically let the printer run, change the conditions, look at the yield, then come up with an equation. So, given that data, a machine learning algorithm can probably also come up with an equally good solution, or even a better one (see the sketch after this example).

In that sense, an ML algorithm would likely produce results similar to a mathematician solving the equation, since the relevant context is basically non-existent for the human as well.

There are certainly some limits: unless we test our machine at 4,000 °C, the algorithm has no way of knowing that the yield will be 0 because the machine will melt, whereas an engineer might suspect that.
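A sketch of the printer example on made-up data; the yield curve below is purely hypothetical, not a real printer model:

```python
# Sketch: made-up printer-yield data; a generic regressor learns the relationship
# from (temperature, humidity) pairs alone, no domain context needed.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
temp = rng.uniform(15, 35, 2000)      # deg C, plausible operating range
humidity = rng.uniform(20, 80, 2000)  # percent

# Hypothetical "true" yield curve, unknown to the model.
yield_ = 100 - 0.05 * (temp - 25) ** 2 - 0.01 * (humidity - 50) ** 2 + rng.normal(0, 1, 2000)

X = np.column_stack([temp, humidity])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:1500], yield_[:1500])
print("R^2 on held-out runs:", model.score(X[1500:], yield_[1500:]))

# But extrapolation has limits: the model has never seen the machine melt.
print("predicted yield at 4000 C:", model.predict([[4000.0, 50.0]])[0])  # will NOT be 0
```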

So, I can formulate this 3rd principle as:

A generic neural network can probably solve a problem if:

  • A human can solve it;

  • Tasks with similarly sized inputs and outputs have already been solved by an equally sized network;

  • Most of the relevant contextual data a human would use is included in the input data of our algorithm.

Feel free to change my mind (with examples).

However, this still requires evaluating against human performance, and a lot of applications of machine learning are interesting precisely because they can solve problems humans can't. Thus, I think we can go even deeper.

4. A neural network might solve a problem when we are reasonably sure it's deterministic, we provide any relevant context as part of the input data, and the data is reasonably small

Here I'll come back to one of my favorite examples: protein folding. It's one of the few problems in science where data is readily available, where interpretation and meaning are not confounded by large amounts of theoretical baggage, and where the size of a datapoint is small enough by our previous definition. You can boil the problem down to:

  • Around 2,000 input features (the amino acids in the primary sequence), though capping the length there means our domain will only cover 99.x% of proteins rather than literally all of them.

  • Circa 18,000 corresponding output features (the atom positions in the tertiary structure, a.k.a. the shape, which need to be predicted to obtain the structure).

This is one example. As with most NLP problems, where "size" becomes very subjective, one could easily argue that one-hot encoding is required for this type of input; the size then suddenly becomes 40,000 (there are 20 proteinogenic amino acids that can be encoded by DNA), or 42,000 if you care about selenoproteins, or 44,000 if you care about niche proteins that don't appear in eukaryotes.
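Making that size arithmetic explicit (a sketch; the 2,000-residue cap and the one-hot choice are just the assumptions stated above):

```python
# Sketch of the input/output size arithmetic under the assumptions above.
MAX_RESIDUES = 2_000

for n_amino_acids in (20, 21, 22):  # standard / + selenocysteine / + pyrrolysine
    print(n_amino_acids, "amino acids ->", MAX_RESIDUES * n_amino_acids, "one-hot inputs")
# 20 -> 40,000   21 -> 42,000   22 -> 44,000

OUTPUT_FEATURES = 18_000  # circa, per the estimate above
print("outputs:", OUTPUT_FEATURES)
```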

It could also be argued that the input and output sizes are much smaller, since most proteins are far shorter than that, and we can mask and discard most of the inputs and outputs in most cases.

Still, there are plenty of tasks that go from, e.g., a 255x255-pixel image to another 255x255-pixel image (style alteration, resolution enhancement, style transfer, contour mapping, etc.). So based on this I'd posit that the protein folding data is reasonably small, and has been for the last few years.

Indeed, resolution enhancement via neural networks and protein folding via neural networks came about at around the same time (with very similar architectures, mind you). But I digress; I'm mistaking a correlation for the causal process that supposedly generated it. Then again, that's the basis of most self-styled "science" nowadays, so what is one more sin against the scientific method added to the pile?

Based on my own fooling around with the problem, it seems that even a very simple model, simpler than something like VGG, can learn something "meaningful" about protein folding. It can make guesses better than random and, often enough, come within 1% of the actual positions of the atoms, given enough parameters (135 million) and half a day of training on an RTX 2080. I can't be sure about the exact accuracy, since the exact evaluation criterion here is apparently pretty hard to find and/or understand and/or implement for people who aren't domain experts… or I'm just daft, which is also a strong possibility.

To my knowledge, the first widely successful protein folding network, AlphaFold, whilst using some domain-specific heuristics, did most of the heavy lifting using a residual CNN, an architecture designed for categorizing images, about as unrelated to protein folding as one can imagine.

That is not to say any architecture could have tackled this problem as well. Rather, it means we needn't build a whole new technique to approach this type of problem. It's the kind of problem a neural network can solve, even though it might require a bit of looking around for the exact network that can do it.

The other important thing here is that the problem seems to be deterministic. Namely:

  • a) We know peptides can be folded into proteins in the kind of inert environment that most of our models assume, since that's what we've always observed them to do.

  • b) We know that amino acids are a component which can fully describe a peptide.

  • c) Since we assume the environment is always the same, and we assume the folding process itself doesn't alter it much, the problem is not a function of the environment (obviously this holds for in-vitro folding; in-vivo, the problem becomes much harder).

The issue arises when thinking about b): we know that the universe can deterministically fold peptides, and we know amino acids are enough to accurately describe a peptide. However, the universe doesn't work with "amino acids"; it works with trillions of interactions between much smaller particles.

So while the problem is deterministic and self-contained, there's no guarantee that learning to fold proteins doesn't entail learning a complete model of particle physics, one able to break down each amino acid into smaller functional components. A few million parameters wouldn't be enough for that task.

This is what makes this 4th, most generic, definition the hardest to apply.

Some other examples here are things like predictive maintenance, where machine learning models are being actively used to tackle problems humans can't, at any rate not without mathematical models. For these types of problems, there are strong reasons to assume, based on the existing data, that the problems are partially (mostly?) deterministic.

There are simpler examples here, but I can't think of any that, at the time of their inception, didn't already fall into the previous 3 categories. At least, none that aren't considered reinforcement learning.

The vast majority of examples fall within reinforcement learning, where one can solve an impressive number of problems once one is able to simulate them.

People can find optimal aerodynamic shapes, design weird antennas that provide more efficient reception/coverage, and beat video games like Dota and StarCraft, which are exponentially more complex (in terms of degrees of freedom) than chess or Go.

The problem with RL is that designing the actual simulation is often much more complicated than using it to find a meaningful answer. RL is fun to do, but doesn't often yield useful results. However, edge cases do exist where designing the simulation does seem to be easier than extracting inferences out of it. Besides, the more simulations advance based on our understanding of efficiently simulating physics (itself helped by ML), the more such problems will become ripe for the picking.
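The general pattern, stripped of any particular RL algorithm, is "wrap the problem in a simulator, then search against it". A toy sketch; the "simulator" below is a made-up stand-in, not a real physics model:

```python
# Toy sketch of the simulate-then-optimize pattern: once a problem can be scored
# by a simulator, even dumb random search can make progress on it.
import numpy as np

rng = np.random.default_rng(0)

def simulate(design: np.ndarray) -> float:
    """Made-up stand-in for an expensive simulator (e.g. antenna coverage)."""
    return -float(np.sum((design - np.array([0.3, -1.2, 2.0, 0.7])) ** 2))

best_design = rng.normal(size=4)
best_score = simulate(best_design)

for _ in range(5000):  # random search with local perturbations
    candidate = best_design + rng.normal(scale=0.1, size=4)
    score = simulate(candidate)
    if score > best_score:
        best_design, best_score = candidate, score

print(best_design, best_score)  # converges toward the simulator's optimum
```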

In conclusion

I've attempted to provide a few simple heuristics for answering the question "When should we expect that a neural network can solve a problem?". That is to say: to which problems should you apply neural networks, in practice, right now? Which problems should leave you "unimpressed" when solved by a neural network? For which problems should our default hypothesis include their solvability, given enough architecture searching and current GPU capabilities?

I think this is fairly useful, not only for not getting impressed when someone shows us a party trick and tells us it's AGI, but also for helping us quickly classify a problem as "likely solvable via ML" or "unlikely to be solved by ML".

To recap, neural networks can probably solve your problem:

  1. [Almost certain] If other ML models have already solved the problem.

  2. [Very high probability] If a similar problem has already been solved by an ML algorithm, and the differences between that problem and yours don't seem significant.

  3. [High probability] If the inputs and outputs are small enough to be comparable in size to those of other working ML models, AND we know a human could solve the problem with little context besides the inputs and outputs.

  4. [Reasonable probability] If the inputs and outputs are small enough to be comparable in size to those of other working ML models, AND we have a high certainty about the deterministic nature of the problem (that is to say, about the inputs being sufficient to infer the outputs).

I am not certain about any of these rules, but that comes back to the problem of being able to say anything meaningful at all. PAC learning can give us almost perfect certainty and is mathematically valid, but it breaks down beyond simple classification problems.

Coming up with these kinds of rules doesn't provide an exact degree of certainty, and they are derived from empirical observations. However, I think they can actually be applied to real-world problems.

Indeed, these are, to some extent, the rules I apply to real-world problems when a customer or friend asks me whether a given problem is "doable". They also seem pretty close to the rules I've noticed other people using when thinking about which problems can be tackled.

So I'm hoping this can serve as an actual practical guide for newcomers to the field, or for people who don't want to get too involved in ML itself but have some datasets they want to work on.