# A Simple Introduction to Neural Networks

This post will discuss what neural networks are (chapter I), why they work (chapter II), and how they are trained (chapter III). It serves as the tenth entry in a sequence on Machine Learning, but it's written to be accessible without having read any previous part. Most of chapters I and II also doesn't require any advanced math.

Note that bold and italics mean "I am introducing a new Machine Learning term."

# I

Meet the neural network:

The term "neural network" describes a class of Machine Learning predictors which is inspired by the architecture of the human brain. It is an extremely important class: AlphaGo, AlphaGo Zero, AlphaStar, and GPT-2 are all based on neural networks.

The network above will be our running example throughout this post. Let's begin by defining its components. The blobs are called neurons. The arrows are called edges. Together, they define the graph underlying the neural network. (This usage of the term "graph" has nothing to do with a function graph – instead, a graph is simply a bunch of nodes with arrows between them.) If this graph has no cycles (= edges going around in a circle), the neural network is also called feed-forward. For this post, you can assume networks are always feed-forward. The green neurons are called input neurons, the red neurons are called output neurons, and the blue neurons are called bias neurons.

If it is possible to divide the entire set of neurons into an ordered list of subsets, such that each neuron only points towards neurons in the subset next in line, the network is called layered. For this post, you can assume networks are always layered. However, note that non-layered networks do exist: a network with edges $1 \to 2$, $2 \to 3$, and $1 \to 3$ cannot be divided into layers.

As you can see, each neuron only points towards neurons in the next higher layer. Note that the first blue neuron, despite being in the input layer, isn't an input neuron.

So far, our view has been simplified. In reality, there's more going on:

The $w_{i,j}$ written on the edges are numbers called weights. They determine how important the value of the incoming edge is. The $\sigma_i$ written inside of the neurons are functions (in the classical sense: they take a value as input and output something else). From here on, if I talk about "the fifth neuron," I mean the neuron with $\sigma_5$ written in it.

This picture is a good guide to explain how a neural network actually works.

First, an input value $x$ (think of a number) is fed into the input neuron, which forwards it to neurons three and four. At the same time, neuron two (the bias neuron) sends the value 1 to neurons three and four (sending the number 1 is all that bias neurons ever do). At this point, the third neuron has received two values – it now multiplies them with the weights on the respective edges, so it multiplies the value coming from the first neuron with $w_{1,3}$ and the number 1 coming from the bias neuron with $w_{2,3}$. Then it adds both values together and applies the function $\sigma_3$ to them. The result of this process is $\sigma_3(w_{1,3}\, x + w_{2,3})$. The fourth neuron has also received two values, and it does the same thing (using the weights $w_{1,4}$ and $w_{2,4}$), leading to the term $\sigma_4(w_{1,4}\, x + w_{2,4})$. With that done, the third neuron sends its value (the one we just computed) to the sixth neuron, the fourth sends its value to both the sixth and the seventh, and the fifth neuron (the bias neuron) sends the number 1 to both the sixth and the seventh neuron. Now they apply their weights, then their functions, and so on. Eventually, the ninth neuron receives a value, applies $\sigma_9$ to it, and outputs the result, which is the output of the network.
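
To make the walkthrough concrete, here is a minimal sketch of this forward pass in Python. The weight values and the choice of the logistic function for the $\sigma_i$ are my own assumptions – the post leaves them unspecified:

```python
import math

def sigma(z):
    # hypothetical choice for every neuron's function: the logistic function
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w):
    # w[(i, j)] is the weight on the edge from neuron i to neuron j;
    # neurons 2, 5, 8 are the bias neurons, which always contribute a 1
    o3 = sigma(w[(1, 3)] * x + w[(2, 3)] * 1.0)
    o4 = sigma(w[(1, 4)] * x + w[(2, 4)] * 1.0)
    o6 = sigma(w[(3, 6)] * o3 + w[(4, 6)] * o4 + w[(5, 6)] * 1.0)
    o7 = sigma(w[(4, 7)] * o4 + w[(5, 7)] * 1.0)
    o9 = sigma(w[(6, 9)] * o6 + w[(7, 9)] * o7 + w[(8, 9)] * 1.0)
    return o9

# made-up weight values, just to run the network end to end
weights = {(1, 3): 0.5, (2, 3): -0.1, (1, 4): -0.3, (2, 4): 0.2,
           (3, 6): 0.8, (4, 6): 0.4, (5, 6): 0.1, (4, 7): -0.6, (5, 7): 0.3,
           (6, 9): 1.2, (7, 9): -0.7, (8, 9): 0.05}

output = forward(2.0, weights)
```

Since every neuron here applies the logistic function, the output always lands strictly between 0 and 1.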

Recall that the blue neurons are called bias neurons – that is because they have no incoming edges. Instead of computing something, they always output 1 and thus "bias" the weighted sum, which arrives at the neurons in the next layer, by a constant amount. Each hidden layer has exactly one bias neuron, and the input layer also has one. That way, each neuron except the input neurons has an incoming edge from a bias neuron (which is important). Also, note that the input neuron doesn't apply any function to the input it receives (but all other neurons do).

To describe this process more formally, we need some additional notation for the value that goes into a neuron (the weighted sum of the values on its incoming edges) and the value which comes out (after the function is applied to it). Let's zoom into the sixth neuron:

When the input values arrive, first the weighted sum $a_6$ is computed, then $\sigma_6$ is applied to it, and that will be the output value $o_6$. Thus,

$$o_6 = \sigma_6(a_6) = \sigma_6(w_{3,6}\, o_3 + w_{4,6}\, o_4 + w_{5,6}).$$

The new thing here is that we express the output of a neuron in terms of the outputs of neurons in the previous layer; thus, the equation can describe the output of any neuron in the network. Notice that the value $o_5$ is missing – that's because the fifth neuron is a bias neuron, which means it puts out the value 1, which means that $w_{5,6}\, o_5$ is the same as $w_{5,6}$. Also, note that the image shows the weights of the incoming edges, but not the weight of the outgoing edge: the weight of an edge should always be thought of as belonging to the neuron the edge is heading into.

Recall that our neural network can be divided into layers:

This is useful for evaluating the network – it allows us to do so layer-by-layer. In the beginning, we just know the input value(s) $x_j$. We set

$$o_j = x_j$$

for each neuron $j$ in layer 1 (the input layer). This is just a relabeling – nothing has happened yet. But now, assuming we have all the output values for some layer $\ell$, we compute the output values for each neuron $j$ in layer $\ell + 1$ by setting

$$o_j = \sigma_j\left(\sum_{i \in P(j)} w_{i,j}\, o_i\right),$$

where $P(j)$ is the set of indices of neurons that have an edge pointing towards neuron $j$. Since this equation applies for any layer, it allows us to evaluate the entire network, one layer at a time. The output values of the final layer will be the output values of the network. If you're unfamiliar with the $\sum$ notation, then never mind this – the equation merely states what we have already discussed, namely that the output of each neuron is computed in terms of the outputs of neurons in the previous layer. The purpose of writing it down as a single equation is to make it less ambiguous and more compact.
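
The layer-by-layer rule can be sketched as code. The data-structure choices here (a list of layers, a dict of incoming edges, a set of bias neurons) and the weight values are my own assumptions:

```python
import math

def sigma(z):
    # hypothetical per-neuron function: the logistic function
    return 1.0 / (1.0 + math.exp(-z))

def evaluate(layers, in_edges, bias, inputs):
    out, values = {}, iter(inputs)
    for j in layers[0]:                  # input layer: just a relabeling
        out[j] = 1.0 if j in bias else next(values)
    for layer in layers[1:]:
        for j in layer:
            if j in bias:
                out[j] = 1.0             # bias neurons always output 1
            else:                        # the rule: apply sigma to the weighted sum
                out[j] = sigma(sum(w * out[i] for i, w in in_edges[j]))
    return [out[j] for j in layers[-1]]

# the running example: neurons 2, 5, 8 are bias neurons; weights are made up
layers = [[1, 2], [3, 4, 5], [6, 7, 8], [9]]
in_edges = {3: [(1, 0.5), (2, -0.1)], 4: [(1, -0.3), (2, 0.2)],
            6: [(3, 0.8), (4, 0.4), (5, 0.1)], 7: [(4, -0.6), (5, 0.3)],
            9: [(6, 1.2), (7, -0.7), (8, 0.05)]}
result = evaluate(layers, in_edges, {2, 5, 8}, [2.0])
```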

# II

Now you know what a neural network is and how it is evaluated. But what are neural networks good for, and why do they work so well?

Let's start with the first question. If you already know what they're good for, you can skip forward to section II.II for the second one. Otherwise, here's a quick overview.

## II.I

If all values (input values, output values, and values sent between neurons) are regular numbers, then the entire network implements a function $f : \mathbb{R}^n \to \mathbb{R}^m$, where $n$ is the number of neurons in the input layer and $m$ the number of neurons in the output layer. In our case, $n = m = 1$. The notation $f : \mathbb{R}^n \to \mathbb{R}^m$ means that $f$ always takes an element in $\mathbb{R}^n$, which is a vector of $n$ numbers (all numbers fed to input neurons), and returns an element in $\mathbb{R}^m$, which is a vector of $m$ numbers (all numbers returned by output neurons). And it does this for any possible input – so for any $n$ input numbers, it returns $m$ output numbers. To say that the network "implements a function" means that we can abstract away from how exactly it works: in the end, its behavior is fully described by such a function $f$. This is nice because functions are better-understood objects than neural networks.

So the utility of a neural network is that it lets us evaluate this function $f$. This can be valuable if $f$ does useful things. As an example, we can suppose that $f$ takes the source code of an image as input and outputs a single bit indicating whether or not the image contains a cat. In that case, if we suppose the image is gray-scale with 1 byte per pixel and 200 by 200 pixels, the function is of the form $f : \{0,1\}^{320{,}000} \to \{0,1\}$. This notation means that it takes a vector of 320,000 bits as input (which determines an image), and outputs a single bit, where 1 means "cat" and 0 means "no cat". And again, it does this for every possible vector of bits. Having access to this function would be useful, and thus, having a neural network that implements this function – or more realistically, which implements a similar function (with some flaws) – would also be useful.

Another example could be the function $g : \{0,1\}^{2240} \to \{0, 1, \ldots, 10\}$. This function takes a vector of 2240 bits as input which encodes a tweet (280 character limit, each character is encoded by 1 byte, weird symbols are ignored), and outputs a number between 0 and 10 that rates how likely this tweet is to violate Twitter's terms of service. This function would be very useful to learn – and, in fact, Jack Dorsey mentioned that they are using Machine Learning for stuff like this (whether they use neural networks, I don't know, but they probably do).

How to train a neural network is something we discuss in detail in chapter III. For now, I'll only mention that it works based on training data. That is, we have some sequence $S = ((x_1, y_1), \ldots, (x_m, y_m))$ of examples, where the $x_i$ are inputs to the neural network (in the Twitter example, tweets), and the $y_i$ are outputs (in the Twitter example, scores from 0 to 10 that have been assigned by a human). We then train the network to do well on the training data and hope that this will lead it to also do well in the real world (in the Twitter example, we hope that it also labels new tweets accurately). However, this description is not specific to neural networks – it applies to all instances of supervised learning, which is a big part of Machine Learning – and there are many different machine learning techniques out there. So what is it about neural networks in particular that makes them so powerful? We know that they are inspired by the human brain, but that is hardly an answer.

## II.II

This feels like an apt time to add the disclaimer that I am very much not an expert. The following is a result of my searching the existing literature for the best answers, but it is an incomplete search evaluated with limited knowledge.

That said, one possible answer is to point towards the expressibility theorems we have on neural networks. Most notably, there is the universal approximation theorem – it states that a feed-forward neural network with only a single hidden layer (recall that, in our example, we had two hidden layers) can approximate any continuous, bounded function under some reasonably mild conditions. Sounds pretty good – until you realize that the proof basically consists of brute-forcing the approximation. This makes it kind of like pointing out that the class of all hypotheses such that there is a C program of size at most 1 GB implementing them can also solve a wide variety of practically relevant problems. A fairly damning fact is that the neural networks constructed in the proof have exponentially many neurons in the hidden layer – just like there are exponentially many C programs with at most $n$ bits. ("Exponential" means that the number is an exponential function of $n$, in this case $2^n$.) Results stating that something can be done with exponential resources are generally unimpressive.

A much simpler result that can be rigorously proved in just a few lines states that a neural network (with only one hidden layer) can implement any function of the kind $f : \{0,1\}^n \to \{0,1\}$. This sort of function is called a boolean function, and it can be thought of as taking in $n$ input bits and outputting a binary yes/no. More precisely, the theorem states that there is an architecture for a neural network (i.e., a graph) such that, for each possible function $f : \{0,1\}^n \to \{0,1\}$, there is a set of weights such that the neural network defined by that architecture & that set of weights implements that function. Moreover, every neuron in this network applies the same, very simple function. But once again, the number of neurons in the hidden layer of that architecture is exponential in $n$. And there is also a second theorem which shows that this is a lower bound – an (architecture for a) neural network implementing this kind of function has to have a number of neurons exponential in $n$, no matter how smartly it is designed.
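
One standard way such a construction can go (a sketch under my own assumptions, not necessarily the proof the post has in mind): dedicate one hidden neuron to each input pattern on which $f$ outputs 1, with every neuron applying a simple threshold function, and let the output neuron OR the hidden layer together. Note the hidden layer can have up to $2^n$ neurons, matching the exponential bound:

```python
from itertools import product

def step(z):
    # the simple function every neuron applies: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def build_network(f, n):
    # One hidden neuron per pattern p with f(p) = 1.  Its weighted sum
    # reaches its threshold exactly when the input equals p, so the
    # output neuron just checks whether any hidden neuron fired.
    patterns = [p for p in product([0, 1], repeat=n) if f(p) == 1]

    def network(u):
        hidden = [step(sum((2 * pi - 1) * ui for pi, ui in zip(p, u)) - sum(p))
                  for p in patterns]
        return step(sum(hidden) - 0.5)

    return network
```

For example, `build_network(lambda p: p[0] ^ p[1], 2)` yields a network computing XOR with two hidden neurons.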

Verdict: unsatisfying.

## II.III

If the universal approximation theorem cannot answer the question, what then? The best explanation I've found is from a paper co-authored by Max Tegmark – the guy who wrote Our Mathematical Universe and is the president of the Future of Life Institute. It's titled "Why does deep and cheap learning work so well?" and it begins with the observation that even simple tasks such as image classification are, in some sense, impossible to solve by a neural network. As mentioned, a 200-by-200 pixel gray-scale image with just a single byte for each pixel defines the input space $\{0, \ldots, 255\}^{40000}$. That space has size $256^{40000}$ (i.e., it consists of $256^{40000}$ encodings for images), which is roughly $10^{96.330}$. (Note that the dot is not a decimal point.) But there's no reason to be so modest; neural networks are, in fact, used to classify much larger images. Let's say our images are 1000-by-1000 pixels with 4 bytes per pixel. In that case, the space has $(2^{32})^{1.000.000}$ elements, which is about $10^{9.633.000}$. On the other hand, a neural network with, say, 1000 neurons has at most about 1.000.000 weights and therefore only $c^{1.000.000}$ possible combinations, where $c$ is the number of different values each weight can take. Even if we're overly generous and assume that each weight can lead to 100 meaningfully different configurations, that would only provide us with the ability to differentiate between at most $100^{1.000.000} = 10^{2.000.000}$ different cases – barely enough for tiny gray-scale images, but nowhere near enough for decently sized images. This follows because a neural network cannot possibly differentiate more images than it has different configurations. And we didn't even talk about the amounts of training data that are available in practice, which are nowhere near as large.

This suggests that the situation in practice looks something like this:

And one is left to wonder how this could ever work out.

The solution the paper suggests is to replace the above image with this one:

That is, the problem state space has actually been generated by a physical process that is far simpler to specify than the state space itself, and usually even simpler than the neural network. For example, cats (or even pictures of cats) are pretty complicated mathematical objects, but the process which has generated cats (namely evolution) is relatively simple. Similarly, the phrase "draw cat pictures," which one can give to a bunch of artists, is far simpler than the images they will produce. Thus, while it is true that a neural network with ~1000 neurons can only give reasonable answers to a tiny subset of images in image-space, this is acceptable: most of the images we care about are contained in that subset. Perhaps most strikingly, the standard model of physics only has 32 parameters, according to the paper.

Now, you might recall that the title of the paper mentioned "deep learning." This term refers to a neural network with a lot of hidden layers, as opposed to just one or two. The fact that deep learning has had massive success in practice stands in contrast to the universal approximation theorem, which argues using a network with only a single hidden layer. It is probably fair to say that "shallow" networks are more similar to other kinds of predictors and less analogous to what the human brain does.

So why neural nets – and more specifically, deep learning – rather than any other technique? On this point, Lin et al. (Tegmark is part of the "et al.") argue that nature is fundamentally hierarchical:

One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.

So we can rethink the picture yet again to arrive at something like this:

So the idea is not just that the small complexity of the generation process makes it feasible for neural networks to accomplish tasks such as recognizing faces, but also that a network may learn by reversing this generative process. Furthermore, since this process naturally takes place in multiple steps, the most effective neural networks will themselves be performing their computations in multiple steps. The paper also proves something to the effect that generative hierarchies can be "optimally reversed one step at a time," but I can't tell you what exactly that means since the statement is formalized in information theory.

Now, do "multiple steps" translate into "multiple layers"? The answer to that is a pretty clear yes (here is where we need to introduce a bit of math). Notice that, even if a neural network has many input nodes, it can be considered to modify just a single object – namely a vector $x \in \mathbb{R}^n$, where $n$ is the number of input neurons. Now the entire function which the neural network computes can be written down like so:

$$f = \sigma_L \circ A_L \circ \cdots \circ \sigma_2 \circ A_2,$$

where $L$ is the number of layers, the $A_i$ are linear functions (matrices), and the $\sigma_i$ are the functions that the neurons implement, applied coordinate-wise. The symbol $\circ$ means that we first apply the right function, then the left one. If the input vector is fed into the neural network, it first passes unchanged to the input neurons (recall that, unlike all other neurons, they don't compute a function). Then the values are scaled according to the weights between layers 1 and 2, which corresponds to the linear transformation $A_2$, our first transformation. (It's indexed with 2 because it leads into layer 2.) Now the neurons in the second layer (the first hidden layer) apply their functions, which corresponds to the component-wise application of $\sigma_2$ (provided all neurons on the layer implement the same function, which is an assumption we make here). Then the outputs are scaled according to the weights between layers 2 and 3, which corresponds to the linear transformation $A_3$, and so on. In the end, the output neurons apply their function component-wise.
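
As a sketch, here is that composition for a tiny made-up network (2 inputs, one hidden layer of 3 neurons, 1 output). The matrix shapes and the choice of tanh as the per-layer function are illustrative assumptions, not taken from the post:

```python
import numpy as np

# a made-up instance of f = sigma_3 . A_3 . sigma_2 . A_2
rng = np.random.default_rng(0)
sigma = np.tanh                      # one function per layer, applied coordinate-wise
A2 = rng.normal(size=(3, 2))         # linear map leading into layer 2
A3 = rng.normal(size=(1, 3))         # linear map leading into layer 3 (the output layer)

def f(x):
    # apply the right-most transformation first, then work leftwards
    return sigma(A3 @ sigma(A2 @ x))

y = f(np.array([0.5, -1.0]))
```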

Hence the neural network does indeed apply one (pair of) transformations per layer. However, no-one knows what neural networks are actually doing most of the time, so all of this is still largely speculative.

Nonetheless, here is an interesting observation that backs up Lin et al.'s claims. (Skip this part if you have never taken a course in Linear Algebra.) Suppose we declare that the neurons aren't allowed to do anything (i.e., they just implement the identity function). In that case, the neural network decomposes into the transformations

$$f = A_L \circ \cdots \circ A_2.$$

Composing many linear functions itself yields a linear function, so this neural network does nothing except apply a single matrix to the input vector, namely the matrix $A_L \cdots A_2$.
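
A quick numerical check of this collapse, with arbitrary matrices:

```python
import numpy as np

# with identity activations, applying A3 after A2 equals applying the
# single matrix A3 @ A2; the matrix entries here are arbitrary
rng = np.random.default_rng(1)
A2 = rng.normal(size=(4, 3))
A3 = rng.normal(size=(2, 4))

def linear_network(x):
    return A3 @ (A2 @ x)             # evaluated layer by layer

collapsed = A3 @ A2                  # flattened into one matrix
x = rng.normal(size=3)
```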

Let's suppose the input elements form a matrix $X$, and the goal of the network is to multiply that matrix with another matrix $B$. Then this can be trivially realized with just an input and an output layer (no hidden layers), by setting $A_2 = B$. This corresponds to the standard way of doing matrix multiplication, which requires $O(n^3)$ operations. But there are also (much more sophisticated) algorithms with funny time complexities such as $O(n^{2.373})$ – point being, they're faster for very large matrices. Thus, since neural networks are universal approximators, there exists a neural network that implements matrix multiplication with similar time complexity. This is impossible to replicate on a shallow network. (Related: proofs in linear algebra often argue by decomposing a matrix into smaller ones.)

The paper calls these results "no-flattening theorems" and goes on to reference a bunch of other examples. So it is possible to rigorously prove that at least some specific neural networks cannot be flattened without losing efficiency. And it's also worth noting that there exists a version of the universal approximation theorem which manages to restrict a network's width in favor of more hidden layers.

That's all I have on the why question. Has it been a satisfying answer? I would vote a firm no – it seems a rather long way from being satisfactory. Nonetheless, I at least feel like I now have some nonzero insight into why neural networks are powerful, which is more than I had before reading the paper.

# III

This final chapter is where we look at how to train a neural network. From this point onward, the post will get a lot more mathy – in particular, it assumes you know derivatives and the chain rule. If you don't, feel free to skip right to the end.

First, we need to discuss which part of the network we wish to learn. In principle, there are many possible approaches, but in practice, one usually fixes everything except the weights. That is, one fixes

• the underlying graph (i.e., all neurons and edges)

• the functions $\sigma_i$ inside the neurons (which aren't very complicated; they might even all be the same)

and then looks for the set of weights with which the network performs best. The above is also called the architecture of the network. Any given architecture can implement a wide variety of different functions (by giving it different weights).

As mentioned, we assume that we're in the setting of supervised learning, where we have access to a sequence $S = ((x_1, y_1), \ldots, (x_m, y_m))$ of training examples. Each $x_i$ is an input to the network for which $y_i$ is the corresponding correct output.

The idea is now to train the neural network to do well on the training data and hope that this will cause it to do well in the real world.

How do we do that? Given infinite computing power, there is a simple algorithm: we choose some finite encoding for real numbers, say 64-bit encoding, and perform an exhaustive search over all possible combinations of weight values; then we test the network with each one and stick with the combination that works best. Unfortunately, with $W$ weights there are $2^{64W}$ combinations of weight values, so this approach would require $2^{64W}$ rounds, where each round consists of testing the neural network with those weights on all of $S$. This is infeasible even for our small network; therefore, we need a more efficient approach. The algorithm we will study is called backpropagation, and it relies on something called Stochastic Gradient Descent.

## III.I

Here is how it works. Suppose we give the network some preliminary weights so that it is fully defined, and we focus on one particular training point, $(x_i, y_i)$.

If we feed $x_i$ into the network, we will get some number as the result. Let's call it $y'$.

This prediction will be good if $y'$ is close to $y_i$ and bad otherwise, since $y_i$ is the correct output for $x_i$. Thus, the difference between them is a measure for the quality of the prediction. For unimportant reasons, we square the difference, so we take $(y' - y_i)^2$ as our measure for how well our network did (on this one point). If $(y' - y_i)^2$ is small, it did well; if it's large, it did poorly.

This difference depends on all of the weight values, because all of them were used to compute $y'$. Suppose we arbitrarily pick out one such weight $w$ and pretend that all other weights are fixed. Then we can consider the difference $(y' - y_i)^2$ as a function of that weight. We call that function the loss function. To reiterate: the loss function takes a value for the weight $w$, and outputs the value of $(y' - y_i)^2$ given that we apply the network with that value for the weight $w$. (Remember that all other weights are fixed.) And if we can somehow compute this function, we can also take its derivative. For the weight $w$, this will be the number $\frac{\partial (y' - y_i)^2}{\partial w}$.

If $\frac{\partial (y' - y_i)^2}{\partial w}$ is positive, we know that $(y' - y_i)^2$ will increase if we increase the value for $w$ (and conversely, it will decrease if we decrease the value for $w$). If it is negative, we know that $(y' - y_i)^2$ will decrease if we increase the value for $w$. In both cases, we know how to change $w$ so that the term $(y' - y_i)^2$ will decrease – which is what we want. Now we do this for each weight separately, i.e., we change each weight such that $(y' - y_i)^2$ decreases. The result of this step is a neural network in which all weights are a little bit better than before.

And now – starting with our new weights – we do all of this again for the next training point $(x_{i+1}, y_{i+1})$. That will again update our weights, and we can repeat the process for a third point, and so on. After updating on all the training points we have, the final weights are the algorithm's output. How much we change each weight is determined by a parameter of the algorithm called the step size. It usually decreases over the course of the algorithm, i.e., we move the weights more in the beginning, when we know they aren't yet any good.
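
Here is a sketch of that update loop on a deliberately tiny model with a single weight. The model $f(x) = w \cdot x$, the data, and the finite-difference derivative are stand-ins of my own; the derivative is what backpropagation computes properly for a real network:

```python
def loss(w, x, y):
    # squared error (f(x) - y)^2 for the toy model f(x) = w * x
    return (w * x - y) ** 2

def train(samples, w=0.0, step_size=0.1):
    for x, y in samples:
        # estimate dLoss/dw numerically (backprop would compute it exactly)
        eps = 1e-6
        dloss_dw = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
        w -= step_size * dloss_dw    # nudge w so the loss decreases
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50   # data drawn from y = 2x
w_final = train(samples)
```

On this consistent data the weight settles close to 2, the value that makes every prediction exact.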

## III.II

The remaining question is how to compute the derivative of the loss function with regard to each weight. (From here on, "derivative" will always mean "derivative of the loss function.")

Remember how the network was evaluated layer by layer? Evaluating the derivative with regard to each weight will also be done layer by layer, but this time in reverse. That's why it's called backpropagation. Let's begin by computing the derivative with regard to the weight of an edge leading into the final layer – let's take $w_{6,9}$.

At some point in chapter I, we zoomed into the sixth neuron to see that there are actually three distinct elements we need to differentiate: the input to the neuron, i.e., the sum of the weighted values from the incoming edges ($a_6$), the function of the neuron ($\sigma_6$), and the output after applying the function ($o_6$). Now we need to do the same with the ninth neuron.

Note that $o_9$ is the same as $y'$. We're interested in the term $\frac{\partial (y' - y_i)^2}{\partial w_{6,9}}$. Using the chain rule, we can write

$$\frac{\partial (y' - y_i)^2}{\partial w_{6,9}} = \frac{\partial (y' - y_i)^2}{\partial o_9} \cdot \frac{\partial o_9}{\partial a_9} \cdot \frac{\partial a_9}{\partial w_{6,9}}.$$

The first term is $2(y' - y_i)$. (This notation is compact but somewhat sloppy – in truth, the derivative is a function which needs to be applied to a parameter.) The second term is the derivative of the output of the ninth neuron with regard to its input; it depends on the function $\sigma_9$. And the third term is easy; we have

$$\frac{\partial a_9}{\partial w_{6,9}} = o_6.$$

Note that these are all known quantities – we have obtained the $o_j$ and $a_j$ by evaluating the neural network on the input $x_i$.

At this point, we perform a non-obvious change of course. Even though the value $\frac{\partial (y' - y_i)^2}{\partial w_{6,9}}$ is what we really care about, to process the entire network, it will be more useful to focus on the term $\frac{\partial (y' - y_i)^2}{\partial a_9}$. That is, we will focus on the derivative with regard to the input value of the neuron of the final layer, rather than with regard to the weights of the incoming edges. This is easily done – we just drop the third term in the chain rule above, i.e., $\frac{\partial (y' - y_i)^2}{\partial a_9} = \frac{\partial (y' - y_i)^2}{\partial o_9} \cdot \frac{\partial o_9}{\partial a_9}$. Once we have those derivatives for the entire network (i.e., the derivatives with regard to the input values of every neuron), it will be easy to reconstruct the derivatives with regard to the weights – we just multiply with the third term again.

Thus, the way we will process the entire network is via the following three steps:

• compute the derivatives with regard to the inputs of the neurons of the final layer

• for each layer $\ell$ (starting from the second to last), compute the derivatives with regard to the inputs of the neurons of layer $\ell$ in terms of the derivatives with regard to the inputs of the neurons of layer $\ell + 1$

• for each weight $w_{j,k}$, compute $\frac{\partial (y' - y_i)^2}{\partial w_{j,k}} = \frac{\partial (y' - y_i)^2}{\partial a_k} \cdot o_j$

We've already dealt with step #1, and step #3 is fully described by the equation above. All that's left is to demonstrate how we can do step #2. So suppose we have already processed layers three and four, and now we wish to process layer two. Then we already know the derivatives $\frac{\partial (y' - y_i)^2}{\partial a_6}$, $\frac{\partial (y' - y_i)^2}{\partial a_7}$, and $\frac{\partial (y' - y_i)^2}{\partial a_9}$ (check with the network above). We now demonstrate how, using this, we can compute the derivative with regard to the input of a neuron in the second layer. Let's choose the fourth neuron as an example; they all work the same way. Once again, we zoom into the relevant part of the network:

Now we have

$$\frac{\partial (y' - y_i)^2}{\partial a_4} = \frac{\partial (y' - y_i)^2}{\partial o_4} \cdot \frac{\partial o_4}{\partial a_4}.$$

The second term depends on $\sigma_4$. What about the first? Here is where the last complication comes in. The way $o_4$ influences the error cannot be reduced to any single derivative in the third layer, because the fourth neuron feeds into both the sixth and the seventh neuron. Thus, we have to apply the advanced chain rule to write

$$\frac{\partial (y' - y_i)^2}{\partial o_4} = \frac{\partial (y' - y_i)^2}{\partial a_6} \cdot \frac{\partial a_6}{\partial o_4} + \frac{\partial (y' - y_i)^2}{\partial a_7} \cdot \frac{\partial a_7}{\partial o_4}.$$

Now $\frac{\partial (y' - y_i)^2}{\partial a_6}$ and $\frac{\partial (y' - y_i)^2}{\partial a_7}$ are terms we have computed in previous rounds, and the other two terms are easy – we have $\frac{\partial a_6}{\partial o_4} = w_{4,6}$ and $\frac{\partial a_7}{\partial o_4} = w_{4,7}$.
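
The whole procedure can be sketched compactly if we write the network with one weight matrix and bias vector per layer (layer sizes 1-2-2-1 as in the running example; the concrete numbers are made up, and the logistic function plays the role of every $\sigma_j$, so its derivative is $\sigma(1 - \sigma)$):

```python
import numpy as np

def sigma(z):
    # assumed per-neuron function: the logistic function
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    # forward pass, layer by layer, remembering every layer's outputs o_j
    outs = [np.atleast_1d(x)]
    for W, b in zip(Ws, bs):
        outs.append(sigma(W @ outs[-1] + b))
    y_hat = outs[-1][0]

    # backward pass: d holds dLoss/da_j, the derivatives with regard to
    # the *inputs* of the neurons, starting with the final layer
    grads_W = [None] * len(Ws)
    d = 2 * (y_hat - y) * outs[-1] * (1 - outs[-1])
    for l in range(len(Ws) - 1, -1, -1):
        grads_W[l] = np.outer(d, outs[l])          # dLoss/dw_jk = dLoss/da_k * o_j
        if l > 0:                                  # advanced chain rule: sum over outgoing edges
            d = (Ws[l].T @ d) * outs[l] * (1 - outs[l])
    return y_hat, grads_W                          # (bias derivatives omitted for brevity)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 1)), rng.normal(size=(2, 2)), rng.normal(size=(1, 2))]
bs = [rng.normal(size=2), rng.normal(size=2), rng.normal(size=1)]

# sanity check: compare one derivative against a finite difference of the loss
x, y = 0.7, 1.0
_, grads = backprop(x, y, Ws, bs)
eps = 1e-6
Ws[1][0, 1] += eps
hi = (backprop(x, y, Ws, bs)[0] - y) ** 2
Ws[1][0, 1] -= 2 * eps
lo = (backprop(x, y, Ws, bs)[0] - y) ** 2
Ws[1][0, 1] += eps
gap = abs((hi - lo) / (2 * eps) - grads[1][0, 1])
```

The finite-difference check at the end is exactly the "nudge one weight and watch the loss" picture from the previous section, and it agrees with the backpropagated derivative to many decimal places.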

## III.III

If you have read the previous posts in this sequence, or just know about gradient descent, you might object that the learning process will get stuck in a local minimum because the loss function is non-convex. In less jargony language: suppose we have gone through a few rounds of the training process with backpropagation. It could be that now our weights are locally optimal – any small change will make the network perform worse – but if we changed them far enough, we could still improve its performance. We don't want the step size to be too large; otherwise, we would keep jumping over the target. Thus, if the network is large, the algorithm will inevitably get stuck in a local minimum.

The practical solution is simply to run the entire backpropagation algorithm a bunch of times, say 1000 times, starting from different initial weights. The hope is that, if each run of the algorithm finds some local optimum, and we then take the best of those, the result could still be pretty good. While repeating the entire algorithm 1000 times may seem inefficient, recall that the brute-force approach would require upwards of $2^{64W}$ steps, where $W$ is the number of weights. For large networks, even running backpropagation a million times would be vastly more efficient than using a brute-force search.

It's worth noting that neural networks are popular because they perform well in practice, not because of theoretical results. Given how important they are, the existing theory is rather underwhelming. But the results are no less impressive.

Finally, I've skipped over the functions inside of the neurons all post long. That's because I think they're quite nonessential, and introducing them would complicate the key ideas. But for the sake of completeness: a choice still found in textbooks is the logistic function $\sigma$, given by the equation

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

Its graph looks like this (picture from Wikipedia):

So it simply ranges between 0 and 1; it gets close to 0 for very small (negative) inputs, and close to 1 for very large inputs. This particular function has the lovely property that $\sigma' = \sigma(1 - \sigma)$, i.e., its derivative is very easy to compute. However, I've been told that this function is not commonly used in practice anymore, and instead, a popular choice is something called the rectifier function, defined by

$$\mathrm{ReLU}(x) = \max(0, x).$$

Its graph looks like this,

and its derivative is no harder to compute, as it's either 1 or 0.
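
For completeness, both functions and their derivatives as a short sketch (the convention of taking the rectifier's derivative to be 0 at the kink at $x = 0$ is a common choice, not something the post specifies):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    # the lovely property: sigma' = sigma * (1 - sigma)
    s = logistic(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # either 1 or 0 (picking 0 at x = 0 by convention)
    return 1.0 if x > 0 else 0.0
```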

• Slight nitpick: since the logistic function isn't actually used in practice anymore, it might be better to start people new to the sequence with a better activation function like ReLU or tanh?

When I was first learn­ing this ma­te­rial, due to peo­ple men­tion­ing sig­moid a lot, I thought it would be a good de­fault, and then I learned later on that it’s ac­tu­ally not the ac­ti­va­tion func­tion of choice any­more, and hasn’t been for a while. (See, for ex­am­ple, Yann LeCun here in 1998 on why nor­mal sig­moid has draw­backs.)

• Okay – since I don’t ac­tu­ally know what is used in prac­tice, I just added a bit para­phras­ing your cor­rec­tion (which is con­sis­tent with a quick google search), but not sel­l­ing it as my own idea. Stuff like this is the down­side of some­one who is just learn­ing the ma­te­rial writ­ing about it.

• Gen­eral:

This was a very clear ex­pla­na­tion. Sim­plifi­ca­tions were used, then dis­carded, at good points. Every­thing built up very well, and I feel I have a much clearer un­der­stand­ing—and more spe­cific ques­tions. (Like how is the num­ber of nodes/​lay­ers cho­sen?)

Speci­fics:

$f = A_\ell \circ \cdots \circ A_2$

What’s the “ℓ”? (I’m un­clear on how one iter­ates from L to 2.)

Nonethe­less, I at least feel like I now have some nonzero in­sight into why neu­ral net­works are pow­er­ful, which is more than I had be­fore read­ing the pa­per.

And you’ve ex­plained the ‘ML is just ma­trix mul­ti­pli­ca­tion no one un­der­stands’ joke, which I ap­pre­ci­ate.

As mentioned, we assume that we're in the setting of supervised learning, where we have access to a sequence $S = ((x_1, y_1), \ldots, (x_m, y_m))$ of training examples. Each $x_i$ is an input to the network for which $y_i$ is the corresponding correct output.

This topic de­serves its own com­ment. (And me figur­ing out the for­mat­ting.)

For unim­por­tant rea­sons, we square the difference

Ab­solute value, be­cause big­ger er­rors are quadrat­i­cally worse, it was tried and it worked bet­ter, or tra­di­tion?

This makes it con­ve­nient to use in the back­prop­a­ga­tion al­gorithm.

Al­most as con­ve­nient as the iden­tity func­tion.

• What’s the “ℓ”? (I’m un­clear on how one iter­ates from L to 2.)

$\ell$ is the number of layers. So if it's 5 layers, then $f = A_5 \circ A_4 \circ A_3 \circ A_2$. It's one fewer transformation than the number of layers because there is only one between each pair of layers.
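To make the counting concrete, here's a small sketch (layer sizes, weights, and the choice of ReLU are all arbitrary illustrations) of a layered network as a composition of one transformation per consecutive pair of layers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Four layers of sizes [3, 4, 4, 2] give len(sizes) - 1 = 3
# transformations: one between each consecutive pair of layers.
sizes = [3, 4, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]

def relu(v):
    # Elementwise rectifier as the function inside the neurons.
    return np.maximum(0.0, v)

def forward(x):
    # The composition f = A_l ∘ ... ∘ A_2, applied innermost first.
    for W in weights:
        x = relu(W @ x)
    return x

y = forward(np.ones(3))
```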

Ab­solute value, be­cause big­ger er­rors are quadrat­i­cally worse, it was tried and it worked bet­ter, or tra­di­tion?

I gen­uinely don’t know. I’ve won­dered for­ever why squar­ing is so pop­u­lar. It’s not just in ML, but ev­ery­where.

My best guess is that it's in some fundamental sense more natural. Suppose you want to guess a location on a map. In that case, the obvious error would be the straight-line distance between you and the target. If your guess is $(x_1, y_1)$ and the correct location is $(x_2, y_2)$, then the distance is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ – that's just how distances are computed in 2-dimensional space. (Draw a triangle between both points and use the Pythagorean theorem.) Now there's a square root, but actually the square root doesn't matter for the purposes of minimization – the square root is minimal if and only if the thing under the root is minimal, so you might as well minimize $(x_1 - x_2)^2 + (y_1 - y_2)^2$. The same is true in 3-dimensional space or $n$-dimensional space. So if general distance in abstract vector spaces works like the straight-line distance does in geometric space, then squared error is the way to go.
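A quick numerical check of the "square root doesn't matter for minimization" point (the target and candidate points are purely illustrative):

```python
import math

target = (2.0, 3.0)
candidates = [(0.0, 0.0), (1.0, 2.5), (2.1, 2.9), (4.0, 1.0)]

def distance(p):
    # Straight-line (Euclidean) distance to the target.
    return math.sqrt((p[0] - target[0])**2 + (p[1] - target[1])**2)

def squared_error(p):
    # The same expression without the square root.
    return (p[0] - target[0])**2 + (p[1] - target[1])**2

# The square root is monotone, so both criteria pick the same point.
best_by_distance = min(candidates, key=distance)
best_by_squared = min(candidates, key=squared_error)
```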

Also, thanks :)

• One rea­son for us­ing squared er­rors, which may be good or bad de­pend­ing on the con­text, is that it’s usu­ally eas­ier to Do Math­e­mat­ics on it.

• Squared error has been used instead of absolute error in many diverse optimization problems in part because its derivative is proportional to the magnitude of the error, whereas the derivative of the absolute error is constant. When you're trying to solve a smooth optimization problem with gradient methods, you generally benefit from loss functions with a smooth gradient that tends towards zero along with the error.
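The contrast in gradients is easy to see directly (a sketch of my own, treating the loss as a function of the error itself):

```python
def squared_error_grad(error):
    # d/de (e^2) = 2e: shrinks in proportion to the error.
    return 2 * error

def absolute_error_grad(error):
    # d/de |e| = sign(e): magnitude 1 no matter how small the error.
    return 1.0 if error > 0 else -1.0

# Near the optimum, squared error takes ever-smaller gradient steps,
# while absolute error keeps stepping at full size and can overshoot.
```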

• Another possible reason for using squared error is that, from a stats perspective, the Bayes (optimal) estimator under squared error will be the mean of the distribution, whereas the optimal estimator under MAE will be the median. It's not clear to me that the mean's what you want, but maybe?
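This mean-vs-median split is easy to verify numerically (toy data and a grid search over constant predictions, purely for illustration):

```python
data = [1.0, 2.0, 3.0, 4.0, 100.0]

def squared_loss(c):
    return sum((x - c)**2 for x in data)

def absolute_loss(c):
    return sum(abs(x - c) for x in data)

# Scan a grid of constant predictions: squared loss is minimized at
# the mean, absolute loss at the median.
candidates = [i / 10 for i in range(0, 1001)]
best_squared = min(candidates, key=squared_loss)
best_absolute = min(candidates, key=absolute_loss)

mean = sum(data) / len(data)           # 22.0
median = sorted(data)[len(data) // 2]  # 3.0
```

Note how the single outlier at 100 drags the squared-loss minimizer far from the bulk of the data, while the absolute-loss minimizer stays put.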

• I always thought there ought to be a good way to ex­plain neu­ral net­works start­ing at back­prop.

Since to some extent the criterion for architecture selection always seems to be whether or not gradients will explode very easily.

As in, I feel like the correct definition of neural networks in the current climate is closer to: “A computation graph structured in such a way that some optimizer/loss function combination will make it adjust its equation towards a seemingly reasonable minimum”.

Be­cause in prac­tice that’s what seems to define a good ar­chi­tec­ture de­sign.