Generalising CNNs

TL;DR: The invention of Convolutional Neural Networks for image and audio processing was a key landmark in machine learning.

This post is for people who already know what CNNs are, and are interested in how to riff on and extend the core reason (perhaps?) that CNNs learn faster. Probing this technology is one ‘sub goal’ in questioning where our AI knowledge is heading, and how fast. In turn, that’s because we want it to progress in a good direction.

Sub Goal

Q: Can the reduction in the number of parameters that a CNN introduces be achieved in a more general way?

A: Yes. Here are sketches of two ways:

1) Saccades. Train one network (layer) on attention: train it to learn which local blocks of the image deserve attention. Train the second part of the network on those chosen ‘local blocks’ in conjunction with the coordinates of their locations.

The number of blocks that have large CNN kernels applied to them is much reduced; those blocks are the ones that matter.
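As a concrete toy illustration, here is a minimal NumPy sketch of the selection step. The per-block variance score stands in for the learned attention network (which is, of course, the hard part); the block size, the top-k cutoff, and every name here are illustrative assumptions, not a worked-out design.

```python
import numpy as np

def block_scores(image, block=8):
    # Cheap saliency proxy: per-block variance of the raw image.
    # A stand-in for the small learned attention network.
    h, w = image.shape
    scores = {}
    for i in range(0, h, block):
        for j in range(0, w, block):
            scores[(i, j)] = image[i:i + block, j:j + block].var()
    return scores

def saccade_features(image, block=8, k=4):
    # Keep only the k highest-scoring blocks; return each block
    # together with its coordinates, as input for the second network.
    scores = block_scores(image, block)
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    return [(c, image[c[0]:c[0] + block, c[1]:c[1] + block]) for c in chosen]

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[8:16, 8:16] = rng.normal(size=(8, 8))   # one "interesting" region
picked = saccade_features(img, block=8, k=1)
# Only the single high-variance block, plus its coordinates, survives.
```

Of the sixteen 8x8 blocks, only one gets an expensive kernel applied to it; the rest are skipped entirely.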

2) Parameter Compression. Give each layer of a neural network more (potential) connections than you think will actually end up being used. After training for a few cycles, compress the parameter values using a lossy algorithm, always choosing the compression which scores best on some weighting of size and quality. Decompress and repeat this process until you have completed the training set.

The number of bits used to represent the parameters is kept low, which helps guard against overfitting.
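A minimal sketch of that train/compress/decompress loop, with two placeholder assumptions: plain gradient descent on a linear least-squares model as the "network", and uniform quantisation of the weights as the lossy algorithm. The bit width and cycle lengths are arbitrary choices for illustration.

```python
import numpy as np

def quantize(w, bits=4):
    # Lossy compression: snap weights onto 2**bits evenly spaced levels,
    # so the whole vector is describable in very few bits.
    lo, hi = w.min(), w.max()
    if hi == lo:
        return w.copy()
    levels = 2 ** bits - 1
    return np.round((w - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [1.0, -2.0, 0.5]
y = X @ true_w

w = np.zeros(10)
for cycle in range(20):
    for _ in range(10):                    # a few ordinary training steps
        grad = X.T @ (X @ w - y) / len(X)
        w -= 0.1 * grad
    w = quantize(w, bits=4)                # compress, decompress, keep going

loss = np.mean((X @ w - y) ** 2)
# Training still converges, yet w never holds more than 16 distinct values.
```

Despite the repeated lossy rounding, training converges to a low loss while the parameter vector stays cheap to describe.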


[Doubter] This all sounds very hand-wavy. How exactly would you train a saccadic network on the right movements?

[Optimist] One stepping stone, before you get to a true saccadic network whose locus of attention follows a temporal trajectory, is to train a shallow network to classify where to give attention. So this stepping stone outputs a weighting for how much attention to give to each location. To be concrete: it works on a downsampled image and outputs 0 for no attention, 1 for convolution with a 3x3 kernel, and 2 for convolution with a 5x5 kernel.
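That stepping stone might look something like this NumPy sketch: an attention map assigns each block a level in {0, 1, 2}, selecting no convolution, a 3x3 kernel, or a 5x5 kernel. The averaging kernels and the hand-written attention map are placeholders for what would actually be learned.

```python
import numpy as np

def convolve2d(patch, kernel):
    # Naive 'valid' 2-D convolution, sufficient for the sketch.
    kh, kw = kernel.shape
    out = np.zeros((patch.shape[0] - kh + 1, patch.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (patch[i:i + kh, j:j + kw] * kernel).sum()
    return out

def apply_attention(image, attention, block=8):
    # attention[(i, j)] in {0, 1, 2}: 0 = skip, 1 = 3x3, 2 = 5x5 kernel.
    kernels = {1: np.ones((3, 3)) / 9, 2: np.ones((5, 5)) / 25}
    results = {}
    for (i, j), level in attention.items():
        if level == 0:
            continue                      # zero attention: no work at all
        patch = image[i:i + block, j:j + block]
        results[(i, j)] = convolve2d(patch, kernels[level])
    return results

img = np.arange(256, dtype=float).reshape(16, 16)
attn = {(0, 0): 0, (0, 8): 1, (8, 0): 2, (8, 8): 1}
out = apply_attention(img, attn)
```

Blocks with attention level 0 cost nothing; the kernel size, and hence the compute spent, grows with the attention level.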

[Doubter] You still haven’t said how you would do that attention training.

[Optimist] You could reward a network for robustness to corruption of the image, and reward it for zeroes in the attention layers.
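One hedged reading of that reward, as a loss function: penalise any change in the output when the image is corrupted, plus a term that charges for every non-zero attention level. The quadratic robustness penalty and the lambda weighting are invented here for illustration, not part of the proposal.

```python
import numpy as np

def attention_loss(clean_logits, corrupt_logits, attention_map, lam=0.01):
    # Penalise (a) output drift under image corruption and
    # (b) total attention spent -- zeroes in the map are 'free'.
    robustness = np.mean((clean_logits - corrupt_logits) ** 2)
    sparsity = attention_map.sum()
    return robustness + lam * sparsity

logits = np.array([0.1, 0.9])
attn_sparse = np.array([[0, 1], [0, 0]])
attn_dense = np.array([[2, 2], [2, 2]])
# With equal robustness, the sparser attention map scores the lower loss.
```

The trade-off between robustness and sparsity would be set by lambda; the Doubter's Catch-22 below is precisely that computing robustness still requires running the network.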

[Doubter] That’s not clear, and I think there is a Catch-22: you need to have analysed the image already to decide where to give it attention.

[Optimist] …but not analysed in full detail. Use only a few downsampled layers to decide where to give attention. You save a ton of CPU by only giving more attention where it is needed.

[Doubter] I really doubt that. You will pay for that saving many times over through the less regular pattern of ‘attention’ and the more complex code. It will also be really hard to accelerate on a GPU as well as is already done with standard CNNs. Besides, even a 16x reduction in total workload (and I actually doubt there would be any reduction at all) is not that significant. What actually matters is the quality of the end result.

[Optimist] We shouldn’t be worrying about that GPU. That’s ‘premature optimisation’. You’re artificially constraining your thinking by the hardware we use right now.

[Doubter] Nevertheless, GPUs are the hardware we have right now, and we want practical systems. An alternative to CNNs using a hybrid CPU/GPU approach at least has to come close on speed to current CNNs on GPU, and have some other key advantage.

[Optimist] Explainability in a saccadic CNN is better, since you have the explicit weightings for attention. For any output, you can show where the attention is.

[Doubter] But that is not new. We can already show where attention is by looking at which weights mattered in a classification. See for example the way we learned that ‘hands’ were important in detecting dumbbells, or that snow was important in differentiating wolves from dogs.

[Optimist] Right. And those insights into how CNNs classify were really valuable landmarks, weren’t they? Now we would have something more direct for that, since we can go straight to the attention weights. And we can explore better strategies for setting those weights.

[Doubter] You still haven’t explained exactly how the attention layers would be constructed, nor the later ‘better strategies’, nor how you would progress to temporal attention strategies. I doubt the basic idea would do more than a slightly deeper CNN would. Until I see an actual working example, I’m unconvinced. Can we move on to ‘parameter compression’?

[Optimist] Sure.


[Doubter] What I am struggling with is that you are throwing away data after a little training. Why ‘lossy compression’ and not ‘lossless compression’?

[Optimist] That’s part of the point of it. We’re trying to reward a low-bit-count description of the weights.

[Doubter] Hold on a moment. You’re talking more like a proponent of evolutionary algorithms than of neural networks. You can’t back-propagate a reward for a low-entropy solution back up the net. All you can do is choose one such parameter set over another.

[Optimist] Exactly. Neural networks are in fact just a particular, rather constrained case of evolutionary algorithm. I’d contend there is advantage in exploring new ways of reducing the degrees of freedom in them. CNNs do reduce the degrees of freedom, but not in a very general way. We need to add something like compression of parameters if we want low degrees of freedom with more generality.

[Doubter] In CNNs that lack of generality is an advantage. Your approach could encode a network with a ridiculously large number of useless non-zero weights whilst still using very few bits. That won’t work: it would take far longer to compute one iteration. It would be as slow as pitch drops dripping.

[Optimist] Right. So some attention must be paid to exactly what the lossy compression algorithm is. Just as JPEG throws away low-weight coefficients, this compression algorithm could too.
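In the JPEG spirit, one simple lossy scheme is magnitude pruning: keep only the largest-magnitude fraction of the weights and zero the rest, which directly addresses the ‘many useless non-zero weights’ worry above. A minimal sketch, with the kept fraction chosen arbitrarily:

```python
import numpy as np

def lossy_compress(w, keep=0.25):
    # Like JPEG discarding low-energy coefficients: zero out all but
    # the largest-magnitude `keep` fraction of the weights. The surviving
    # sparse vector is cheap to encode; the rest is lost for good.
    k = max(1, int(len(w) * keep))
    smallest = np.argsort(np.abs(w))[:-k]   # indices of the small weights
    out = w.copy()
    out[smallest] = 0.0
    return out

w = np.array([0.01, -1.5, 0.02, 0.8, -0.03, 0.4, 0.0, 2.0])
compressed = lossy_compress(w, keep=0.25)
# Only the two dominant weights (-1.5 and 2.0) survive compression.
```

The pruned network is both cheap to describe and cheap to run, since the zeroed connections can be skipped entirely.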

[Doubter] I have a couple of comments here. You have not worked out the details, right? It also doesn’t sound like this is bio-inspired, which was at least a saving grace of the saccadic idea.

[Optimist] Well, the compression idea wasn’t bio-inspired originally, but later I got to thinking about how genes could create many ‘similar patterns’ of connections locally. That could reproduce CNN-type connections, but genes can also produce similar patterns with long-range connections. So, for example, genes could learn the ideal density of long-range connections relative to short-range connections. That connection plan gets repeated in many places whilst being encoded compactly. In that sense genes are a compression code.
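A toy rendering of the ‘genes as a compression code’ idea: a single compact connection pattern (a ‘gene’ mapping offsets to weights) is stamped out at many positions. Unlike a CNN kernel, the offsets may be long-range. The specific gene, stride, and network size here are invented for illustration.

```python
import numpy as np

def grow_connections(gene, positions, size):
    # 'gene' maps neuron offsets to weights. The same compact pattern is
    # repeated at every position, like a CNN kernel -- except the offsets
    # may be long-range, which an ordinary convolution cannot express.
    W = np.zeros((size, size))
    for p in positions:
        for offset, weight in gene.items():
            q = p + offset
            if 0 <= q < size:
                W[p, q] = weight
    return W

# A pattern mixing short-range (+/-1) links with one long-range (+10) link.
gene = {-1: 0.5, 1: 0.5, 10: 0.2}
W = grow_connections(gene, positions=range(0, 30, 3), size=30)
# A 30x30 weight matrix described by just 3 gene entries plus a stride.
```

The 900-entry weight matrix carries at most 30 non-zero connections, all generated from a three-entry description; in that sense the gene is the compressed code for the connection plan.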

[Doubter] So you are mixing genetic algorithms and neural networks? That sounds like a recipe for more parameters.

[Optimist] …a recipe for new ways of reducing the number of parameters.

[Doubter] I think I see a pattern here, in that both ideas contain CNNs as a special case. With saccadic networks the secret sauce is some not-too-clear way you would program the ‘attention’ function. With parameter compression the secret sauce is the choice of lossy compression function. If you ‘got funded’ to do some demo coding, you could keep naive investors happy for a long while with networks that were actually no better than existing CNNs, plus plenty of promises of more to come with more funding. But the ‘more to come later’ never would come. Your deep problem is that the ‘secret sauce’ is more aspiration than anything actually demonstrable.

[Optimist] I think that’s a little unfair. I am not claiming these approaches are implemented, demonstrable improvements, nor that I know exactly how to get the details of these two ideas right quickly. You are also losing sight of the overall goal, which is to progress the value of AI as a positive transformative force.

[Doubter] Hmm. I see only a not-too-convincing claim of being able to increase the power of machine learning, and an attempt to burnish your ego and your reputation. Where is the focus on positive transformative force?

[Optimist] Breaking the mould on how to think about machine learning is a pretty important subgoal in progressing thought on AI, don’t you think? “Less Wrong” is the best possible place on the internet for engaging in discussion of the ethical progression of AI. If this ‘subgoal’ post does not gather any useful feedback at all, then I’ll have to agree with you that my post is not helping progress the possible positive transformative aspects of AI, and I’ll try again with another iteration and a different post, until I find what works.