How to make your CPU as fast as a GPU—Advances in Sparsity w/​ Nir Shavit

Link post

Great video running through the promise of sparse-accelerated machine learning. Important capabilities news for interpretability folks, though this is hardly the only way to get it. 30 minutes at 2x speed.


Today I'm talking to Nir Shavit about sparsity. Nir has long been active in the field as a professor at Technion and MIT, and has been awarded various prizes, such as the Gödel Prize in 2004 and the Dijkstra Prize in 2012. He's also the founder of a company called Neural Magic that questions one of the fundamental core principles of current machine learning, namely that you need GPUs. Neural Magic uses techniques such as sparsity, which we're going to talk about today, but also other optimization techniques, to make inference on models like BERT as fast on a regular CPU as on a GPU. This is pretty huge, and can have vast implications for where you can deploy these models and how expensive it gets to roll them out to many people in many places. So today we'll talk about the biological foundations for sparsity, why we shouldn't attempt to replicate the brain, and just what it takes to make something go really fast on just the CPU. I hope you enjoy this conversation. If you do, give Nir and his company a follow, and I'll see you around. Bye-bye.

Hi, this video is sponsored by AssemblyAI. AssemblyAI does real-time and batch transcription of audio and video files, powered by the latest advances in artificial intelligence. So if you are a developer, or work for a company that's looking to get more out of your audio or video data through transcription and audio intelligence, AssemblyAI is the best place to go. Not only do they have a user interface where you can just upload stuff, they also have a very powerful API. But transcription isn't all they do: once your audio is transcribed, they post-process it in many different optional ways. They can do things like speaker classification or annotations of various forms inside your audio. One feature I'd like to particularly highlight today is auto chapters: for this, simply provide auto_chapters=true on your upload, and AssemblyAI will, after transcribing your audio, automatically recognize chunks of audio where you talk about the same thing, give you a summary of those chunks, and a neat single description headline of what you were talking about there. This is absolutely ideal for anyone who does any sort of long-form podcasting or videos like mine, where viewers are very much helped by chapter annotations, and having these done automatically is just great. So if you're interested, head on over to AssemblyAI and use the link in the description to let them know I sent you. They are the single API to transcribe and understand audio; they do so in batch and in real time via WebSocket; they accept all kinds of audio and video formats; and they do so in over 15 languages. Give it a try, and thank you very much to AssemblyAI for sponsoring this video. And now, let's get into the video.

Yannic: The topic of sparsity is a big thing in neural networks right now, mostly because we have no idea, really, how to do it, and I think that makes for exciting times for the future. So, welcome. What brings you into the sparse world, actually?

Nir: You know, I've been a professor of computer science for many years. I worked on multicores for more than 30 years, and got involved in computational neurobiology in the last 10 years. And one of the things you really see in the brain is how sparse its computation is; it really is very, very sparse. And so, looking at neural networks, we see a similar phenomenon to what happens in brains happening in neural networks, where you can actually reduce the number of parameters through pruning by huge amounts and preserve the accuracy of the network. And that kind of says: okay, if we really want to have brain-like performance, sparsity is probably one of the tools we want to use to get there. So that's how I got into this.

Yannic: And you founded a company that works in this direction, right? Do you want to talk about that a little bit?

Nir: Yes, I founded Neural Magic. Neural Magic was founded because of what we were seeing in my lab. I was busy doing machine learning at a large scale for biology projects, and what we realized was that we could get CPUs to run at GPU speeds; at the time it was a Pascal GPU, and we could make a regular CPU do what the Pascal GPU was doing, through the use of sparsity and other similar techniques. And so we said: okay, there's real commercial value here, because you don't need an accelerator; you can just do it on your commodity CPU. And that's Neural Magic. What we do is deliver, through sparsity and similar optimization techniques, GPU performance on CPUs.

Yannic: That is quite a promise. Maybe let's first dive into sparsity itself a little bit. What is it about sparsity? You mentioned the brain is very sparse, yet the way we currently train neural networks is very dense, and we can accelerate dense neural networks much better. Is it just the saving of parameters, or is there something more to sparse connections than to dense connections? What do we know?

Nir: That's a good question. Clearly, what we're doing today is not the sparsity that we will be doing in the future.

What I mean by that is: your brain is sparse way beyond the levels of what we see in neural networks today. Your typical brain, in terms of compute: your cortex is like a cell phone of compute, but the graph is enormous; you need petabytes to basically hold it. So: a cell phone of compute on a petabyte or more of memory. But the accelerators we build are designed to deliver petaflops of compute on a cell-phone-sized memory; their memory is very limited, because they use high-bandwidth memory. So in a sense we're building the opposite of what we want. If we want to mimic the brain, we should not busy ourselves so much with the amount of compute, and rather worry about how we implement this very large graph. It's a very large graph, but it's extremely sparse; that's the point. And, as you asked: the sparsity is not necessarily the same sparsity we do today through pruning techniques. It's a combination of a very sparse architecture together with sparsity in what we call the kernels. So it's not just that the kernels are sparse; everything in the design is very, very sparse.

And we don't yet know how to design very sparse architectures. Part of that has to do with the fact that machine learning grew up in the GPU world, where sparsity is not an advantage, because you're doing lockstep computations, so you win nothing by being very sparse. Therefore we don't see those architectural sparsity things yet, but I'm expecting that to happen; this should come along. And even more than that, things are starting to show up, like the Pathways models from Google, where even if you have a very large model, you don't execute the full model layer after layer, but rather execute small regions of the model at any given time, per input. That's another form of sparsification of your computation, and that is what the brain really does: when you see an input, your brain uses a very small fraction of its total graph to do the computation. And so that's where we're headed. We're not there yet; we don't know how to do it; but this is the goal.

Yannic: And that's the old "you only use 10% of the brain at any given time," right?

Nir: Yeah, right. I mean, really, from energy considerations, it really is like a cell phone. It really isn't this massive monster multi-GPU thing that we use today. And so my expectation is that as we learn more and more about how to design sparse networks, we're going to see them become the standard. They're not the standard right now, because we started the whole journey by applying FLOPs, and applying FLOPs is still the main paradigm. But we will see it appear, both in hardware accelerators and in CPUs: this idea that we can utilize sparsity to get really great performance gains. That's coming.

Yannic: Now, the question is a little bit of a chicken-and-egg problem. Is the brain sparse because it has the limitations of cell-phone power, or does the brain only need cell-phone power because sparsity is such a good architecture? Which causes which?

Nir: So I would say: think about the whole notion of parallelism in the brain. Imagine that you need to do a billion operations per second, and what you have are these very slow chemical devices, neurons, to do it with. You need a billion operations, a billion firings of neurons, in a second; how are you going to do that? What you need is massive parallelism. If you can get massive parallelism, you can get the billion operations. And so our brains are parallel, if you will, because we have this special medium. Now, on a modern multiprocessor you can get a billion or ten billion instructions executed per second sequentially; you don't really need parallelism for it. So what I'm trying to say is: how brains evolved is clearly a consequence of the way they're implemented, but we should not think of implementing this in silicon the same way, because both of these things are Turing complete. You can implement the algorithm; you just need to know what the algorithm is, and then on silicon we'll implement the best version of the brain's algorithm that we can, without having to copy the exact architecture of the brain. Does that make sense? That's what I'm trying to say: let's implement the algorithm, but not necessarily the architecture. So when I say sparsity, I really mean algorithmic sparsity; it doesn't mean that you have to have a very sparse silicon VLSI circuit to do it. That's not the case.
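The arithmetic implicit in that answer, with illustrative numbers; the firing rate and instruction throughput below are rough assumptions, not figures from the conversation:

```python
# Rough arithmetic behind "slow neurons force massive parallelism".
# Both rates below are illustrative assumptions, not measured figures.
target_ops_per_sec = 1e9      # a billion neuron "operations" per second
max_firing_rate_hz = 200.0    # assumed peak firing rate of one neuron
core_instr_per_sec = 1e10     # rough sequential throughput of one CPU core

neurons_in_parallel = target_ops_per_sec / max_firing_rate_hz
print(f"neurons firing in parallel: {neurons_in_parallel:,.0f}")  # 5,000,000

# Silicon meets the same budget sequentially, with room to spare:
print(f"one core covers the budget {core_instr_per_sec / target_ops_per_sec:.0f}x over")
```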

Yannic: That's a good segue. Given that we do have the FLOPs that we don't have in the brain (it naturally is a different system; we have teraflops, even petaflops, in these giant compute clusters): where should we put them, in your opinion? Where should that extra resource that the brain doesn't have go? Should it go into sequentially executing what the brain executes in parallel, or where should we put it?

Nir: First I want to say that we have those FLOPs, but they're costing us a lot. You just have to open the papers to see what the cost of the FLOPs is: it's an enormous energy drain, and it's also an enormous architectural drain on what we're doing. So I would say we want to get rid of the FLOPs, because probably we don't need them, especially as you go from the data center down to the edge. In the data center you can put your Google data warehouse next to a waterfall, next to a source of energy. When you're doing this on your cell phone, or on a tiny device at the edge, every little bit of energy that you waste is critical for you. And so what we really want to do is move away from the FLOPs and move more towards the very energy-efficient way brains work, because adding more FLOPs is a momentary thing for us. So yes, we can do this, but at a very high cost, and no, we don't want to do this forever; we want to find ways to cut the cost and reduce the compute. And there's one other thing I want to say, which is architectural: we generate the FLOPs, right now at least, by running many, many tiny cores, thousands of tiny cores typically, in architectures that require a lot of connections to the memory, this high-bandwidth memory, and that doesn't scale. So in a sense we're trading FLOPs for memory. If you use a CPU today, you can get a terabyte on your desktop; now go get a terabyte on a GPU. And cutting the FLOPs is going to enable us to change the architecture: if we don't need so many FLOPs, then we can actually increase the size of our memory, which will let us hold these giant models that we want, very cheaply, if you will.

Yannic: If I explain a deep neural network to someone, I usually start with a fully connected layer: here is a layer of neurons, here is another layer of neurons, they have their connections, and each connection has a little weight, and so on. You usually describe a dense, fully connected architecture, and that is conceptually easy for people to grasp. Do you have an analogy for sparse architectures? Could you explain, to someone who doesn't know, what a sparse architecture is and how to think about it? What is different?

Nir: Yeah. I don't know what sparsity will look like in the future, but today it looks like this. Imagine that between two layers of the neural network there are cords from one layer to the next, springs attached; these are the connections, the weights we use in the computation. And sparsity means I take scissors and chop, chop, chop until I have five or ten percent of those cords left. And it turns out that those remaining cords, if I do this kind of pruning right, are good enough to capture the accuracy of the model as it was before, because a lot of the connections are not important for this process. That's the big discovery, and modern research in sparsification techniques plays along this kind of game. So you can do the unstructured thing I just described, where you arbitrarily cut in many places based on effectiveness, or you can structurally take things out: in a lot of modern models we're removing pieces that are not necessary, and we do architecture search to find these places. So that's where the whole game of efficiency in neural networks is right now: how do I cut this thing down?
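To make the scissors picture concrete, here is a minimal sketch of unstructured magnitude pruning using PyTorch's built-in pruning utilities; the toy model and the 90% sparsity level are illustrative choices, not Neural Magic's actual pipeline.

```python
# Minimal unstructured magnitude pruning in PyTorch. The toy model and
# the 90% sparsity level are illustrative choices.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))

for module in model:
    if isinstance(module, nn.Linear):
        # "Take scissors and chop": zero out the 90% of weights with the
        # smallest magnitude (L1 criterion) in this layer.
        prune.l1_unstructured(module, name="weight", amount=0.9)
        # Fold the pruning mask into the weight tensor permanently.
        prune.remove(module, "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 sparsity: {sparsity:.0%}")  # ~90%
```

In practice one would fine-tune between pruning steps to recover accuracy; structured variants (e.g. PyTorch's `prune.ln_structured`) remove whole rows or channels instead, matching the "structurally take things out" approach mentioned above.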

Yannic: In the brain there are certainly some systems, like the visual system, that are clearly organized into layers, but there are many other systems that have no resemblance to layers: connections going up and down and left and right, and between the two halves of the brain. Is there a possible future where this becomes a standard architecture for neural networks, where the notion of layers isn't even really a thing anymore? Or is there some fundamental reason to say, no, there are probably always going to be layers, and it's just going to be sparsity between those layers?

Nir: So: we have a full connectome of essentially only a couple of animals, a worm and a fruit fly; that's it. And that said, you don't see a lot of layering there; it looks more like a mess, a very sparse mess. And I wouldn't venture to guess what a cortex looks like; we don't have that yet. We're working very hard on it; these are very hard computational problems. We just want to do a mouse, or some small mammal, and even a mouse is too big for us right now. But I would venture to guess that yes, it's an extremely sparse architecture, and it will not look like layers.

Now, you can impose a layer structure on any graph; it's not that I'm saying there are no layers. Sure, I can take the graph, do a BFS on it, and layer it. But the point is more that, by design, I'm not going to think about it as a sequence of layers, where what changes from step to step is the layer; rather, it will be a combination of paths, different paths, and I'll do different things along different paths. That's the idea. If you think about it, there's recent research from MIT showing that people can detect an image in 0.013 seconds: 13 milliseconds. In 13 milliseconds you can say what an image is. There's barely time for neurons to fire; this thing is extremely parallel, uses very little compute, and gets you an answer. And a large part of that is prediction, because you're already expecting something. So we need to learn how to do those things, and machine learning right now is at a very naive, early stage. Given that, and given the things we're doing right now, it's no surprise that we're doing the brute-force, massive-compute kind of thing; that's always what you do at first, and with time we're going to get better and better at it. That's how I see this progressing.

Yannic: Speaking of becoming better: the flatworm is sparse, the mouse is sparse, the human is certainly sparse, yet our best models today are all big, dense, computation-hungry things. There isn't really a case where pruning and sparsifying is a pure win: I get savings in CPU or GPU, I get savings in storage, but I also get a little bit worse. That's the common thing in pruning today: the pruned model is just a tiny bit worse than the dense model I pruned from. Why do you think that is? Is it just the fact that we prune from a dense model, or what's holding back the sparse models?

Nir: How about if I turn this around for you? You can take BERT-base, which is a common model people use, and you can sparsify it. At Neural Magic we sparsified it 95%: a 95%-sparse BERT-base, a twentieth of the compute, way beyond anything a GPU does even if you run it at full throttle. It cuts the compute so much that there's really almost nothing left to compute; it's just moving data. I'm exaggerating, of course, but it really becomes a data-movement problem rather than a compute problem. And you lose less than one percent accuracy. And I say: okay, great, you've done that and gotten all this speedup, and you say, "Oh, Nir, but you lost less than one percent accuracy." What I say instead is: forget that. Take BERT-large, a much more accurate model, several points more accurate than BERT-base, and prune it so that, with 20x less compute, it's actually faster than BERT-base. So now you have the accuracy and you have great compute, and this is through sparsity. By sparsifying the larger model I deliver you the best of both worlds: little compute and great accuracy. And that's how I want you to think about sparsity: it's a way of enabling us to run much larger, more accurate dense models, but because we sparsified them, we're getting great performance.
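A back-of-the-envelope version of that argument; the 3x cost ratio between BERT-large and BERT-base below is a rough ratio based on published parameter counts, used purely for illustration:

```python
# Relative-compute arithmetic for "prune the larger model instead".
# Ratios are illustrative: BERT-large (~340M params) costs roughly 3x
# BERT-base (~110M params) per inference.
bert_base = 1.0          # dense BERT-base compute, normalized to 1
bert_large = 3.0         # assumed relative cost of dense BERT-large

remaining = 1.0 - 0.95   # 95% sparse -> 5% of the weights survive

print(f"sparse BERT-base : {bert_base * remaining:.2f}x dense base")   # 0.05x, ~20x less
print(f"sparse BERT-large: {bert_large * remaining:.2f}x dense base")  # 0.15x, still ~6x less
# The sparse BERT-large needs far less compute than dense BERT-base
# while retaining (most of) the larger model's accuracy.
```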

Yannic: What's the limit that currently keeps us from skipping that first step? In the pruning setup we always need the dense model first: we train dense, then we go to the sparse model and get huge savings at inference time. What keeps us from just building the sparse model in the first place?

Nir: Great, so this is the lottery-ticket kind of question, if you will. There is research on this: Dan Alistarh, one of our consultants at Neural Magic, works exactly on this kind of stuff. We know how to run a training session where you only need to do a certain fraction of the forward and backward passes dense, and then you can already start pruning while training. So there is research going in that direction. But you're right that right now, at least in standard practice, we mostly take a standard model, and from dense we sparsify. The thing to remember (and now I'm not talking about the research, because the research is going to get there; I don't know, Yannic, to what extent or how fast) is that we will learn how to build sparse architectures: start sparse and continue sparse. Nature does this, so there's no reason we won't be able to. But I want to say something about today's machine learning, where you start dense and then sparsify: that's really not the common paradigm for most users of neural networks. For most users, a model is given to them from a known architecture, and then they transfer-learn onto it. Most people do that rather than train from scratch; they use a model that somebody already worked very hard to build for their specific use case, and they transfer-learn onto it. And this is what you can do with sparsity: you can take a sparse model and sparse-transfer-learn onto it. It's extremely efficient, because you're running at the speed of the sparse network; you can sparse-transfer, and then you don't need all of this starting from dense. We're seeing more and more sparse networks appear in the literature and in database collections of machine learning models, and as we have more and more of these good initial sparse models, people are going to learn to start sparse already. Commercially, I think that's what we're going to see more and more of.
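To make "start pruning while training" concrete, here is a minimal sketch of a gradual sparsity schedule in the style of Zhu & Gupta's "To prune, or not to prune" (2017); the step counts and the 95% target below are illustrative assumptions, not Neural Magic's recipe.

```python
# Gradual pruning schedule in the style of Zhu & Gupta (2017): the
# target sparsity ramps from s_initial to s_final along a cubic curve.
# All hyperparameters here are illustrative.
def target_sparsity(step, start_step=2_000, end_step=20_000,
                    s_initial=0.0, s_final=0.95):
    """Fraction of weights that should be pruned away at `step`."""
    if step <= start_step:
        return s_initial
    if step >= end_step:
        return s_final
    progress = (step - start_step) / (end_step - start_step)
    return s_final + (s_initial - s_final) * (1.0 - progress) ** 3

# During training, re-apply magnitude pruning to the new target every
# few hundred steps, fine-tuning in between:
for step in (0, 2_000, 6_500, 11_000, 20_000):
    print(step, f"{target_sparsity(step):.1%}")
```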

Yannic: You mentioned this a bit already, but why are GPUs so unsuited for sparse models, and what makes CPUs, the way you use them, really suited for sparse models? Or are they even suited, or are you simply seeing that they're better?

Nir: Yeah, look: the GPU architecture is designed around very small cores and tiny caches. You're not going to throw all that away just because you discovered sparsity, so you try to do sparsity while keeping this lockstep execution structure, and it's difficult to run sparse that way; you really need a different kind of setup to get an advantage out of sparsity. Now, it's not that you can't do that: people can design, and have designed, hardware that utilizes sparsity efficiently. Such hardware exists; it's just not GPU-like, not like the accelerators we have today. But all of these accelerators have a different problem, which has to do with memory: because of the way they're designed, they typically have very small memories, so even the ones that can run sparse still have the limitation of their memory size. The reason CPUs are attractive is not so much that you have a natural way of running sparsity because you can run asynchronously with large cores; it's rather that the large cores give you very easy access to very large memory pools. The advantage of having strong, powerful cores is really that I can put several terabytes of memory next to them and run easily. And that's where the big advantage is going to be: as we understand more and more how to build giant models that don't run the whole model layer by layer at a time, the compute will matter less, and the ability to hold the model in one place and run it, rather than breaking it apart across 8 or 16 GPUs, will be your advantage. So I'm saying it's not that you can't build a piece of hardware to run sparse; you can. But you should build it looking like a CPU, in the sense of being able to access a lot of memory, because you're not doing tiny cores. That's my two cents.
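For a feel of why cutting weights cuts raw work on any hardware, here is a small SciPy sketch comparing a dense matrix-vector product with a 95%-sparse CSR version; the sizes and density are arbitrary illustrative choices, and whether the FLOP reduction becomes a wall-clock win depends on kernels and the memory system, which is exactly the point being made above.

```python
# Dense vs. 95%-sparse matrix-vector product: the CSR version touches
# only the surviving ~5% of weights. Sizes and density are illustrative.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 4096
dense_w = rng.standard_normal((n, n)).astype(np.float32)

# Keep ~5% of the entries ("95% sparsity") in compressed sparse row form.
mask = rng.random((n, n)) < 0.05
sparse_w = sparse.csr_matrix(dense_w * mask)

x = rng.standard_normal(n).astype(np.float32)
y_dense = dense_w @ x    # n*n multiply-accumulates
y_sparse = sparse_w @ x  # ~0.05 * n*n multiply-accumulates

print("nonzeros:", sparse_w.nnz, "of", n * n)
# Whether the ~20x FLOP reduction becomes a ~20x wall-clock win depends
# on the kernel and the memory system, which is the hard part.
```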

Yannic: So CPUs are good because they have a fast connection to large memory. But also, over the years, we've put more and more levels of cache onto the CPU. How much do you have to take this into account when you're building? Maybe you can explain a little bit what your company does in terms of software. Do you build compilers, or can I just run TensorFlow or something?

Nir: So let me explain. First of all, the connection between the CPU and the memory is slow; a GPU has faster memory and faster access to it, smaller but faster, while CPU memory is slow but very large. But CPUs have a cache hierarchy, as you said, and if you know how to utilize it, then when you're running in the L1 cache of the CPU, you're running as fast as the GPU. There's nothing the GPU does that the CPU can't do once you're in cache; in fact, CPU caches are much faster than GPU caches, and the performance is better. So the question, and this is what Neural Magic does, is this. We sparsify the model. Machine learning is about meeting a certain latency; because people couldn't meet that latency with a CPU, they added the GPU, and boom, there's machine learning with GPUs, and now you can meet the latency. But there are two ways to deal with latency: one is to add more FLOPs, and the other is to reduce the FLOPs needed. Sparsity, instead of adding more FLOPs in hardware, reduces the number of FLOPs needed in software. But now you have this very sparse model, and because the CPU memory is slow, you hit a bottleneck: if you run it layer after layer, it's very hard to move the data in and out. So what Neural Magic invented is a way of running neural networks depth-wise. We have this technology, which we call tensor columns, where essentially you break the model lengthwise and run each one of these columns in cache. Because you're rarely leaving L2, you actually get great performance. In a sense, we're using the natural ability of CPUs to prefetch things from memory and then run in cache. And because this cache hierarchy on CPUs has evolved over 70, or maybe I'm exaggerating, 60 years of hardware design, it's a very well-understood thing that people know how to optimize; the big chip makers in particular really know how to make these caches work well. So with these really good cache hierarchies, you get great performance by running the model depth-wise. That's Neural Magic: we take the model, sparsify it so it doesn't need the compute, and run it on the CPU, and we get speed because we're running in cache. And if you look at the numbers (some numbers we haven't published yet), we're at the speed of an A100, even faster, in terms of latency: a four-core CPU can, in terms of latency, do what an A100 does on a common model like BERT.

Yannic: Given that it's sparse?

Nir: Yes: by sparsifying it and running it, you can make a four-core CPU do what an A100 does. So it's really now a matter of throughput, and the A100 has a lot of throughput. The question becomes how many cores you want on your CPU to meet the throughput of the A100, and the story is that the big providers are adding more and more cores, so you're going to be able to compete better and better with the GPUs down the road. That's the story of Neural Magic.

Yannic: So the way I can imagine these tensor columns is that, because I execute depth-wise, the values I need for the next step of the computation are the results of the very last step, so they're already in cache; and since everything is sparse, I don't need all of the last layer for the current step, so I have it already.

Nir: Right. And of course, when you think about a neural network, there are overlaps between these columns, and the question is how you deal with the overlaps in a way that doesn't kill your computation. That's the magic of it: there's an algorithm that allows you to do that, and because you can do it, you manage to run this way, you don't hit the memory bottleneck, and boom, you're in business.
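A toy illustration of the depth-wise idea. The real tensor-column algorithm splits the model or tensor itself, which is what creates the overlaps just mentioned; this sketch slices the batch instead, which has no overlaps to manage, but it shows the cache-residency intuition of pushing a small working set through all layers before touching the next one.

```python
# Toy contrast: layer-by-layer vs. depth-wise execution. Conceptual
# sketch only; it slices the batch, not the tensor, so there are no
# overlaps to manage (unlike real tensor columns).
import numpy as np

rng = np.random.default_rng(0)
scale = 1.0 / np.sqrt(512)  # keep activation magnitudes stable
layers = [rng.standard_normal((512, 512)).astype(np.float32) * scale
          for _ in range(8)]
batch = rng.standard_normal((16_384, 512)).astype(np.float32)

def layer_by_layer(x):
    # Materializes every full intermediate activation: each layer output
    # is written out to (and read back from) main memory.
    for w in layers:
        x = np.maximum(x @ w, 0.0)
    return x

def depth_wise(x, rows=256):
    # Push one cache-sized slice through ALL layers before the next one;
    # its activations never have to leave cache.
    out = np.empty_like(x)
    for i in range(0, x.shape[0], rows):
        chunk = x[i:i + rows]
        for w in layers:
            chunk = np.maximum(chunk @ w, 0.0)
        out[i:i + rows] = chunk
    return out

# Same result, very different memory-traffic pattern.
assert np.allclose(layer_by_layer(batch), depth_wise(batch), rtol=1e-3)
```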

Yannic: Yeah. For GPUs, it's almost like GPUs enabled us to do dense models, but I think models have also co-evolved with the GPU: people have started building models to fit GPU architectures better. Something like a transformer, especially, is practically made for GPUs. Is there a type of sparse model, if you could wish for the best possible one (and there are different kinds of sparsity), that is the best type to execute on a CPU, if we want to look forward and especially build architectures for it?

Nir: Yeah, this goes back to one of the first questions you asked: it's about a different structure for the neural network execution. We should forget the synchronous layer-after-layer execution and think about the fact that we can run through a model along multiple paths, with multiple computing units, using the same weight structure of the model, but running at different speeds. By running at different speeds and going through the model along different paths, I can get multiple answers to my questions from the same model, which is what I believe your brain does. So you have this network, but it's not all firing layer after layer; rather, you have these asynchronous flows going through it, even going through matching paths. And CPUs are naturally built for this. Now, I'm not saying somebody can't build a beautiful FPGA that perhaps has a structure closer to what a brain does; maybe so. But there is an advantage to being commodity: the fact that the CPU can do other things is a big win. If I can move everything to software, then I can really get all the advantages of modern software. I'm not dismissing hardware accelerators; I'm saying, great, they have a role and so on and so forth, but they come at a price, and the price for any organization is that, instead of just downloading or shipping your product with the machine learning piece, you have to ask the client to buy a certain accelerator, or run it with a certain accelerator. And this all goes away if we can figure out how to make the CPUs do what the GPUs do: then we're back in this beautiful world of containerized, movable software, and that's really where I would love machine learning to move. And maybe down the road: CPUs have a history of absorbing the key components of any new paradigm that shows up. Virtualization started out with tricks on a CPU, and then later the features were added. Networking had special accelerators, and then they moved into the CPU. I'm expecting that whatever features are necessary for machine learning to run well will move into the CPU, and we won't need an outside accelerator to make this thing work.

Yannic: I think that's, by the way, also the story of GPUs themselves: they were already available as consumer hardware, and then they absorbed machine learning, even though they're not necessarily the best architecture for it. But let's say all this hardware is already out there: very good CPUs sitting next to very good GPUs. How do we get the best out of a machine like that? Right now we've advocated for moving things to the CPU, and we have some advantages there, but what if I have a box with both? Currently I just use my CPU to ship data to the GPU; that's all my CPU does. What kind of architecture would make the best use of a combined system of CPUs and GPUs?

Nir: I think this is really the vision that NVIDIA has, at least today, for their Grace Hopper architecture: it's essentially this. There will be a CPU and a GPU connected to one another; the CPU will do all the memory-intensive things, and the GPU will do all the compute-intensive things. The problem with this kind of model (and it's a beautiful model, by the way; I'm not saying anything bad about it; if you really want to build a GPU world, that's a great thing to do) is that how much you utilize your attached GPU depends on how you write your application, because you need to move the data into the GPU and out, and that's slow. Remember, it's exactly like going to memory: the GPU is not sitting in your caches. So if you're on the CPU, computing something in a cache, and suddenly you get a page fault and have to go get something from memory: that's the latency the GPU introduces here. So if you're going to design for that, you have to create really good software to pipeline things, and this is at the level of the application, so the application programmer has a big programming task. This is a great solution for large-scale, big projects: Facebook is going to get a thousand or ten thousand of these, or Google ten thousand or a hundred thousand, and put them together; then it's worthwhile to write this kind of complex software. But if you're Joe Company and you have your little thing, I don't think you want to be writing that interface. So I'm saying it's great for large things, data-center things, big things, but I'm very doubtful it's going to be effective at the edge, if you can actually utilize the CPU there instead. And I will say one more thing: the modern way hardware designers think about it is built-in modules. If you look at the latest AMD architecture, you essentially have these CCXs: even though the machine has maybe 40 or 50 or 60 cores, they're grouped into groups of eight, and each group of eight is a little piece of the die. I think Intel is shifting in that direction too. So nothing prevents you from making pieces of that die be specialized pieces of hardware, like a GPU; you don't have to have an outside device. So if you ask me what the future is going to look like: probably large machines with multiple dies, and on those dies we might have a GPU die, we might have accelerators. That's more like what I expect to happen, rather than having a massive accelerator on the side.

Yannic: If we hear about sparsity, and about things not being organized in layers and so on, then the topic of graph neural networks naturally feels close by, at least in people's imagination. Do you have anything to say about where current graph neural networks stand with respect to sparsity?

Nir: I would think of graph neural networks as a different kind of thing. I use some graph neural networks in my research, and the idea there is that we can use graph neural networks to solve graph problems that would otherwise be very complicated to solve by brute force. Now, it's not generally applicable; there are quite a few limitations. But as a tool, I would say that rather than thinking about the neural network itself as looking like a graph neural network, I could use graph neural networks to find what we call motifs in the neural network. For example, when we try to understand how brains are structured, we look at the graphs of brains and try to see whether there is a motif that repeats itself in the graph, and using a graph neural network for that is a really nice way to try to find these motifs efficiently, because the problem itself is NP-complete, or actually we don't know, it's graph isomorphism, so clearly we don't know how to do much better than the brute-force algorithm. But the graph neural network can come to our aid here. So I would say that right now I don't really see a neural network design that is specific to this, or a way that it helps there; but in research it definitely works, and we really want to use these networks to help us in research.
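As an illustration of the brute-force problem a learned motif-finder would sidestep, here is a toy subgraph-isomorphism count using networkx; the random graph and the triangle motif are tiny illustrative stand-ins, not connectome data.

```python
# Brute-force motif counting via subgraph isomorphism, the hard problem
# a learned motif-finder would approximate. Toy sizes, illustrative only.
import networkx as nx
from networkx.algorithms import isomorphism

graph = nx.gnp_random_graph(60, 0.05, seed=0, directed=True)  # stand-in "connectome"
motif = nx.DiGraph([(0, 1), (1, 2), (0, 2)])                  # feed-forward triangle

matcher = isomorphism.DiGraphMatcher(graph, motif)
count = sum(1 for _ in matcher.subgraph_isomorphisms_iter())
print("motif matches:", count)
# This enumeration explodes combinatorially at brain scale, which is why
# approximate, learned approaches are attractive.
```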

Yannic: This might be a bit of a tech-bro question, but if I hear that I can do sparse computation and reduce the FLOPs and so on: is there any intrinsic connection between the sparsification of neural networks, the non-layer-wise computation, and blockchain technology, smart contracts, distributed computing, things like this? Have you ever given this any thought, or is that completely off?

Nir: Look, I think nothing is completely off with respect to machine learning, in the sense that I'm sure machine learning will find its way into all of those areas; it's a matter of time. Right now, the work there doesn't need the efficiency that machine learning offers, because machine learning in the end is an optimization. When all these blockchain algorithms become more commonplace, and we need to provide them with things like further security, or analysis, and so on, I think we're going to see applications of machine learning there, and with that, all these things like sparsity will appear as well. But for me, the whole story of sparsity is the story of a phenomenon that is very prevalent in nature, and that, surprisingly or not surprisingly, shows up in machine learning. It strengthens my belief that, even though the exact computations we're doing are not the same as in spiking neural networks and brains, there is a lot of commonality there. The emergence of these similar phenomena, like sparsity and pruning, and the fact that we can get benefits from them, tells me: okay, these are related. I think that's a very important and interesting point to keep in mind.

Yannic: With Neural Magic, who is your main target audience? Who is listening to this that you want to let know: we are exactly for you?

Nir: We span the gamut from the data center to the edge. The exciting new thing at Neural Magic is that we're just now moving from doing this for AMD and Intel architectures to doing it for ARM, which means we're going to span, again, all the way to the very bottom of the food chain, if you will. And I think this is very exciting, because sparsity has a dual role as you go down the food chain: for the large accelerators, whether the memory footprint is large or small is not that important, but as I go down, sparsity gives me two things. Neural Magic gives you speed, but sparsity also makes the model extremely small, so you're getting a small, accurate model running on a very small device, and that's typically an ARM device. So that's the audience I'd like to say to: hey, we're coming, and we're going to deliver the same things we deliver for Intel and AMD; we're now going to deliver it for ARM, at the very edge.

Yannic: When you say edge, do you mean smartphones, security cameras, robots?

Nir: Everything. I mean, it's not like we're going to do everything to start with, but yes, we're aiming in that direction.

Yannic: At the danger that this becomes a marketing-opportunity question: how easy is it to get started with what you're doing? Let's say I've done my TensorFlow tutorials and I know how to build a model and train it. How much does it take for me to transition, or to apply what you're doing?

Nir: You just go to our website and download DeepSparse, our engine, and download our ML tooling, and you immediately pick a sparse model and transfer-learn onto it with our tools. We have recipes: you have a model, you have a recipe, exactly like what you would do if you went to Hugging Face and downloaded a model and a recipe. You do the same kind of thing: you sparse-transfer-learn onto it, and you're in business. So it's not very hard, and we're working on making it even easier; that's one of our goals, to make it really, really easy. The advantage, of course, is that people are already busy quantizing their models to get more performance, so this is like quantizing, in some sense: you do the same kind of thing and get a lot more performance.
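As a flavor of that workflow, here is a minimal sketch using the open-source deepsparse package's Pipeline interface with a SparseZoo model stub; the task name and stub string are illustrative assumptions, so check Neural Magic's current docs for the exact values.

```python
# Minimal sketch of CPU inference with Neural Magic's engine.
# Assumes `pip install deepsparse`; the task name and SparseZoo stub are
# illustrative assumptions; consult the current docs for exact strings.
from deepsparse import Pipeline

pipeline = Pipeline.create(
    task="sentiment_analysis",
    # Hypothetical stub for a 90%-pruned, quantized BERT from SparseZoo:
    model_path=("zoo:nlp/sentiment_analysis/bert-base/pytorch/"
                "huggingface/sst2/pruned90_quant-none"),
)

print(pipeline("Sparse inference on a plain CPU is surprisingly fast."))
```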

Yannic: Is there a type of model where it works particularly well, and a type where it doesn't? I'm thinking of convnets, recurrent networks, autoregressive models, maybe the big language models. What is it best at?

Nir: Right now it's best at BERT and YOLO models. We do computer vision and we do language models, though not the large language models; we haven't done those yet. So for things like the BERTs, the YOLOs, the variants of EfficientNet and all these guys, visual transformers: these are the things we do right now, and all our technology is available for them. I'd love to do the large models: a CPU is a natural environment for running these giant models, these trillion-or-whatever-parameter models that people talk about splitting across 16 GPUs. They fit on your desktop. So clearly a CPU is a natural place to run a very large model. That will be a target, but not right now.

Yannic: Very exciting. Are there any last things you want to get out, maybe about Neural Magic or sparsity in general?

Nir: Well, our whole machine learning software stack is open source, and we'd love people to come in and help us build better sparsity, use sparsity in their models, and tell us what they're doing. We have a community, and we'd love you to join it.

Yannic: Excellent. Nir, thank you so much for being here today; this was very pleasant.

Nir: Thank you very much. Bye-bye.
