Great video running through the promise of sparse-accelerated machine learning. Important capabilities news for interpretability folks, though this is hardly the only way to get it. 30min on 2x speed.
today i'm talking to nir shavit about sparsity. nir has long been active in the field as a professor at technion and mit and has been awarded various prizes such as the gödel prize in 2004 and the dijkstra prize in 2012.
he’s also founder of a company called neural magic that questions one of the fundamental core principles of
current machine learning namely you need gpus neural magic uses various
techniques such as sparsity, which we're going to talk about today, but also other optimization techniques, to make inference on models like bert as fast on a regular cpu as on a gpu. this is
pretty huge and can have vast implications on where you can deploy these models and just how expensive it
gets to roll them out to many people in many places so today we’ll talk about the biological foundations for sparsity
why we shouldn’t attempt to replicate the brain and just what it takes to make something go really fast on just the cpu
i hope you enjoy this conversation. if you do, give nir and his company a follow, and i'll see you around. bye-bye.
hi this video is sponsored by assembly ai assembly ai does real-time and batch
audio transcription of audio and video files powered by the latest advances in
artificial intelligence so if you are a developer or work for a company that’s looking to get more out of your audio or
video data through transcription and audio intelligence assembly ai is the best place to go not only do they have a
user interface where you can just upload stuff but they do have a very powerful api but transcription isn’t all they do
once your audio is transcribed they actually post-process it in many different optional ways so they can do
things like speaker classification or annotations of various forms inside of your audio one feature i’d like to
particularly highlight today are the auto chapters for this simply provide auto chapters equals true on your upload
and assembly ai will after it’s transcribed your audio automatically recognize chunks of audio where you talk
about the same thing give you a summary of those chunks and a neat single description headline of what you were
talking about there this is absolutely ideal for anyone who does any sort of long-form podcasting or videos like mine
where viewers are very very helped by the fact that there are chapter annotations and to have these be done
automatically is just absolutely great so if you’re interested head on over to assembly ai use the link in the
description to let them know that i sent you they are the single api to transcribe and understand audio they do
so in batch and in real time via websocket they accept all kinds of audio and video formats and they do so in over
15 languages give it a try and thank you very much to assembly ai for sponsoring this video and now let’s get into the
video the topic of sparsity is a big thing in neural networks right now mostly because
we have no idea really how to do it and i think that’s exciting times for the
future. so, welcome. what brings you into the sparse world? well, you know, i've been a professor of computer science for many years and i
worked on multi-cores for more than 30 years and
got involved in computational neurobiology in the last 10 years
and one of the things that you really see in the brain is really how sparse
its computation is. it really is very very sparse. and so, you know, looking at neural networks, we see that there's a similar phenomenon to what happens in
brains happening in neural networks right where you can actually reduce the
number of parameters through pruning by huge amounts and preserve the accuracy of the network. and that kind of says, okay, if we really want to
have brain like performance you know sparsity is probably one of the tools
that we want to use to get there. so that's kind of how i got into this.
yeah, and you founded a company that also works in this direction, right? do you want to talk about that a little bit? yes, i founded neural magic.
neural magic was founded because of what we were seeing in my lab. i was busy doing machine learning at a large scale for biology projects, and what we realized was that we could get cpus to run at gpu speeds, like, at the time it was a pascal gpu, and we could make just a regular cpu do what the pascal gpu was doing, through the use of sparsity and other similar techniques. and so we said,
okay well there’s a real commercial value here for people because you don’t need an accelerator you can just do it
on your commodity cpu, and that's neural magic. so what we do is we deliver, through sparsity and similar optimization techniques, gpu performance on cpus. that is quite a
promise. maybe let's first dive a little bit into sparsity itself. what is it about sparsity? you mentioned the brain is very sparse, yet our current, or at least the way we train, neural networks is very dense; we can accelerate dense neural networks much better. what is it about sparsity? is it just the saving of parameters, or is there something more to sparse connections than to dense connections? what do we know? that's a good question. so clearly what we're doing today is not
the sparsity that we will be doing in the future what i mean by that is your brain is sparse way beyond the levels of
what we see in neural networks today. so your typical brain, in terms of compute, right, your cortex is like a cell phone of compute, but the graph is enormous, and you need petabytes to basically hold it. so a cell phone of compute on a petabyte or more of memory, right? but the
accelerators that we build, you know, are designed to deliver petaflops of compute, but on a cell-phone-size memory. their memory is very limited because they use this high-bandwidth memory, so
so in a sense we’re building the opposite of what we want right so if we want to mimic the brain
we should not busy ourselves so much with the amount of compute and rather worry about how it is that we implement
this very large graph it’s a very large graph but it’s extremely sparse that’s the point right
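To put rough numbers on that mismatch, here is a back-of-envelope sketch. The neuron and synapse counts below are commonly quoted ballpark figures, my addition, not numbers from the conversation:

```python
# rough, commonly quoted figures; illustrative only
n_neurons = 86e9        # ~86 billion neurons in a human brain
n_synapses = 1e14       # ~10^14 synaptic connections

bytes_per_weight = 4    # one float32 per connection

# dense adjacency matrix: every neuron pair gets a weight
dense_bytes = n_neurons ** 2 * bytes_per_weight

# sparse (CSR-like) storage: roughly one index plus one weight per edge
sparse_bytes = n_synapses * (4 + bytes_per_weight)

print(f"dense:  {dense_bytes / 1e15:,.0f} PB")   # tens of millions of petabytes
print(f"sparse: {sparse_bytes / 1e15:.1f} PB")   # on the order of a petabyte
```

The sparse representation lands in the "petabytes to hold it" range mentioned above, while a dense representation of the same graph would be hopeless.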
and as you asked, the sparsity is not necessarily the same sparsity that we do today through pruning techniques, but it's a combination of a very sparse architecture together with, you know, sparsity in what we call, in machine learning, the kernels, right? so it's not just that the kernels are sparse, but everything in the
design is very very sparse okay and we don’t know yet how to design
very sparse architectures part of that has to do with the fact that machine
learning grew up in the gpu world where sparsity is not an advantage actually because
you’re doing lockstep computations so you win nothing by being very sparse and
therefore, you know, we don't see those architectural sparsity things yet. but i'm expecting that to happen; this should come along, you know. and even more than that, what i expect is
things are starting to show up, like the pathways models from google and so on, where
even if you have a very large model you don’t execute the full model layer
after layer but rather you execute small regions of the model at any given time
per input that’s another form of sparsification of your computation right
and that is what the brain really does so your brain typically you know when you see an input or so on
uses a very small fraction of its total graph to do the computation
and so that’s where we’re headed we’re not there yet we don’t know how to do it but but this is the goal
and that's the old "you only use 10 percent of the brain
at any given time", right? yeah, that's right. i mean really, from energy considerations, it really is like
a cell phone okay it really isn’t you know this massive monster multi-gpu
thing that we use today and so my expectation is that you know
that as we learn more and more about how to design sparse networks we’re going to
see them become the standard they’re not the standard right now because we started the whole journey right by
applying flops, and still applying flops is the main paradigm,
but we will see it appear both in hardware and accelerators and in cpus
um this idea that we can utilize sparsity you know to get really great
performance gains. yeah, that's coming. now, the question is a little bit the chicken-and-egg problem: is the brain sparse because it has the limitations of cell phone power, or does the brain only need cell phone power because sparsity is such a good architecture? right, like, which causes which? yeah, so i would say that, you know, the whole notion of parallelism in the brain, right, if you think about
it: imagine that you need to do a billion operations per second, okay, and what you have are these very slow chemical devices, neurons, to do that, right. so you need a billion operations, a billion firings of neurons, in a second. how are you going to do that? well, what you need is massive
parallelism right you’ve got to get massive parallelism if you can do the massive parallelism you can get the
billion operations, right. and so our brains are parallel, if you will, because we have this special medium. now, on a modern
multi-processor right you can get a billion or 10 billion instructions executed you know per second
sequentially; you don't really need parallelism for it, right. and so what i'm trying to say is, the whole idea of how brains evolved is clearly because of the way they're implemented, but we should not think of going and implementing this in silicon in the same way, right. because what we really should think about is that both of these things
are turing complete, right. you can implement the algorithm, you just need to know what the algorithm is, and then on silicon we'll implement the best algorithm we can, right, of the brain, but we don't have to have the exact architecture of the brain to do that, okay. does that make sense? that's what i'm trying to say:
you know let’s implement the algorithm but not necessarily the architecture okay so when i say sparsity i really
mean sparsity algorithmic sparsity right and it doesn’t mean that you have to
have a very sparse kind of you know silicon vlsi circuit to do this that’s
not the case. yeah, that's a good segue: given that we do have the flops, right, that we don't have in the brain, naturally it is a
different system. we do have teraflops, petaflops even, in these giant
compute clusters. where should we put them, in your opinion? like, where should that extra resource that the brain doesn't have go? should it go into sequentially executing
what the brain executes in parallel, or, you know, where should we put that? so first i want to say that we have those flops, but they're costing us a lot. you just have to open the papers to see what the cost of the flops is: it's an enormous energy drain, and it's also an enormous architectural drain on what we're doing. and so i would say we want to get rid of the flops, because probably we don't need them, okay. and
especially as you go from the data center down to the edge, your capability of delivering flops comes directly at a cost. in the data center you can put your google data warehouse right next to a waterfall or whatever source of energy you want, right. when you're doing this on your cell phone, or on a tiny device at the edge, every little bit of energy that you waste is critical for you, right. and so what we really want
to do is move away from the flops and move more towards the very energy-efficient way brains work, because this adding of more flops is a momentary thing for us, right. so yes, we can do this, but at a very high cost, and no, we don't want to do this forever. we want to find ways to cut the cost, reduce the compute. and there's one other thing that i want to say, and that is, architecturally,
we generate the flops right now by running many many many tiny cores, thousands of tiny cores typically, right, and architectures like that require a lot of connections to the memory, this high-bandwidth memory, and this thing doesn't scale. so in a sense we're trading flops for memory. if you use the cpu today you
could get a terabyte on your desktop, but try to get a terabyte on a gpu, right. and so
cutting the flops is going to enable us to change the architecture: if we don't need so many flops, then we can actually increase the size of our memory, which will make us able to hold these giant models that we want very cheaply,
if you will if i explain a deep neural network to someone i usually you know
you start with a fully connected layer you say you know here is a layer of neurons and here is a layer of neurons
and they have their connections right and each connection has a little weight and so on you usually describe like a
dense, fully connected architecture, and that is, conceptually, i want to say, easy to grasp for people. do you have an analogy for sparse architectures? could you conceptualize, for someone who doesn't know, what a sparse architecture is and how to think about it? what is different?
yeah, the way we do sparsity today, i don't know what it'll look like in the future, but today sparsity looks like this:
imagine that between the two layers of the neural network there are cords from one layer to the next, right, springs attached, and these are of course the connections, the weights that we're using in the computation, right. and sparsity means i take scissors and i chop, chop, chop, chop, until i have five or ten percent of those cords left, right. and those cords, it turns out, if i do this kind of pruning right, are good enough to capture the accuracy of the model as it was before, because a lot of the
connections are not important for this process that’s kind of the big discovery and
modern research in techniques for sparsification, right, you know,
play along this kind of game so you can do this kind of unstructured thing that i just described where you arbitrarily
cut in many places based on effectiveness, or you can also structurally take things out. so in a lot of the modern models, right, we're removing pieces that are not necessary; we do architecture search to find these places to remove things, right. so that's where the whole game right now of
efficiency in neural networks is: how do i cut this thing down, right.
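The unstructured "take scissors and chop" picture maps directly onto magnitude pruning. Here is a minimal numpy sketch (the layer size and the 90% level are illustrative, not numbers from the conversation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))          # a dense weight matrix

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(W.size * sparsity)           # number of weights to cut
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    mask = np.abs(W) >= thresh           # keep only the largest weights
    return W * mask, mask

W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(f"{mask.mean():.0%} of the weights survive")  # 10%
```

Structured pruning, mentioned next, instead removes whole rows, columns, or blocks so the surviving computation stays hardware-friendly.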
in the brain there are certainly some systems, like the visual system,
where that is clearly organized into layers but there are many other systems that have no
resemblance to layers there are connections going up and down and left and right and you know between the the
halves of the brain, and all. is there a possible future where this could become, like, a standard architecture for neural networks, where the notion of layers and things like this isn't even really a thing anymore? or is there some fundamental way where we say, no, there are probably always going to be layers, but it's just going to be sparsity between those layers? so when we look at, you know... we have a full
connectome of essentially only a couple of animals a worm and a fruit fly that’s it and as
that said don’t see a lot of layering there it looks more like a mess very sparse mess okay
and i wouldn't venture to guess what a cortex looks like, right. we don't have that yet; we're working very hard on it. these are very hard computational problems, to be able to go and get a model. we just want to do a mouse, and even a mouse is too big for us to do right now, like a small mammal,
right. but i would venture to guess that yes, the answer is that it's an extremely sparse architecture, and that it will not look like layers,
okay. you can impose a layer structure on any graph, okay, so it's not that i say there aren't layers; sure, i can take the graph and i can layer it, i can do a bfs on it and layer it. but the point is not so much that; it's more that by design, when
i think about it right i’m not going to think about it as a sequence of layers where the change that i make is the
change in the layer one layer is different from the other but rather it’ll be a combination of thinking about
paths different paths and i’ll do different things along different paths
that's kind of the idea. you know, there's recent research from mit: people can detect an image in 0.013 seconds, in 13 milliseconds, okay. in 13 milliseconds you can detect, you can say, what an image is, okay. there's no time for neurons to fire; this thing is extremely kind of parallel, right, and uses very little compute and gets you an answer, and a
large part of that is prediction because you’re already expecting something so we need to learn how to do
those things and so machine learning right now is in a very naive early stage
and so given that, and given the things that we are doing right now, it's not a surprise that we're doing the brute-force, massive-compute kind of thing. that's always what you do, and with time we're going to get better and better at it, right. so that's kind of how i see this progressing.
speaking of becoming better: if, you know, the flatworm is sparse, the mouse
is sparse the human is certainly sparse yet our best models today are all big
dense, you know, computation-hungry things. every time i prune, i sparsify and so on, i get savings, you know, savings in cpu or gpu, i get savings in my storage, but i also get a little bit worse, right? that's the common thing today in pruning: i get just a tiny bit worse than the dense model i prune from. why do you think that is? is it just the fact that we prune from a dense model, or
what's holding back the sparse models? so how about if i turn this around for you, okay. you can take bert base, which is a common model that people use, okay, and you can sparsify bert base. at neural magic we sparsified it 95 percent, so a 95 percent sparse bert base, one twentieth of the compute, okay, way beyond anything a gpu does even if you run it with full throttle. it's just cutting the compute so much that there's really almost nothing to compute there; it's just moving data, okay. now i'm exaggerating of course, but it really becomes a data movement problem rather than a compute problem, and you lose less than one percent accuracy, okay. and i say, okay, great, so you've done that and you've gotten all this speedup, but you say, oh nir, you lost less than one percent accuracy. but what i say instead is: forget that. take bert large, a much more accurate model, several points more accurate than bert base, okay, and prune it so that, with 20x less compute, it's actually faster than bert base, okay.
and so now you have the accuracy right and you have great compute and
this is through sparsity so by sparsifying the larger model i actually delivered you the best of both worlds
little compute and great accuracy and that’s how i want you to think about sparsity right it’s a way of enabling us
to run much larger, more accurate dense models, but because we sparsified them, we're getting great performance. that's how to think about it.
what's the limit currently that keeps us from... in the pruning setup, we always need the dense model first: we first need the dense model, then we go to the sparse model, and we get huge savings at inference time. what keeps us from just building the sparse model in the first place?
great, so this is kind of the lottery ticket kind of question, if you will. there is research on this; actually dan alistarh, one of our consultants at neural magic, works exactly on this kind of stuff. we know how to run a training session right now for models where you start out and need to do only a certain fraction of the forward passes and backward passes dense, and then immediately you can already start pruning while training. so there is research going in that direction, but you are right that right now, at least in the standard practice, if you look at what's going on out there, you're right: we do most of the time take a standard model, and from dense we sparsify and so on. but
the thing to remember, and now i'm not talking about the research, because the research is going to get there, you know, yannic, i don't know to what extent or how fast this will happen and so on, but we will learn how to build sparse architectures and start sparse and continue. you know, it's really a matter of... nature does this, and so there's no reason why we won't be able to do it. but i want to say something about today's machine learning, where you kind of start with the dense and then you have to sparsify: this is really not the common paradigm
for most users of neural networks. for most users, a model is given to them, you know, from a known architecture, right, and then they transfer-learn onto it. and most people
do that rather than train from scratch they really use the model that somebody already worked very hard to build for
their specific use case and then they transfer learn onto it so this is what you can do with sparsity you can take a
sparse model and sparse transfer learn onto it it’s extremely efficient because you’re running at the speed of the
sparse network right so you can sparse transfer and then you don’t need all of
this kind of starting with dense. and we're seeing more and more sparse networks appear in the literature and in the database collections of
machine learning models and as we have more and more of these initial good sparse models right people are going to
learn to start with the sparse already. that's, commercially, i think, what we're going to see more and more of. yeah, you mentioned this a bit already, but why
are gpus so unsuited for sparse models, and what makes cpus, in the way you do it, really suited for sparse models? or are they even suited, or are you simply, you know, seeing that they're better?
yeah, i mean, look, the gpu architecture is designed for this, you know: very small cores, tiny caches. you're not going to go and throw all that away just because you discovered sparsity. so you're trying to do sparsity while keeping this kind of lockstep execution structure, right, and this is difficult to do sparse. you really need a different kind of setup to get an advantage out of sparsity. now, it's not like you can't do that, right. people can design, and have designed, hardware that utilizes sparsity efficiently, okay. there is such hardware; it's just not gpu-like, it's not like the accelerators that we have today.
but all of these accelerators have a different problem that has to do with the memory, because of the way they're designed, right: they typically have very small memories. so even the ones that can run sparse, right, still have the limitation of their memory size. so the reason that cpus are attractive is not so much that you have a natural way of running sparsity, because you can run asynchronously with large cores, but rather that the large cores enable very easy access to very large memory pools, right. so the advantage of having strong, powerful cores, right, is really that i can put several terabytes of memory next to them,
right and run easily and that’s where the big advantage is going to be as we understand more and more about how to
build giant models that don't run the whole model layer by layer at a time, right, then the compute will be less important, but actually the ability to hold that model in one place and run it, rather than break it apart on 8 or 16 gpus, that's going to be your advantage. so i'm kind of saying it's not so much that you can't build a piece of hardware to run sparsity, you can, right, but you should build it looking like a cpu, in the sense that you can access a lot of memory because you're not doing tiny cores. that's my two cents. so the cpus are good because they have, you know, fast access to large memory, but also over the years we've put more and more levels of cache onto the cpu. how much do you
have to take this into account when you're building? i mean, maybe you can explain a little bit what your
company does in terms of software you build compilers or can i just run tensorflow
or something? yeah, so let me explain. first of all, the connection between the cpu and the memory is slow; the gpu has faster memory and faster access to it, right, smaller but faster. cpu memory is slow but large, very large. but cpus have a cache hierarchy, as you said, and so if you know how to utilize your cache hierarchy, then if you're running in the l1 cache of the cpu, okay, you're running as fast as the gpu; there's nothing the gpu does that the cpu can't do once you're in cache, okay. in fact cpu caches are much faster than gpu caches, and the performance is better. so the question then, right, and this is what neural magic does, is okay: what we do is we sparsify the model. now, you know, machine learning is about: okay, i need to meet a certain latency, and because i couldn't meet that latency with a cpu, we added the gpu and boom, there's machine learning with gpus, now i can meet the latency. but there's two ways to deal with latency: one is to add more flops and the other is to reduce the flops, right. and so sparsity, instead of adding more flops in hardware, reduces the number of flops needed in software. but now that you have this very sparse
model, because the cpu memory is slow, okay, then what happens is you hit a bottleneck: if you do this layer after layer, it's very hard to move the data in and out, okay. so what neural magic invented is a way of running neural networks depth-wise. we have this technology which we call tensor columns, where essentially you can break the model lengthwise and run each one of these kind of columns in cache, okay. and because you're not really leaving l2, you rarely leave l2, you actually get great performance.
so in a sense right what we’re doing is we’re using the natural ability of cpus to prefetch things from memory and then
run in cache. and because this cache hierarchy on cpus has evolved over 70, or maybe i'm exaggerating, 60 years of hardware design, it's a very very well understood thing where people know how to optimize it, right. especially the big chip makers, they really know how to make these caches work really well. and so with these really good cache hierarchies you really get great performance by
running the model depth-wise so that’s neural magic you know we take the model sparsify it now it doesn’t need the
compute, and now we run it on the cpu and get speed because we're running in cache, okay. and if you look at the numbers, i mean, you know, we are, some numbers we haven't published yet, at the speed of an a100, even faster in terms of how long it takes: a four-core cpu can, in terms of latency, do what an a100 does on a common model like bert, okay. so it's really... given that it's sparse? yes, yes, by sparsifying it and running it, you can make a four-core cpu do what an a100 does. so it's really now a matter of throughput, and the a100 has a lot of throughput, okay. so now the question is how many cores do you want on your cpu to meet the throughput of the a100, and again the story is that the big providers are adding more and more and more cores, so you're going to be able to compete better with the gpus down the road. so that's kind of the story of neural magic. yeah.
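As a toy sketch of the column idea (my illustration, not neural magic's actual algorithm: real tensor columns run several layers deep and handle the overlaps between columns), slicing a layer's output into narrow blocks small enough to stay cache-resident produces exactly the same result as layer-at-a-time execution:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(1, 512))
W1 = rng.normal(size=(512, 512))
W2 = rng.normal(size=(512, 512))

def layer_at_a_time(x):
    h = np.maximum(x @ W1, 0)      # materialize the whole hidden layer
    return h @ W2

def column_wise(x, n_blocks=8):
    h = np.maximum(x @ W1, 0)
    width = W2.shape[1] // n_blocks
    # compute one narrow output block at a time; each block's working set
    # is small, which is what lets the real scheme stay in l2 cache
    blocks = [h @ W2[:, i * width:(i + 1) * width] for i in range(n_blocks)]
    return np.concatenate(blocks, axis=1)

print(np.allclose(layer_at_a_time(x), column_wise(x)))  # True
```

The sparsity comes in because, with pruned weights, each column only needs a small fraction of the previous layer's outputs, which is what the next exchange describes.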
so the way i can imagine these tensor columns is that, because i execute depth-wise, the values that i need for the next step in the computation are the results of the very last step and therefore are already going to be in cache. and since everything is sparse, i don't need all of the last layer for the current step, and therefore, you know, i have it already. okay, right, and of course, i mean, you know,
when you think about a neural network, there are overlaps between these columns, and the question is how you deal with the overlaps in a way that doesn't kill your computation. and that's the magic of it: there's an algorithm that allows you to do that, and because you can do it, you manage to run this way and you don't hit this memory bottleneck, and boom, you're in business. yeah, so for
gpus, it's almost like, you know, gpus enabled us to do dense models, but i think also models have almost co-evolved with the gpu, so people have started building models to fit the gpu architecture better, right. especially something like a transformer, that's like made for gpus. is there a type of sparse model, like, if you could wish for the best possible sparse... you know, there's different kinds of sparsity; what is the best type of sparsity to, let's say, execute on a cpu, if we want to look forward and we
want to especially build architectures for that? yeah, this goes back to one of the first questions you asked, right. it's about a different structure for the neural network execution: we should forget the synchronous layer-after-layer execution and think about the fact that we can run through a model, right, in multiple paths with multiple computing units that use the same weight structure and so on of the model, right, but run at
different speeds. and by running at different speeds and going through the model in different paths, i can get
from the same model multiple answers to my questions, which is kind of what i believe your brain does. so what happens there is you have this network, but it's not like it's all firing layer after layer; rather you have these asynchronous flows going through it, right, even going through matching paths, and cpus are naturally
built for this thing. now i'm not saying that somebody can't build a beautiful fpga that will perhaps have a closer structure to what a brain does, maybe so. but there is an advantage to being commodity, okay: the fact that the cpu can do other things is a big win.
if i can move everything to software, that is really the thing; then i can really get all the advantages of modern software. so i'm not dismissing hardware accelerators, i'm saying, great, you know,
they have a role and so on and so forth but they come at a price right and the price for any organization is that you
instead of just downloading or shipping your product with the machine learning piece you have to ask the client to buy
a certain accelerator or run it with a certain accelerator and this all goes away if we can figure out how to make
the cpus do what the gpus do, right, then we're back into this beautiful world of containerized, movable software, and that's really where i would love machine learning to move to. and maybe, down the road, right,
there is this you know you know cpus have have a history of absorbing the key components of any new
paradigm that shows up you know virtualization started out with tricks on a g on a cpu and then later on added
the features networking had special accelerators and then they moved into the cpu and i’m expecting that whatever
features are necessary for machine learning to run well we’ll move into the cpu and we won’t need an outside
accelerator to make this thing work if you could um
so i think that’s by the way also the story of gpus themselves right they were already kind of consumer-ish available
and then they can’t they they absorbed machine learning it’s not necessarily the best architecture for machine
learning but let let’s say let’s say there’s already all this hardware out there right there is very good cpus next
to very good gpus how do we get the best out of a machine like this
right right now we’ve advocated for let’s move things to the cpu right we have some advantages there but what if i
have a box with both like currently i just use my cpu to ship data to the gpu
right that that’s what my cpu does but is there a way where i could potentially
you know what kind of architecture would make the best use out of a combined
system of cpus and gpus no i think this is really the vision that nvidia has at least today for their
Grace Hopper architecture. It's essentially this: there will be a CPU and a GPU connected to one another; the CPU will do all the things that are memory-intensive and the GPU will do all the compute-intensive things. The problem with this kind of model—and it's a beautiful model, by the way, I'm not saying anything bad about it; if you really want to build a GPU world, that's a great thing to do—is that how much you utilize your attached GPU has to do with how you write your application, because you need to move the data into and out of the GPU, and that's slow. Remember, it's exactly like going to memory: the GPU is not sitting in your caches. If you're on the CPU computing something in a cache and suddenly you get a page fault and have to go fetch something from memory—that's the latency the GPU introduces here. So if you're going to design for that, you have to write really good software to pipeline things, and this is at the level of the application; the application programmer has a big programming task. This is a great solution for large-scale, big projects—Facebook is going to get a thousand or ten thousand of these, or Google ten thousand or a hundred thousand, and you put them together; then it's worthwhile to write this kind of complex software. But if you're Joe Company and you have your little thing, I don't think you want to be writing that interface. So I'm saying it's great for large things—data center things, big things—but I'm very doubtful it's going to be effective at the edge, if you can actually utilize the CPU for it. And I will say one more thing, and that is that
the modern way hardware designers think about it is in built-in modules. If you look at AMD's latest architecture, you essentially have these CCXs: even though the machine has maybe 40 or 50 or 60 cores, they're grouped into groups of eight, and each group of eight is a little piece of the die. I think Intel is shifting in that direction too. So nothing prevents you from making pieces of that die be specialized pieces of hardware, like a GPU—you don't have to have an outside device. If you ask me what the future is going to look like, it's probably large machines with multiple dies, and on these dies we might have a GPU die, we might have accelerators. That's more what I expect to happen, rather than having a massive accelerator on the side.

If we hear sparsity, and things not
being in layers, and so on, then naturally the topic of graph neural networks is very close to that, at least in people's imagination. Do you have anything to say about where current graph neural networks stand with respect to sparsity?

Yeah, I would think of graph neural networks as a different kind of thing. I use some graph neural networks in my research, and the idea there is that we can use graph neural networks to solve graph problems that would otherwise be very complicated to solve by brute force. It's not generally applicable—there are quite a few limitations—but as a tool, rather than thinking about the neural network itself as looking like a graph neural network, I could use graph neural networks to find what we call motifs in the neural network. For example, when we try to look at how brains are structured—when we look at the graphs of brains and try to understand whether there is a motif that repeats itself in the graph—using a graph neural network is a really nice way to find those motifs efficiently, because the underlying problem is related to graph isomorphism, whose exact complexity we don't even know, so we clearly don't know how to do the brute-force search well. The graph neural network can come to our aid here. So right now I don't really see a neural network design that is specific to this, or a way that it helps directly, but in research it definitely works, and we really want to use these networks to help us in research.

This might be a bit of a tech-bro
question, but if I hear that I can do sparse computation and reduce the FLOPs and so on—is there any intrinsic connection between the sparsification of neural networks, the non-layer-wise computation, and blockchain technology, smart contracts, distributed computing, things like this? Have you ever given this any thought, or is that completely off?

Look, I think nothing is completely off with respect to machine learning, in the sense that I am sure machine learning will find its way into all of those areas; it's a matter of time. Right now the work there doesn't need the efficiency that machine learning offers, because machine learning in the end is an optimization. But when all these blockchain algorithms become more commonplace, and we need to provide them with things like further security, or analysis, and so on, I think then we're going to see applications of machine learning there—and with that, all these things like sparsity are going to appear. But for me, the whole story of sparsity is the story of a phenomenon that is very prevalent in nature and that—surprisingly or not surprisingly—shows up in machine learning. It strengthens my belief that even though the exact computations we're doing are not the same as spiking neural networks and brains, there is a lot of commonality there. The emergence of these similar phenomena—like sparsity, like pruning—and the fact that we can get benefits from them tells me: okay, these are related. I think that's a very important and interesting point to keep in mind.

With Neural Magic, who is your main
target audience? Who, listening to this, do you want to let know: we are exactly for you?

We span the gamut from the data center to the edge. I would say the exciting new thing at Neural Magic is that we're moving from doing this for AMD and Intel architectures to doing it for ARM, which means we're going to span again all the way to the very bottom of the food chain, if you will. I think this is very exciting, because sparsity has a dual role as you go down the food chain. For the large accelerators, the fact that the memory footprint is large or small is not that important; but as I go down, sparsity gives me two things. It gives me speed—Neural Magic gives you speed—but it also makes the model extremely small. So you're getting a small, accurate model running on a very small device, and this is typically an ARM device. That's the audience I'd like to say: hey, we're coming, and we're going to deliver the same things we can deliver for Intel and AMD—we're now going to deliver it for ARM, at the very edge.

If you say edge, do you mean smartphones? Do you mean security cameras? Do you mean robots?

Everything. I mean everything—not that I'm going to do everything to start with—but yes, we're aiming in that direction.
At the risk of this becoming a marketing-opportunity question: how easy is it to get started with what you're doing? Let's say I've done my TensorFlow tutorials, I know how to build a model and train it and so on—how much does it take for me to transition to, or apply, what you're doing?

You just go to our website, download DeepSparse—our engine—and download our ML tooling, and you can immediately pick a sparse model and transfer-learn onto it with our tools. We have recipes: you have a model, you have a recipe, exactly like what you would do if you went to Hugging Face and downloaded a model. You do the same kind of thing, you sparse-transfer-learn onto it, and you're in business.
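To make the sparse-transfer idea concrete, here is a toy sketch in plain Python—not Neural Magic's actual API, and all names and numbers are invented for illustration—of the one invariant that matters: fine-tuning updates only the surviving weights, so the downloaded sparsity pattern is preserved.

```python
# Toy sketch of sparse transfer learning: fine-tune a pruned model
# while preserving its sparsity mask, so pruned weights stay at zero.

def sparse_transfer_step(weights, mask, grads, lr=0.1):
    """One SGD step that only updates weights where mask == 1."""
    return [w - lr * g if m else 0.0
            for w, m, g in zip(weights, mask, grads)]

# A "downloaded" sparse model: 2 of 5 weights already pruned to zero.
weights = [0.5, 0.0, -0.3, 0.0, 0.8]
mask    = [1,   0,    1,   0,   1]      # fixed sparsity pattern
grads   = [0.2, 0.9,  0.1, 0.7, -0.4]   # gradients from the new task

weights = sparse_transfer_step(weights, mask, grads)
print(weights)  # pruned positions remain exactly 0.0
```

However many fine-tuning steps you take, the masked positions never come back to life, so the inference-time speedup of the sparse model is preserved through transfer learning.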
It's not very hard, and we're working on making it even easier—that's one of our goals, to make it really, really easy to do. And the advantage, of course, is that people are already busy quantizing their models to get more performance, so this is like quantizing, in some sense: you do the same kind of thing and get a lot more performance.

Is there a type of model where it works particularly well, and a type of model where it doesn't? I'm thinking, you know, convnets, recursive networks, autoregressive models, maybe the big language models—what is it best at?
Right now it's best at BERT and YOLO models. We do computer vision and we do language models, but not the large language models—we haven't done those yet. So for those types of things—the BERTs and the YOLOs, the variants of EfficientNet and all these guys, vision transformers—these are the things we do right now, and all our technology is available for them. I'd love to do the large models: a CPU is a natural environment for running these giant models—these trillion-or-whatever-parameter models that people talk about splitting across 16 GPUs fit on your desktop. So clearly a CPU is a natural place to run a very large model. That will be a target, but not right now.

Very exciting. Are there any last things you want to get out, maybe about Neural Magic or sparsity in general?

Well, our whole machine learning software stack is open source, and we'd love people to come in and help us build better sparsity, use sparsity in their models, and tell us about what they're doing. We have a community, and we'd love you to join it.

Excellent. Nir, thank you so much for being here today. This was very pleasant.
The topic of sparsity is a big thing in neural networks right now, mostly because we have no idea really how to do it, and I think that makes for exciting times ahead. So, welcome—what brings you into the sparse world?

I've been a professor of computer science for many years. I worked on multicores for more than 30 years, and I got involved in computational neurobiology in the last 10 years. One of the things you really see in the brain is how sparse its computation is—it really is very, very sparse. And looking at neural networks, we see a similar phenomenon to what happens in brains: you can actually reduce the number of parameters through pruning by huge amounts and preserve the accuracy of the network. That kind of says: okay, if we really want to have brain-like performance, sparsity is probably one of the tools we want to use to get there. So that's how I got into this.

And you founded a company that also works in this direction. Do you want to talk about that a little bit?

Yes, I founded Neural Magic.
Neural Magic was founded because of what we were seeing in my lab. I was busy doing machine learning at a large scale for biology projects, and what we realized was that we could get CPUs to run at GPU speeds—at the time it was a Pascal GPU, and we could make a regular CPU do what the Pascal GPU was doing—through the use of sparsity and other similar techniques. So we said: okay, there's real commercial value here, because you don't need an accelerator—you can just do it on your commodity CPU. And that's Neural Magic: what we do is deliver, through sparsity and similar optimization techniques, GPU performance on CPUs.

That is quite a promise. Maybe let's first dive into sparsity itself. You mentioned the brain is very sparse, yet the way we currently train neural networks is very dense, and we can accelerate the dense networks much better. What is it about sparsity? Is it just the saving of parameters, or is there something more to sparse connections than to dense connections? What do we know?

That's a good question. Clearly, what we're doing today is not
the sparsity we will be doing in the future. What I mean by that is: your brain is sparse way beyond the levels of what we see in neural networks today. In terms of compute, your cortex is like a cell phone of compute, but the graph is enormous—you need petabytes to basically hold it. So: a cell phone of compute on a petabyte or more of memory. But the accelerators we build are designed to deliver petaflops of compute on a cell-phone-sized memory—their memory is very limited, because they use this high-bandwidth memory. So in a sense, we're building the opposite of what we want. If we want to mimic the brain, we should not busy ourselves so much with the amount of compute, and rather worry about how we implement this very large graph. It's a very large graph, but it's extremely sparse—that's the point.
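The "cell phone of compute on a petabyte of memory" picture is easy to sanity-check with rough order-of-magnitude figures (these numbers are illustrative textbook estimates, not from the talk):

```python
# Back-of-the-envelope: storing the brain's graph vs. its compute rate.
# All numbers are rough orders of magnitude.
neurons = 10**11           # ~100 billion neurons
synapses_per_neuron = 10**4  # ~10,000 connections per neuron
bytes_per_synapse = 1      # say one byte per connection weight

graph_bytes = neurons * synapses_per_neuron * bytes_per_synapse
print(graph_bytes / 10**15, "petabytes")  # ~1 PB just to hold the graph

# Meanwhile average firing rates are only ~tens of Hz, so the total
# "operations" per second are tiny compared with a petaflop accelerator.
ops_per_sec = neurons * 10  # ~10 Hz average firing rate (assumed)
print(ops_per_sec / 10**15, "peta-ops/s")  # far below one petaflop
```

So the memory requirement dwarfs the compute requirement—exactly the inverse of a modern accelerator's design point.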
And as you asked: that sparsity is not necessarily the same sparsity we get today through pruning techniques. It's a combination of a very sparse architecture together with sparsity in what we call, in machine learning, the kernels. So it's not just that the kernels are sparse—everything in the design is very, very sparse. And we don't yet know how to design very sparse architectures. Part of that has to do with the fact that machine learning grew up in the GPU world, where sparsity is not an advantage: you're doing lockstep computations, so you win nothing by being very sparse. Therefore we don't see those architectural sparsity things yet—but I'm expecting that to happen; this should come along. Even more than that, things are starting to show up, like the Pathways model from Google and so on, where even if you have a very large model, you don't execute the full model layer after layer; rather, you execute small regions of the model at any given time, per input. That's another form of sparsification of your computation, and it is what the brain really does: when you see an input, your brain typically uses a very small fraction of its total graph to do the computation. So that's where we're headed. We're not there yet, we don't know how to do it, but this is the goal.
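That per-input "small region of the model" idea can be sketched with a toy router that activates only one of several expert sub-networks per input—the flavor of conditional computation used in mixture-of-experts systems. The routing rule here is invented purely for illustration:

```python
# Toy conditional computation: route each input to one small expert,
# so only a fraction of the total parameters is touched per input.

def make_expert(scale):
    # Each "expert" is a tiny sub-network; here just a scaling function.
    return lambda x: [scale * v for v in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]

def route(x):
    """Pick one expert per input (a trivial stand-in for a learned router)."""
    idx = int(sum(x)) % len(experts)
    return experts[idx](x), idx

out, used = route([1.0, 2.0])
print(out, "- used expert", used, "of", len(experts))
# Per input we execute 1 of 4 experts: sparsity in *computation*,
# even though all parameters still exist in memory.
```

The total parameter count can grow with the number of experts while per-input compute stays constant—computation sparsified across the graph, much as described above.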
And that's the old "you only use 10% of the brain at any given time" thing, right?

Yeah, right. I mean, really, from energy considerations, it really is like a cell phone. It isn't this massive, monstrous multi-GPU thing that we use today. So my expectation is that as we learn more and more about how to design sparse networks, we're going to see them become the standard. They're not the standard right now, because we started the whole journey by applying FLOPs, and applying FLOPs is still the main paradigm. But we will see it appear, both in hardware accelerators and in CPUs—this idea that we can utilize sparsity to get really great performance gains. That's coming.

Now, the question is a little bit of a chicken-and-egg problem: is the brain sparse because it has the limitations of cell-phone power, or does the brain only need cell-phone power because sparsity is such a good architecture? Which causes which?
So I would say this about the whole notion of parallelism in the brain. Imagine you need to do a billion operations per second, and what you have to do it with are these very slow chemical devices—neurons. You need a billion operations, a billion firings of neurons, in a second. How are you going to do that? What you need is massive parallelism: if you can get the massive parallelism, you can get the billion operations. And so our brains are parallel, if you will, because we have this particular medium. Now, on a modern multiprocessor you can get a billion, or ten billion, instructions executed per second sequentially—you don't really need parallelism for it. So what I'm trying to say is: how brains evolved is clearly because of the way they're implemented, but we should not go and implement this in silicon the same way, because both of these things are Turing complete. You can implement the algorithm—you just need to know what the algorithm is—and then on silicon we'll implement the best version of that algorithm we can. We don't have to have the exact architecture of the brain to do that. Does that make sense? That's what I'm trying to say:
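The arithmetic behind that parallelism argument is worth making explicit; the firing-rate figure below is a rough assumption, not a number from the talk:

```python
# Why the brain must be massively parallel while a CPU need not be.
target_ops_per_sec = 10**9  # a billion "operations" per second
neuron_rate_hz = 100        # a neuron can fire ~100 times/second (rough)

# Slow units => you need ~10 million of them working in parallel.
units_needed = target_ops_per_sec // neuron_rate_hz
print(units_needed)

# A single modern core retires on the order of a billion instructions
# per second, so it can hit the same target entirely sequentially.
cpu_instructions_per_sec = 10**9
print(cpu_instructions_per_sec >= target_ops_per_sec)
```

Same algorithmic throughput, two radically different implementations—which is exactly the "implement the algorithm, not the architecture" point.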
let's implement the algorithm, but not necessarily the architecture. So when I say sparsity, I really mean algorithmic sparsity. It doesn't mean you have to have a very sparse silicon VLSI circuit to do this—that's not the case.

That's a good segue. Given that we do have the FLOPs that the brain doesn't have—it is naturally a different system, and we have teraflops, petaflops even, in these giant compute clusters—where should we put them, in your opinion? Where should that extra resource that the brain doesn't have go? Should it go into sequentially executing what the brain executes in parallel, or where should we put it?

First I want to say that we have those FLOPs, but they're costing us a lot—you just have to open the papers to see what the cost of the FLOPs is. It's an enormous energy drain, and it's also an enormous architectural drain on what we're doing. So I would say we want to get rid of the FLOPs, because we probably don't need them—especially as you go from the data center down to the edge. In the data center, you can put your Google data warehouse next to a waterfall, or whatever source of energy you want. When you're doing this on your cell phone, or on a tiny device at the edge, every little bit of energy you waste is critical. So what we really want to do is move away from the FLOPs and towards the very energy-efficient way brains work, because adding more FLOPs is a momentary thing for us. Yes, we can do it, but at a very high cost; and no, we don't want to do it forever. We want to find ways to cut the cost and reduce the compute.
And there's one other thing I want to say, and that is architectural: right now we generate the FLOPs by running many, many tiny cores—thousands of tiny cores, typically—in architectures that require a lot of connections to memory, this high-bandwidth memory, and that doesn't scale. So in a sense, we're trading FLOPs for memory. With a CPU today you can get a terabyte on your desktop—now try to get a terabyte on a GPU. Cutting the FLOPs is going to enable us to change the architecture: if we don't need so many FLOPs, we can actually increase the size of our memory, which will let us hold these giant models that we want, very cheaply, if you will.

If I explain a deep neural network to someone, I usually, you know,
start with a fully connected layer: here is a layer of neurons, and here is a layer of neurons, and they have their connections, and each connection has a little weight, and so on. You usually describe a dense, fully connected architecture, and that is conceptually easy for people to grasp. Do you have an analogy for sparse architectures? Could you conceptualize, for someone who doesn't know, what a sparse architecture is and how to think about it—what is different?

Yeah. I don't know what it will look like in the future, but today sparsity looks like this: imagine that between the two layers of the neural network there are cords from one layer to the next—springs attached—and these are, of course, the connections, the weights we use in the computation. Sparsity means I take scissors and I chop, chop, chop, chop until I have five or ten percent of those cords left. And those cords, it turns out—if I do this kind of pruning right—are good enough to capture the accuracy of the model as it was before, because a lot of the connections are not important for the process. That's the big discovery, and modern research in sparsification techniques plays along this kind of game. You can do the unstructured thing I just described, where you arbitrarily cut in many places based on effectiveness, or you can structurally take things out: in a lot of modern models we remove pieces that are not necessary, and we do architecture search to find those places. That's the whole game right now of efficiency in neural networks: the game of how do I cut this thing down.
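The scissors analogy is literally what unstructured magnitude pruning does: rank weights by absolute value and zero out all but the largest few percent. A minimal sketch with toy weights (plain Python, not any particular library's implementation):

```python
# Unstructured magnitude pruning: keep only the largest-magnitude weights.

def prune(weights, keep_fraction):
    """Zero every weight below the magnitude threshold for keep_fraction."""
    k = max(1, int(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.02, -0.9, 0.05, 0.4, -0.01, 0.3, -0.07, 0.6, 0.08, -0.5]
sparse_w = prune(w, keep_fraction=0.3)  # chop 70% of the "cords"
print(sparse_w)
print(sum(1 for x in sparse_w if x != 0.0), "of", len(w), "weights survive")
```

In practice this is applied layer by layer (or globally) to a trained model, often followed by fine-tuning to recover the last bit of accuracy.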
In the brain, there are certainly some systems, like the visual system, that are clearly organized into layers, but there are many other systems that have no resemblance to layers: connections going up and down and left and right, and between the halves of the brain, and so on. Is there a possible future where this could become a standard architecture for neural networks—where the notion of layers and things like that isn't even really a thing anymore? Or is there some fundamental reason to say: no, there are probably always going to be layers, and it's just going to be sparsity between the layers?

So, we have a full connectome of essentially only a couple of animals—a worm and a fruit fly, that's it—and as I said, you don't see a lot of layering there. It looks more like a mess; a very sparse mess. I wouldn't venture to guess what a cortex looks like—we don't have that yet. We're working very hard on it; these are very hard computational problems. We just want to do a mouse, and even a mouse—a small mammal—is too big for us to do right now. But I would venture to guess that yes, the answer is that it's an extremely sparse architecture, and it will not look like layers. Now, you can impose a layer structure on any graph, so it's not so much that I'm saying there aren't layers—sure, I can take the graph and layer it; I can do a BFS on it and layer it. The point is rather that, by design, I'm not going to think about it as a sequence of layers, where the change I make is the change from one layer to the next. Rather, it will be a combination of thinking about paths—different paths—and I'll do different things along different paths.
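The aside that any graph can be given a layer structure via BFS is easy to make concrete—a node's "layer" is just its breadth-first distance from the input node. A small sketch on a toy graph:

```python
# Impose a "layer" structure on an arbitrary graph with breadth-first
# search: a node's layer is its BFS distance from the input node.
from collections import deque

def bfs_layers(adj, source):
    layer = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in layer:
                layer[v] = layer[u] + 1
                queue.append(v)
    return layer

# A small, decidedly non-layered graph with skip connections.
adj = {"in": ["a", "b"], "a": ["c", "out"], "b": ["c"], "c": ["out"], "out": []}
print(bfs_layers(adj, "in"))  # every node still gets a well-defined layer
```

So "layered" is a property you can always impose after the fact; the interesting design question is whether you think in layers, or in paths through the graph.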
That's kind of the idea. You know, there's recent research from MIT showing that people can detect an image in 0.013 seconds—13 milliseconds. In 13 milliseconds you can say what an image is. There's barely time for neurons to fire: this thing is extremely parallel, uses very little compute, and gets you an answer. And a large part of that is prediction, because you're already expecting something. We need to learn how to do those things, and machine learning right now is at a very naive, early stage. So given that, and given the things we're doing right now, it's no surprise that we're doing the brute-force, massive-compute kind of thing—that's always what you do first—and with time we're going to get better and better at it. That's how I see this progressing.

Speaking of becoming better: the flatworm is sparse, the mouse is sparse, the human is certainly sparse, yet our best models today are all big, dense, computation-hungry things. And it's not really the case that sparsity comes for free: every time I prune and sparsify, I get savings in CPU or GPU, I get savings in storage, but I also get a little bit worse. That's the common thing in pruning today—I end up just a tiny bit worse than the dense model I pruned from. Why do you think that is? Is it just the fact that we prune from a dense model, or what's holding the sparse models back?

So, how about if I turn this around?
let me turn this around for you okay you can take you can take bert uh
bass which is a common model that people uh use okay and you can sparsify bird
base um at neural magic we sparsified 95 so a
95 sparse bird base 1 over 20th of the compute okay way beyond anything a gpu
does even if you run it with full throttle okay it’s just cutting the compute so much that there’s really
almost nothing to compute there it’s just moving data okay no i’m exaggerating of course but but you know
it’s really becomes a data movement problem rather than a compute problem when you when you and and you lose
one percent less than one percent accuracy okay um and i say okay great so you’ve done
that you know and you’ve gotten all this uh speed up but you’ve lost you say oh near but you lost less one percent
accuracy but what i say instead is forget that take bert large a much more accurate
model several points more accurate than bird-based okay and prune it so that it
actually right with 20x less compute it’s actually faster than birthdays okay
and so now you have the accuracy right and you have great compute and
this is through sparsity so by sparsifying the larger model i actually delivered you the best of both worlds
little compute and great accuracy and that’s how i want you to think about sparsity right it’s a way of enabling us
to run much larger more accurate dense models but because we specified
them we are you know we’re getting great performance that’s how to think about it
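As a concrete, if simplified, illustration of what "95% sparse" means, here is a minimal unstructured magnitude-pruning sketch in NumPy. This is the generic textbook technique, not Neural Magic's actual recipe (which is training-aware), and the 768x768 matrix is just illustrative of a BERT-base-sized layer:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.95):
    """Zero out the smallest-magnitude entries until roughly `sparsity`
    fraction of the weights are zero (unstructured pruning)."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = np.abs(weights) >= threshold    # keep only the large weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768))            # one BERT-base-like weight matrix
w_sparse, mask = magnitude_prune(w, 0.95)

# A dense matmul touches every entry; a sparse kernel only touches the
# ~5% survivors, which is where the "1/20th of the compute" comes from.
print(f"nonzero fraction: {mask.mean():.3f}")   # ~0.050
```

In practice the pruned model is then fine-tuned so the surviving weights recover the lost accuracy, which is how the sub-1% accuracy drop is achieved.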
What's the limit currently? In the pruning setup we first need the dense model, then we go to the sparse model and get huge savings at inference time. What keeps us from just building the sparse model in the first place?

Great, so this is kind of the lottery-ticket question, if you will. There is research on this; Dan Alistarh, one of our consultants at Neural Magic, works exactly on this kind of stuff. We know how to run a training session where you start out doing only a certain fraction of the forward and backward passes dense, and then you can already start pruning while training. So there is research going in that direction. But you're right that right now, if you look at what's standardly done out there, most of the time we take a standard model and sparsify it from dense.

But the thing to remember, and now I'm not talking about the research, because the research is going to get there (I don't know, Yannic, to what extent or how fast), is that we will learn how to build sparse architectures, start sparse and continue sparse. It's really a matter of time; nature does this, so there's no reason we won't be able to do it. But I want to say something about today's machine learning, where you start with the dense model and then sparsify: that is really not the common paradigm for most users of neural networks. For most users, a model from a known architecture is given to them, and they transfer-learn onto it. Most people do that rather than train from scratch; they take a model that somebody already worked very hard to build and transfer-learn onto it for their specific use case. And this is what you can do with sparsity: you can take a sparse model and sparse-transfer-learn onto it. It's extremely efficient, because you're running at the speed of the sparse network. So you can sparse-transfer, and then you don't need all of this starting with dense. We're seeing more and more sparse networks appear in the literature and in the collections of machine learning models, and as we have more and more of these good initial sparse models, people are going to learn to start with the sparse model already. Commercially, I think that's what we're going to see more and more of.

You mentioned this a bit already, but why are GPUs so unsuited for sparse models, and what makes CPUs, in the way you do it, really suited for sparse models? Or are they even suited, or are you simply seeing that they're better?
Yeah, look, the GPU architecture is designed for lockstep execution: very small cores, tiny caches. You're not going to throw all that away just because you discovered sparsity, so you try to do sparsity while keeping this lockstep execution structure, and that is difficult to do sparse. You really need a different kind of setup to get an advantage out of sparsity. Now, it's not like you can't do that; people can design, and have designed, hardware that utilizes sparsity efficiently. Such hardware exists; it's just not GPU-like, not like the accelerators we have today.

But all of these accelerators share a different problem that has to do with memory. Because of the way they're designed, they typically have very small memories, so even the ones that can run sparse still have the limitation of their memory size. The reason CPUs are attractive is not so much that you have a natural way of running sparsity because you can run asynchronously with large cores, but rather that the large cores give you very easy access to very large memory pools. The advantage of having strong, powerful cores is really that I can put several terabytes of memory next to them and run easily. And that's where the big advantage is going to be: as we understand more and more how to build giant models that don't run the whole model layer by layer at a time, the compute will matter less, and the ability to hold the model in one place and run it, rather than breaking it apart across 8 or 16 GPUs, is going to be your advantage. So I'm saying it's not that you can't build a piece of hardware to run sparse; you can, but you should build it looking like a CPU, in the sense that it can access a lot of memory and isn't made of tiny cores. That's my two cents.

So CPUs are good because they have a fast connection to large memory, but also, over the years, we've put more and more levels of cache onto the CPU. How much do you have to take this into account? And maybe you can explain a little bit what your company does in terms of software: do you build compilers, or can I just run TensorFlow
or something?

Yeah, so let me explain. First of all, the connection between the CPU and the memory is slow; the GPU has faster memory and faster access to it, smaller but faster, while CPU memory is slow but very large. But CPUs have a cache hierarchy, as you said, and if you know how to utilize it, if you're running in the L1 cache of the CPU, you're running as fast as the GPU. There's nothing the GPU does that the CPU can't do once you're in cache; in fact, CPU caches are much faster than GPU caches, and the performance is better.

So the question, and this is what Neural Magic does, is the following. We sparsify the model. Machine learning today works like this: I need to meet a certain latency, and because I couldn't meet it with a CPU, we added a GPU, and boom, there's machine learning with GPUs. But there are two ways to deal with latency: one is to add more FLOPs, and the other is to reduce the FLOPs needed. Sparsity, instead of adding more FLOPs in hardware, reduces the number of FLOPs needed in software. But now that you have this very sparse model, because CPU memory is slow, you hit a bottleneck: if you run layer after layer, it's very hard to move the data in and out. So what Neural Magic invented is a way of running neural networks depth-wise. We have this technology, which we call tensor columns, where essentially you break the model lengthwise and run each one of these columns in cache. Because you're rarely leaving L2, you actually get great performance. In a sense, we're using the natural ability of CPUs to prefetch things from memory and then run in cache. And because this cache hierarchy on CPUs has evolved over 70, or maybe I'm exaggerating, 60 years of hardware design, it's a very well understood thing that people know how to optimize; especially the big chip makers really know how to make these caches work well. With these really good cache hierarchies, you get great performance by running the model depth-wise.

So that's Neural Magic: we take the model, sparsify it so it no longer needs the compute, and run it on the CPU, getting speed because we're running in cache. And if you look at the numbers (some numbers we haven't published yet), we're at the speed of an A100, even faster: in terms of latency, a four-core CPU can do what an A100 does on a common model like BERT.

Given that it's sparse?

Yes, yes: by sparsifying it and running it this way, you can make a four-core CPU do what an A100 does. So it's really now a matter of throughput, and the A100 has a lot of throughput. The question becomes how many cores you want on your CPU to meet the throughput of the A100, and the story is that the big providers are adding more and more cores, so you're going to be able to compete better and better with the GPUs down the road. That's kind of the story of Neural Magic.
So the way I can imagine these tensor columns is that, because I execute depth-wise, the values I need for the next step in the computation are the results of the very last step and are therefore already in cache. And since everything is sparse, I don't need all of the last layer for the current step, so I already have what I need.

Right, and of course, when you think about a neural network, there are overlaps between these columns, and the question is how you deal with the overlaps in a way that doesn't kill your computation. That's the magic of it: there's an algorithm that allows you to do that, and because you can, you manage to run this way, you don't hit the memory bottleneck, and boom, you're in business.

So for GPUs, it's almost like GPUs enable us to do dense models, but I think models have also co-evolved with the GPU: people have started building models to fit GPU architectures better. Especially something like a transformer is practically made for GPUs. Is there a type of sparse model, if you could wish for the best possible one (there are different kinds of sparsity), what is the best type of sparsity to, let's say, execute on a CPU, if we look forward and we
want to build architectures especially for that?

Yeah, this goes back to one of the first questions you asked: it's about a different structure for the neural network execution. We should forget the synchronous layer-after-layer execution and think about the fact that we can run through a model in multiple paths, with multiple computing units that use the same weight structure of the model but run at different speeds. By running at different speeds and going through the model along different paths, I can get multiple answers to my questions from the same model, which is kind of what I believe your brain does. What happens there is that you have this network, but it's not all firing layer after layer; rather, you have these asynchronous flows going through it, even along different paths, and CPUs are naturally built for this kind of thing.

Now, I'm not saying somebody can't build a beautiful FPGA that has a structure closer to what a brain does. Maybe so. But there is an advantage to being commodity: the fact that the CPU can do other things is a big win. If I can move everything to software, then I can really get all the advantages of modern software. I'm not dismissing hardware accelerators; I'm saying, great, they have a role and so on, but they come at a price, and the price for any organization is that instead of just downloading or shipping your product with the machine learning piece, you have to ask the client to buy, or run it with, a certain accelerator. This all goes away if we can figure out how to make CPUs do what GPUs do; then we're back in this beautiful world of containerized, movable software, and that's really where I would love machine learning to move. And maybe down the road it will: CPUs have a history of absorbing the key components of any new paradigm that shows up. Virtualization started out with tricks on the CPU, and later the features were added in; networking had special accelerators, and then they moved into the CPU. I'm expecting that whatever features are necessary for machine learning to run well will move into the CPU, and we won't need an outside accelerator to make this thing work.

I think that's, by the way, also the story of GPUs themselves: they were already consumer-ish, widely available hardware, and then they absorbed machine learning, even though theirs is not necessarily the best architecture for it. But let's say there's already all this hardware out there; there are very good CPUs sitting next to very good GPUs. How do we get the best out of a machine like this?
Right now we've advocated for moving things to the CPU, and we have some advantages there, but what if I have a box with both? Currently I just use my CPU to ship data to the GPU; that's all my CPU does. Is there a way, what kind of architecture would make the best use of a combined system of CPUs and GPUs?

I think this is really the vision NVIDIA has, at least today, with their Grace Hopper architecture. It's essentially this: there will be a CPU and a GPU connected to one another, and the CPU will do all the things that are memory-intense while the GPU does all the compute-intense things. The problem with this kind of model (it's a beautiful model, by the way, I'm not saying anything bad about it; if you really want to build a GPU world, it's a great thing to do) is that how much you utilize your attached GPU depends on how you write your application, because you need to move the data into and out of the GPU, and that's slow. It's exactly like going to memory: the GPU is not sitting in your caches. If you're on the CPU computing something in a cache and suddenly you get a page fault and have to fetch something from memory, that's the latency the GPU introduces here. So if you're going to design for that, you have to create really good software to pipeline things, and this is at the level of the application, so the application programmer has a big programming task. That's a great solution for large-scale projects: Facebook is going to get a thousand or ten thousand of these, or Google ten thousand or a hundred thousand, and put them together; then it's worthwhile to write this kind of complex software. But if you're Joe Company and you have your little thing, I don't think you want to be writing that interface. So I'm saying it's great for large things, data-center things, but I'm very doubtful it's going to be effective at the edge, if you can actually utilize the CPU there instead.

And I'll say one more thing: the modern way hardware designers think is in built-in modules. If you look at AMD's latest architecture, you essentially have these CCXs: even though the machine has maybe 40 or 50 or 60 cores, they're grouped into groups of eight, and each group of eight is a little piece of the die. I think Intel is shifting in that direction too. So nothing prevents you from making pieces of that die be specialized pieces of hardware, like a GPU; you don't have to have an outside device. If you ask me what the future is going to look like, it's probably large machines with multiple dies, and on these dies we might have a GPU die, we might have accelerators. That's more what I expect to happen, rather than having a massive accelerator on the side.

If we hear sparsity, and things not
being in layers, and so on, then naturally the topic of graph neural networks seems very close to that, at least in the imagination of people. Do you have anything to say about where current graph neural networks stand with respect to sparsity?

Yeah, I would think of graph neural networks as a different kind of thing. I use some graph neural networks in my research, and the idea there is that we can use graph neural networks to solve graph problems that would otherwise be very complicated to solve by brute force. It's not generally applicable, and there are quite a few limitations, but as a tool I would say this: rather than thinking of the neural network itself as looking like a graph neural network, I can use graph neural networks to find what we call motifs in a neural network. For example, when we look at how brains are structured, when we look at the graphs of brains and try to understand whether there is a motif that repeats itself in the graph, then using a graph neural network is a really nice way to find these motifs efficiently. The underlying problem (it's graph isomorphism, really; we actually don't know its exact complexity) is one where we clearly don't know how to do the brute-force algorithm well, but a graph neural network can come to our aid. So right now I don't really see a neural network design that is specific to that, or a way that it helps directly, but in research it definitely works, and we really want to use these networks to help us in research.

This might be a bit of a tech-bro
question, but if I hear that I can do sparse computation, reduce the FLOPs and so on, is there any intrinsic connection between the sparsification of neural networks, the non-layer-wise computation, and blockchain technology, smart contracts, distributed computing, things like this? Have you ever given this any thought, or is that completely off?

Yeah, look, I think nothing is completely off with respect to machine learning, in the sense that I'm sure machine learning will find its way into all of those areas; it's a matter of time. Right now the work there doesn't need the efficiency that machine learning offers, because machine learning in the end is an optimization. But when all these blockchain algorithms become more commonplace and we need to provide them with things like further security, or analysis, and so on, I think we're going to see applications of machine learning there, and with that, all these things like sparsity will appear as well. But for me, the whole story of sparsity is the story of a phenomenon that is very prevalent in nature and that, you can say surprisingly or not surprisingly, shows up in machine learning. It strengthens my belief that even though the exact computations we're doing are not the same as in spiking neural networks and brains, there is a lot of commonality there. The emergence of these similar phenomena, like sparsity and pruning, and the fact that we can get benefits from them, tells me these are related. I think that's a very interesting point to keep in mind.

With Neural Magic, who is your main target audience? Who, listening to this, do you want to let know: we are exactly for you?

We span the gamut from the data center to the edge. I would like to say that we are just now moving into providing the same properties for Arm architectures, so the exciting new thing at Neural Magic is that we're going from doing this for AMD and Intel architectures to doing it for Arm, which means we're going to span all the way to the very bottom of the food chain, if you will. And I think this is very exciting, because sparsity has a dual role as you go down the food chain: for the large accelerators, whether the memory footprint is large or small is not that important, but as I go down, sparsity gives me two things. Speed, Neural Magic gives you speed, but it also makes the model extremely small, so you're getting a small, accurate model running on a very small device, and this is typically an Arm device. So that's the audience I'd like to say, hey, we're coming, and we're going to deliver the same things we deliver for Intel and AMD, now for Arm, at the very edge.

If you say edge, do you mean smartphones? Security cameras? Robots?

Everything. I mean, not that I'm going to do everything to start with, but yes, we're aiming in that direction.
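The "extremely small model" half of that dual role is easy to estimate with back-of-the-envelope arithmetic. The storage format assumed below (one fp32 value plus one 16-bit index per surviving weight) is a generic compressed-sparse assumption, not Neural Magic's actual format, and the 110M figure is the commonly cited BERT-base parameter count:

```python
def model_footprint_mb(n_params, sparsity,
                       bytes_per_weight=4, bytes_per_index=2):
    """Rough dense vs. compressed-sparse storage estimate: the sparse
    format keeps only the nonzero values plus an index for each."""
    nnz = n_params * (1.0 - sparsity)
    dense_mb = n_params * bytes_per_weight / 1e6
    sparse_mb = nnz * (bytes_per_weight + bytes_per_index) / 1e6
    return dense_mb, sparse_mb

# BERT-base, ~110M parameters, at the 95% sparsity discussed above:
dense_mb, sparse_mb = model_footprint_mb(110e6, 0.95)
print(f"dense: {dense_mb:.0f} MB, 95% sparse: {sparse_mb:.0f} MB")
# dense: 440 MB, 95% sparse: 33 MB
```

With quantization to 8-bit values layered on top, the sparse figure shrinks further still, which is what makes such models plausible on small Arm devices.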
And with the danger that this is going to become a marketing-opportunity question: how easy is it to get started with what you're doing? Let's say I've done my TensorFlow tutorials, I know how to build a model and train it and so on. How much does it take for me to transition, or to apply what you're doing?

Yeah, so you just go to our website, download DeepSparse, our engine, download our ML tooling, and you can immediately pick a sparse model and transfer-learn onto it with our tools. We have recipes: you have a model, you have a recipe, exactly what you would do if you went to Hugging Face and downloaded a model. You do the same kind of thing, sparse-transfer-learn onto it, and you're in business. So it's not very hard, and we're working on making it even easier; that's one of our goals, to make this really, really easy to do. The advantage, of course, is that people are already busy quantizing their models to get more performance, so this is like quantizing, in some sense: you do the same kind of thing and get a lot more performance.

Is there a type of model where it works particularly well, and a type where it doesn't? I'm thinking convnets, recursive networks, autoregressive models, maybe the big language models. What is it best at?
Yeah, so right now it's best at BERT and YOLO models. We do computer vision and we do the language models, but not the large language models; we haven't done those yet. So for things like the BERTs, the YOLOs, the variants of EfficientNet and all these, visual transformers, these are the things we do right now, and all our technology is available for those. I'd love to do the large models: a CPU is a natural environment for running these giant models, the trillion-or-whatever-parameter models that people talk about splitting across 16 GPUs. They fit on your desktop. So clearly a CPU is a natural place to run a very large model. That will be a target, but not right now.

Very exciting. Are there any last things you want to get out, maybe about Neural Magic or sparsity in general?

Well, our whole machine learning software stack is open source, and we'd love people to come in and help us build better sparsity, use sparsity in their models, and tell us what they're doing. We have a community, and we'd love you to join it.

Excellent. Nir, thank you so much for being here today. This was very pleasant.

Thank you very much. Bye-bye.

[Music]