I wonder at which point we’ll start seeing LLM-on-a-chip.
One big reason for current ML/AI systems' inefficiency is simply abstraction-layering overhead, the price we pay for flexibility. We currently run hardware that runs binary calculations that run software that runs other software that runs yet more software (many, many layers here: OS, drivers, programming-language stacks, NN frameworks, etc.) that finally runs the part we're actually interested in: a bunch of matrix calculations representing the neural network. If we collapsed all the unnecessary layers in between, burning the calculations directly into hardware, running a particular model should be extremely fast and cheap.
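For concreteness, here is roughly what that "part we're actually interested in" looks like, as a toy NumPy sketch (made-up sizes, one feed-forward block, purely illustrative):

```python
import numpy as np

# Toy sketch: the core of an LLM forward pass is chained matrix multiplications.
# Sizes below are made up for illustration only.
d_model, d_ff, seq = 1024, 4096, 1                     # one decode step, one token
x  = np.random.randn(seq, d_model).astype(np.float32)  # current activation
W1 = np.random.randn(d_model, d_ff).astype(np.float32)
W2 = np.random.randn(d_ff, d_model).astype(np.float32)

h = np.maximum(x @ W1, 0.0)   # matmul + nonlinearity
y = h @ W2                    # matmul back down to d_model
print(y.shape)                # (1, 1024)
```

Everything above this in the stack exists to schedule and feed operations like these.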
That’s Etched.ai.
Also, arguably, Groq’s dataflow architecture is more or less this, and for an on-chip NN there wouldn’t be too much difference with Cerebras either. The problem is that the control flow you refer to has largely already been removed from GPU/TPU-style accelerators, so the gains may not be that great. (As I understand it, the Etched.ai performance argument is not really about ‘removing unnecessary layers’, because layers like the OS and programming-language stack are already irrelevant, so much as it is about running the models in an entirely different way that batches the necessary layers more efficiently.)
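To gesture at why batching is the lever that matters: at decode time the accelerator has to stream the full weights from memory on every step, and that read is paid once per batch rather than once per request. A back-of-the-envelope sketch in Python, with assumed round numbers (hypothetical ~8B-parameter dense model, roughly H100-class memory bandwidth), not a measurement:

```python
params        = 8e9      # hypothetical ~8B-parameter dense model (assumption)
bytes_per_p   = 2        # fp16/bf16 weights
hbm_bandwidth = 3.35e12  # roughly H100-class HBM bandwidth, bytes/s (assumption)

weight_bytes = params * bytes_per_p
step_time    = weight_bytes / hbm_bandwidth      # time to stream the weights once

for batch in (1, 8, 64):
    # one token per request per weight read, ignoring KV-cache and compute
    tokens_per_s = batch / step_time
    print(f"batch={batch:3d}: ~{tokens_per_s:,.0f} tok/s (bandwidth-bound upper bound)")
```

The same arithmetic applies whether the model is baked into silicon or not, which is why removing software layers alone buys little.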
Modern datacenter GPUs are basically the optimal compromise between this and retaining enough generality to work with different architectures, training procedures, etc. The benefits of locking in a specific model at the hardware level would be extremely marginal compared to the downsides.