In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative “less inductive bias is better”, popularized by transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, being completely free of any inductive bias. (2) MLPs have long been the main protagonist of the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects.
We show that the performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet), highlighting that lack of inductive bias can indeed be compensated. We observe that MLPs faithfully mimic the behaviour of their modern counterparts, though some components of the learning setting surprisingly exhibit stronger or unexpected behaviours.
Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
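To make concrete just how little structure such a model assumes, here is a minimal sketch of an inductive-bias-free MLP classifier forward pass in plain NumPy (the layer widths and input resolution are illustrative, not the paper's actual configuration):

```python
import numpy as np

def mlp_forward(image, weights, biases):
    """Forward pass of a plain MLP classifier.

    The image is flattened into one long vector, erasing all spatial
    structure -- no convolutions, no attention, no patches.
    """
    x = image.reshape(-1)  # e.g. 64*64*3 -> a 12288-dim vector
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)  # ReLU hidden layers
    return x @ weights[-1] + biases[-1]  # class logits

# Toy instantiation: 64x64 RGB input, two hidden layers, 10 classes.
rng = np.random.default_rng(0)
dims = [64 * 64 * 3, 1024, 1024, 10]
weights = [rng.normal(0, 0.01, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
logits = mlp_forward(rng.normal(size=(64, 64, 3)), weights, biases)
print(logits.shape)  # (10,)
```

Everything the model knows about images it must learn from data, which is exactly what makes it a clean test of the "less inductive bias" hypothesis.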
Much like the others, MLPs just need more regularization and/or data because of their greater capacity/less hardwired inductive bias. (By the way, note that they investigate Chinchilla scaling to see if 1:1 is optimal like for Transformers; it is not, and MLPs require more data per parameter because they are more powerful. Good news for MLPs...)
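The Chinchilla point can be illustrated with a toy compute-optimal allocation under a standard two-term scaling law, L = E + A/N^α + B/D^β with compute C ≈ 6·N·D (the exponents below are made-up illustrative values, not the paper's fits):

```python
def optimal_allocation(C, alpha, beta):
    """Compute-optimal params/data split for L = E + A/N^alpha + B/D^beta
    under the constraint C ~ 6*N*D (Chinchilla-style accounting).
    The optimum scales as N* ~ C^(beta/(alpha+beta)), D* ~ C^(alpha/(alpha+beta)).
    Coefficients A, B only shift the constant, not these growth exponents.
    """
    a_N = beta / (alpha + beta)   # growth exponent of optimal N
    a_D = alpha / (alpha + beta)  # growth exponent of optimal D
    return (C / 6) ** a_N, (C / 6) ** a_D

# Symmetric, Transformer-like case: alpha == beta means the optimal
# tokens-per-parameter ratio stays constant as compute grows ("1:1").
N1, D1 = optimal_allocation(1e20, 0.34, 0.34)
N2, D2 = optimal_allocation(1e22, 0.34, 0.34)
print(D1 / N1, D2 / N2)  # equal ratios

# Asymmetric case (hypothetical numbers): a data-hungrier model, where
# the optimal tokens-per-parameter ratio itself grows with compute.
N1, D1 = optimal_allocation(1e20, 0.40, 0.28)
N2, D2 = optimal_allocation(1e22, 0.40, 0.28)
print(D1 / N1, D2 / N2)  # ratio increases with C
```

The qualitative takeaway: "needs more data per parameter" corresponds to asymmetric exponents, and the imbalance widens as compute grows.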
I disagree with that model...or at least think you’re taking it too far, misapplying heuristics you developed from seeing eg AlphaZero beating hardcoded chess heuristics.
Oh no, there are many more examples than ‘just’ ViTs and MuZero (and machine translation). ‘The curves cross’ goes back at least to Highleyman in the 1960s with nearest-neighbors image classification, and you could add the original Breiman tabular ML movement in the 1990s which focused on beating logistic regression. (I would also highlight my prediction that despite an almost unquestioned consensus post-diffusion models asserting GANs had failed due to intrinsic and possibly unfixable flaws, that they would do fine if scaled up.)
and “textbooks are all you need” is then evidence for more structural complexity being helpful.
This isn’t a point in favor of architectures and ‘structural complexity’, but data. The question, of course, is to what extent it’s the right data and is not building in covertly (the way so many past failed methods do) expert hand-engineered priors/inductive-biases which earns buzz-clicks-cites right now but will ultimately hold back models compared to just ‘learn all the things!’ approaches to base models. It’s never really worked before, but people greet every new ‘our model beats GPT-3 with 1/100th the parameters’ research paper as the messiah—not that you remember any of those from 2020, or 2021, and 2022-2023 aren’t looking so hot either...
(It’s worth remembering that people, and especially academics, want it to be true that small cheap complicated models+data+algorithms, which require loving human expertise and curation and lots of published papers while costing little $, can solve all their problems; there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$. It is an acquired taste, to be sure. Like being kicked in the nuts. “Please sir, Mr Bitter Lesson, may I have another lesson that renders 2 more years of my life a complete waste of time?”)
do you consider such a data → evaluation → training structure “inductive bias obviated by scaling”?
Yes. Consider how an arch like MuZero turns compute into performance: it does so via data. Or consider what you spend extra compute on, as Bottou has been pointing out since the mid-2000s: you spend it on processing more data. SGD go brrrr.
Karpathy seemed to understand the issues.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person! He didn’t think that before I did, so your example shows the opposite of what you take it to mean.
Maybe I missed an earlier paper, but the Performer design is completely different.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated, and I expect the rest to also show bad scaling because none of them dominate Performer.
I do think byte encoding works well with MegaByte-type designs—as people have already found in testing.
Byte encoding works well with non-MegaByte-type designs too...
Speaking of MLPs and how supposedly they don’t scale & silently fail whenever anyone tries, this just dropped on Arxiv: “Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023 (excerpts)
The “baseline” comparisons in that paper are pretty funny; they messed with existing models in such a dumb way that I think it was intentional. Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage grows at resolutions above 64x64. Do you read the stuff you link to?
there are very few people who genuinely prefer large expensive simple models with hilariously-dumb architectures but which cost $$$
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person. Then there are some people who just don’t want to think about design anymore and consider that an excuse to stop thinking about it...which I guess I’m fine with. Yeah, you do that.
You linked a 2023 tweet by Karpathy who has been reading me rant regularly about BPEs for at least 3 years now (note the first reply, BTW), and I’ve even discussed it with him in person!
I see; somehow I thought he was smarter than that. The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
My point there is that all of the linear-attention Transformer variants seem to fail in some way and show bad scaling, like Performer does when explicitly investigated
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
Anyway, a normal EfficientNet is ~2x as FLOP-efficient as their highlighted MLP example, it can reach higher accuracy more easily, and its advantage grows at resolutions above 64x64. Do you read the stuff you link to?
You are again missing the point: as I already explained, we expect the constant to be worse. (If the constant was better in addition to a similar exponent, we would not be debating ‘will MLPs someday replace Transformers’, because it would already be happening.)
I will also point out that there is a very big difference between ‘it literally doesn’t work, everyone’s tried it and it doesn’t work, it doesn’t work so badly they won’t even publish anything I can cite to prove that MLPs are idiotic and have zero chance of ever going anywhere and and and -’ and ‘OK fine yeah they work and scale smoothly in the usual way but look this paper’s basic implementation is slower & lower accuracy at small scale than good baselines from NNs with more hardwired inductive biases zomg do you even read the stuff you link’. (That whoosh you hear is the much-abused goalposts moving at the speed of submitting a comment.)
Anyway, I’ve fleshed out one idea I have of how a scaled-up MLP architecture could surpass Transformers.
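The constant-vs-exponent point is easy to make concrete with toy power laws (all coefficients below are made up for illustration; they are not fitted to any real architecture):

```python
def loss(C, a, b):
    """Toy scaling law: loss = a * C**(-b)."""
    return a * C ** (-b)

# Architecture A: better constant (cheaper at small scale), worse exponent.
# Architecture B: worse constant, better exponent.
# A worse constant only shifts where the curves cross; with a better
# (or even merely similar) exponent, the crossing still happens.
for C in [1e3, 1e6, 1e9, 1e12]:
    la = loss(C, a=10.0, b=0.30)
    lb = loss(C, a=40.0, b=0.40)
    winner = "B wins" if lb < la else "A wins"
    print(f"C={C:.0e}  A={la:.4f}  B={lb:.4f}  {winner}")
```

With these particular numbers the crossover sits at C = 4^10 ≈ 1e6: below it the better-constant architecture dominates, above it the better-exponent one does, which is why small-scale head-to-head comparisons settle nothing about the asymptotics.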
Maybe that’s true among engineers, but corporate directors and other people with lots of money definitely prefer that. And I also seem to see a decent number of people on twitter who feel some sense of superiority from “accepting the Bitter Lesson” harder than the next person.
You are seeing a selected sample. Don’t look at your social media feed, look at what people do. Look at the allocations of supercomputer time. Look at a single day of Arxiv papers (not the AKs, the actual site). There is not much research that takes it seriously or does research that will have long-term value, like running scaling law sweeps; it’s almost all stuff which relies on a fundamental denial of scaling & Bitter-Lesson-like logic, full of special-case tweaking or proving irrelevancies or creating a complicated architecture which saves a small constant-factor etc. Look at the Best Paper awards. Look at the surveys where the overwhelming majority deny scaling works even in principle (while also, interestingly, believing they are the brave minority, which they are not).
I see; somehow I thought he was smarter than that.
(Wow, way to just throw Karpathy under the bus.)
The few people at big AI labs I knew all understood those issues, but there’s obvious selection bias there.
I wonder if anyone other than Karpathy has been reading my observations about BPEs all these years...
That is fundamentally different from what MegaByte does. It’s not a linear-attention scheme. I’m not sure why you see this as relevant.
It’s not strictly linear in memory, but it’s doing the same thing of reducing the quadratic by some degree using a local->global hierarchy, as so many attention variants have done, and could reduce it to effectively linear (since stuff like n log n is effectively just n). So MegaByte is stuck: either it continues being mostly vanilla over fixed-sized patches and eventually that sub-quadratic becomes expensive as the global sequence length & model must scale up, or it reduces the growth to linear by expanding the hierarchy further (adding in additional layers of ‘patches’) and just becomes another variant of local->global like Swin Transformer etc. (I thought they should’ve done a better job contextualizing it, in addition to benchmarking on LRA.) Maybe MegaByte somehow threads the needle just right with its particular set of tweaks, and will be the attention variant which finally cracks the nut, but I’m not too optimistic. /shrug. What would get my attention is wins on LRA benchmarking or exponents in scaling law sweeps like Tay. We’ll see.
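The cost argument above can be sketched with back-of-the-envelope attention-pair counts (patch size and sequence lengths below are illustrative, not MegaByte's actual configuration):

```python
def vanilla_pairs(n):
    """Attention pairs for full self-attention over n tokens: n^2."""
    return n * n

def patch_hierarchy_pairs(n, p):
    """One level of local->global hierarchy with patch size p:
    a global model attends over the n/p patches, plus a local model
    attends within each of the n/p patches of p tokens each.
    Note: with p fixed, the global (n/p)^2 term is still quadratic
    in n -- just with a 1/p^2 constant -- so the savings erode as
    the sequence length keeps growing.
    """
    num_patches = n // p
    global_cost = num_patches ** 2     # (n/p)^2
    local_cost = num_patches * p * p   # n * p
    return global_cost + local_cost

for n in [2**14, 2**17, 2**20]:
    v = vanilla_pairs(n)
    h = patch_hierarchy_pairs(n, p=256)
    print(f"n={n:>8}  vanilla={v:.2e}  hierarchy={h:.2e}  ratio={v / h:.1f}x")
```

The dilemma in the text falls out of the arithmetic: keep one fixed level of patches and the (n/p)^2 term eventually dominates again, or keep adding hierarchy levels to flatten the growth and become yet another local->global attention variant.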
Have fun playing with MLPs. I’m not trying to stop you, I’m just stating my position for audience members who understand it.