Takeaways From 3 Years Working In Machine Learning


Disclaimer: I feel so-so about posting this on LW, but given how many people here work in ML or adjacent fields, I might as well.

After 3 years working on automl at Mindsdb, I quit; I won’t be working on anything ml-related in a professional capacity, at least for a while.

I am in the uncanny valley of finally understanding what I don’t know, and maybe even having a feel for what nobody knows.

I might as well write a summary of my takeaways, in hopes of them being useful to someone, or maybe just as a ritualistic gesture of moving on, a conceptual-space funeral.

Please don’t take this as an “expert summary”; there are tens of thousands of people better suited than me for that. Instead, think of it as a piece of outsider art, takeaways from someone who took an unusually deep dive into the zeitgeist without becoming part of “the community”.

i—What Is The Role Of Research?

The role of research into machine learning is, half a decade after first pondering it, still a mystery to me.

Most scientific sub-fields (the real ones) can claim they are a dual process of theory-building and data-gathering. Even if the standard model ends up being replaced, the data leading to its replacement will include the same set of experimental observations that led to it being built. Even if the way we conceptualize the structure of DNA and the idea of genetics changes to better fit many-tissued eukaryotes, the observations these new concepts will have to fit shall remain unchanged.

In more theoretical areas, such as those revolving around the terms mathematics and computer science, the gains are at an almost purely-conceptual level, but they hold fast because the concepts are so foundational they seem unlikely to be replaced. Some alien species living in deep space that we could barely understand as “life”, or some space-faring empire or languageless primate tribes that we could barely classify as “humans”, might build impressive conceptual machinery without the Church-Turing thesis or even without concepts such as function, number, discreteness or continuity. But as long as those things hold true, a lot of old ideas in mathematics and computer science remain at most inefficiently phrased, but broadly valid.

But machine learning is at an odd intersection. Even if we grant it has as much theoretical rigor as, say, physics, it still lacks the timeless experimental observations. This is not for lack of any virtue in participants, it’s just that the object of study is a moving target, rather than a concrete reality. At most one can make timeless claims like:

Given <such and such hardware> it’s possible to get <such and such accuracy> on <such and such split> of ImageNet

Given <X> dataset and using whole-dataset leave-one-out CV we can achieve x/y/z state of the art values for certain accuracy functions.

Given <Y> environment that allows a programmable agent to interact with it, we can reach x/y/z points on a cycles/punishment/time/observations & reward/knowledge/understanding matrix.
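To make the flavor of the second claim concrete, here is a minimal sketch of producing such a number; the dataset and model are stand-ins for whatever a paper would actually use:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Whole-dataset leave-one-out accuracy for one (placeholder) model on one
# (placeholder) dataset; the resulting number is a claim of the second kind.
X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```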

But these are not the kind of claims theory revolves around; the easily verifiable and timeless takeaways from machine learning are theoretically uninteresting. At most, one can say they place a lower bound on the performance of digital hardware on certain human-specific tasks.

Nor is the field backed by strong theoretical guarantees, at least not without reaching “increased milk production if you assume cows are perfect spheres in a vacuum” levels of absurdity. There are “small” theoretical guarantees that can help us experiment more broadly (e.g. proving differentiability within certain bounds). There are “impressive” theoretical guarantees that hold under idealized conditions, which can at most point towards potential paths to experiment.


You might assume that this leads to a very fuzzy field, where anything goes, most data is fake, and most papers are published to gain wins in a broader academic citation game.

In truth, however, machine learning is probably my go-to example for the only realm where academia functions correctly. The norm is to ship papers with code and data, as well as sufficient methodological rigor to facilitate easy replication. The claims made by a paper are usually fairly easy to verify and reach independently with the tools the authors provide. While there are always exceptions, and, as a rule, the exceptions garner a lot of attention for good and bad reasons, they are few.

Even more importantly, replication in machine learning, unlike in the realms of mathematics and computer science, is not left to a dozen experts with a lot of time to spare. If you want to validate the claim of a state-of-the-art NLP paper, you just need CS101 knowledge to do it. You may not understand why it works, but that knowledge is sufficient to validate that it does indeed work. The same can’t be said about fields such as mathematics, where the validity of a modern “theorem” (sometimes coming in at the length of a hefty book) is decided by consensus among a handful of experts, rather than by an automatic theorem prover.
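As a hedged illustration of how low that bar can be: pulling a publicly released checkpoint and sanity-checking it takes a few lines; the default checkpoint below stands in for whatever model a given paper actually ships.

```python
from transformers import pipeline

# Download a publicly released checkpoint and check it does what's claimed
# on a couple of toy examples; a real replication would loop over a full test set.
classifier = pipeline("sentiment-analysis")
predictions = classifier(["A genuinely great movie.", "A complete waste of time."])
print(predictions)  # expect a POSITIVE then a NEGATIVE label, each with a score
```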

Indeed, a lot of practical work in machine learning has been done by outsiders, with unrelated, little or no academic background. Most of the work is done by people who followed a life-long academic + career track, but this is the norm in any field. However, no barriers are placed before those who haven’t: you can submit your papers to conferences, you can debate ideas in online communities, companies can legally hire you to work on their projects. We should not forget this isn’t the norm in almost any other field, where credentialism is queen and her parliament a gerontocracy.

ii—What Software Does That Hardware Can’t

Machine learning research, even in a broad sense that includes working on linear algebra libraries or clustering tools, has always seemed prone to relearning the bitter lesson.

Ask something like “how much of the advances in ML since the 70s is software and how much is hardware?” and you find a surprisingly small number of papers trying to look into anything like this question. I get why: everybody knows that trying to train a giant decision tree at scale with some sort of T5-equivalent compute will not match T5 on any language task, and the task of adapting the algorithm is non-trivial. Whether this yields something 1 million times worse or just 100 times worse, though, is an interesting question, but it’s one of many important questions that no research incentive gradient will generate answers to.

But the question itself is kind of moot; of course, some research is always needed to have software that best fits the advances in hardware. A better question might be “were the amount of research to decrease by 1000x, would that have any consequences on performance or on the breadth of tasks that can be tackled?”, and its reverse.

It certainly seems like the major modern techniques in machine learning can be described as slight re-conceptualizations of ideas 20, 30, or even 50 years old.


My intuition tells me that most advances to come will be a result of hardware, and that the high-impact activities for those of us who prefer focusing on software are:

a) Figuring out what hardware advances will allow one to do in 2-4 years, seeking funding, and building a company around those opportunities ready to pounce, then executing (or filling in the VC-side of that couple)

b) Trying to look for paradigm shifts, fixing the bottlenecks wasting 99% of resources that are so deeply ingrained into our thinking we can’t even see them as such. Figuring out that 99% of the available compute capacity is actually not being used and could be with this one cool trick. Coming up with a simple abstraction that single-handedly performs close to SOTA on any available task.

iii—Automatic Machine Learning

A large part of my work was around automatic machine learning, so I’m obviously biased about thinking it’s a big deal, but I do think it’s a big deal.

The vast majority of people working in ML, be it in academia or industry, seem to be working on tasks that are on the edge of automation.

Many papers seem to boil down to:

  1. Take architecture

  2. Make some small modifications to hyperparameters, a term I use very broadly to include things like the shape of a network, the boosting algorithm and type of weak model, the activation function formula, and so on.

  3. Run benchmarks on a few datasets

  4. Prove some theoretical guarantees which usually won’t apply to any real-world training scenario and could have been demonstrated empirically (e.g. differentiability, uniform convergence when data fits some idealized distribution)

  5. Add sufficient filler

Many data scientists and ml engineers seem to do something akin to:

  1. Try a few easy-to-use models, the kind that need < 100 lines of code to use if set up properly

  2. Tweak hyperparameters until accuracy on the test data is good enough that even if it’s somewhat worse in reality it’ll still be worth it to deploy to production

  3. Wrap it in some sort of API for the backend to consume

  4. If needed, write a cron job to retrain it every now and then

  5. Compose and present a very long PowerPoint to 5 people with capital “P” and maybe some “C” in their job titles so that they feel comfortable allowing you to deploy


Broadly speaking, that seems like the kind of thing that’s really easy to automate. Actually, that seems like the kind of thing that’s already been mostly automated, and although open-source has lagged behind “as a service” solutions, it’s certainly there now with things like hugging-face and h2oautoml, not to mention “runners-up” like the library I worked on or the 100 others playing in roughly the same mud-puddle.
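For a sense of what that automation looks like in practice, here is a minimal sketch using h2oautoml’s Python API; the file paths and target column name are placeholders, not anything from a real project:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")   # placeholder paths
test = h2o.import_file("test.csv")

target = "label"                       # placeholder target column
features = [c for c in train.columns if c != target]

# Let the library search over model families and hyperparameters (steps 1-2 above).
aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=1)
aml.train(x=features, y=target, training_frame=train)

print(aml.leaderboard)                 # ranked candidates; pick the leader
predictions = aml.leader.predict(test)
```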

Maybe the job of those people didn’t have anything to do with steps 1-4 in the first place; maybe fuzzy theory and PowerPoints presented by a soothing human voice saying “we understand what’s going on” are the important bit.

Maybe that’s a reductionist way to think about it, and to be fair if automl is so good, why aren’t more Kaggle leaderboards dominated by people using such packages? Speaking of which...

iv—Benchmarks And Competition

Compared to publication volume, interest in benchmarking and competition seems scarce. The number of ml-related papers appearing on arXiv every day is well above 100; the number that makes it onto a papers-with-code leaderboard is much smaller.

I think the way most researchers would justify this is that they aren’t trying to “compete” in anything with their technique; they aren’t trying to raise some sort of accuracy score, but rather to provide interesting theory-backed directions for designing and thinking about models.

This in itself is fine, except for the fact that I can’t point to a single ground-breaking technique that was based solely on mathematical guarantees and took years to mature. It usually seems that if something “works” and gets wide adoption, it’s because it improves results right away.

The few breakthroughs that took years (decades?) to materialize are broad-reaching architecture ideas, but those are few and far between.

So we are left in a situation where people postulate “generic” techniques in papers, such as an optimizer or a boosting method, hand-wave a few formulas, then end up benchmarking them with little-to-no variation in the model-dependent variables (e.g. the architecture an optimizer is optimizing over, the estimator a boosting algorithm is using, etc.) on fewer than a dozen datasets. This is not a critique of third-rate papers either; off the top of my head I can name things like LightGBM, rectified Adam and lookahead, which for me and many other people have been game-changers that have proven their worth on many real-world problems, despite being originally presented in papers with barely any experiments.
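As a hedged illustration of the “drop-in game-changer” point, this is roughly what adopting LightGBM looks like on a toy tabular problem; the dataset and hyperparameters are arbitrary stand-ins:

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Arbitrary toy dataset standing in for a real tabular problem.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The "drop-in" part: the estimator is the only thing that changes versus
# whatever gradient-boosting model was in use before.
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```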

The problem boils down to:

Lack of “generic” benchmark suites. The OpenML automl benchmark comes closest here, but its problem focus is very narrow and it’s limited to testing end-to-end automl solutions. An ideal generic benchmark would have a many-to-many architecture-to-dataset mapping which permits certain components to be swapped in order to evaluate new techniques as part of a greater whole (a rough sketch of what I mean follows this list). At some point, I had a fantasy of building up the Mindsdb benchmark suite into this, but I doubt anyone really wants this kind of solution; the incentive structure isn’t there.

Lack of competition. I mean, there are websites like Kaggle and a dozen industry-specific clones, but their formats place a lot of demands upon users and the competitions award miserly prizes.

Potentially compounding the two things above is the fact that most “worthwhile” problems in machine learning are hard to even benchmark or compete on. Tasks such as translation, text embedding generation, and self-driving are various levels of hard-to-impossible to judge objectively in a way that detects anything but the largest of jumps.
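For the benchmark-suite point above, a hypothetical sketch of the many-to-many harness I have in mind; none of these names correspond to an existing library, and the fit/evaluate interfaces are assumptions:

```python
from itertools import product

def run_grid(model_builders, optimizer_builders, dataset_loaders, evaluate):
    """Hypothetical harness: score every (model, optimizer, dataset) combination,
    so a new technique can be judged as one swappable component among many."""
    results = {}
    for build_model, build_opt, load_data in product(
        model_builders, optimizer_builders, dataset_loaders
    ):
        train, test = load_data()
        model = build_model()
        model.fit(train, optimizer=build_opt())  # assumed interface, for illustration only
        key = (build_model.__name__, build_opt.__name__, load_data.__name__)
        results[key] = evaluate(model, test)
    return results
```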

This loops back onto the view that, if you’re working in the world of code, you might as well focus on paradigm shifts or productionization, unless you are explicitly being paid to do otherwise.

v—Using The State Of The Art

Another interesting thing to ask here is whether or not ‘state of the art’ has been reached in certain areas. It seems like the interest around “classical” supervised problems, the kind that can be contained within a .csv and evaluated with an accuracy function that goes from 0 to 1, has shifted a lot into areas of speed, mathematical guarantees, and “explainability”.

That being said, it’s impossible to answer this question in a definitive fashion.

What I can answer in an almost definitive fashion is that everyone, from academic researchers to industry researchers to your run-of-the-mill mid-size-company data scientists, seems completely uninterested in the idea of getting state-of-the-art results.

I was (un)lucky enough to speak with dozens of organizations about their ml practices, and my impression is that most organizations and projects that “want to use ml” are not even at the “data-driven” stage where ml makes sense. They want to start with the conclusions, produce predictions out of thin air, and are flabbergasted by the very idea of evaluating whether an algorithm is good enough to use in production.

Almost 30 years ago, a doctor published a paper where he reinvented 6th-grade math while trying to figure out how to evaluate his diabetic patients. This happened at a time when personal computers were a thing, and one would have assumed life-and-death decisions requiring standardized calculations would surely be done by a machine, not by someone who hadn’t even heard of calculus. Even worse, given that we are talking about someone who actually published papers, this is where the leading 0.1% vanguard of the field was at the time; God only knows what the rest were doing.

I get the feeling that whatever the broader phenomenon described by this problem is, it’s still the root cause of the lack of impact ML has upon other fields. It’s very unlikely that improving accuracy on certain problems by a rounding error, or more theoretical guarantees about whether an algorithm is within 0.3% of the perfect solution, or parameter pruning for easier explainability, will help.

It seems to me that most of the problem with people using classical ml is the people, and no amount of research will fix that.

vi—Alien Brains

On the other hand, ml applied to “nonclassical” problems is showing more and more promise, be those problems language or driving. Within that realm, it seems that distinctions between supervised and unsupervised drop away, and trying to explain an algorithm as simple math, rather than as a generative system self-selecting based on constraints, is becoming about as silly as doing this with a brain.

The grandiose view here is that enough clout in enough directions leads to heavily specialized approaches and algorithms that can be transferred between researchers (either as files or as services) to serve as the basis for building higher-level functionality; that in 30 years machine learning will seem more like tending to gigantic and impossibly-complex alien brains that control vast swaths of society than like doing linear algebra.

The skeptical view is that one could write GPT-{x} from scratch, without even autograd, given one or two thousand lines of code, and that most of the effort is in the details: in code to parallelize and easily experiment, in tricks to bring about single-digit-percent improvements. Furthermore, the lack of ability to evaluate complex tasks objectively will sooner or later lead to stagnation, almost by definition.
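To give a feel for the skeptic’s “small core” claim, here is a single causal self-attention head, forward pass only, in plain numpy; everything that makes GPT-scale systems actually work (training, gradients, tokenization, parallelism) is deliberately absent, and the sizes are toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head). Forward pass only."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -1e9  # mask out the future
    return softmax(scores) @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))                     # 8 toy tokens, d_model=16
weights = [rng.normal(size=(16, 16)) for _ in range(3)]
print(causal_self_attention(tokens, *weights).shape)  # (8, 16)
```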

Nuclear energy was so incredible that it seemed like we had only an inch of the last mile left to go until flying cars. I don’t think the situation with ml and self-driving is similar, but a whole lot of time seems to have passed with not a lot of self-driving to show for it, and that problem is not among the hardest.

I’m not sure either narrative is correct, and to me it still seems like explaining ml starting with multivariate regressions, rather than with predictive coding and game theory, lays the better foundation. But I’ve certainly come to be more impressed at what can be achieved, while simultaneously becoming more skeptical that progress will keep inching forward with a similar slope.

vii—Schisms And Silos

As it stands I see three interesting and somewhat distinct directions under the umbrella of machine learning that are inching further apart.

You’ve got “classical” ml approaches that now have enough compute to tackle most high-but-not-millions-high dimensional problems, where the core problem is one of providing more theoretical guarantees, generating the kind of “explainability” required to be the basis of “causal” models, and steering the zeitgeist away from straight lines, p-values, and platonic shapes.

Then you have “industry adoption”, which is where most people with “machine learning” and “science” in their titles are actually working. I think this revolves a lot more around typical “automation” work, that is, data wrangling, domain logic understanding, and politicking. It’s just that the new wave of automation is now, as ever, backed by fancier tooling.

Finally, you have the more gilded-halls type of research which, as ever, is staffed by a few idealists thinking about thinking and a lot of lost students trying to paper-ping-pong their way into the tenure of the previously mentioned career track. This is what yields the most interesting discoveries, but they are hidden in piles of unactionable or low-impact noise. I’m not sure what’s happening beyond the wide-open doors, since I’m not good enough to filter the noise until years after my betters already have; but prima facie it does seem like abstractions are shifting in what would previously have been considered the realm of reinforcement learning.

At the unions between any two of the three, you’ve got the interesting stuff. AlphaFold is last-minute advancements in transformers joined with the scientific domain expertise needed to replace “manual” models for protein folding. Tesla’s Autopilot is SOTA vision, RL, and transfer learning joining forces with lobbyists, lawyers, and marketers to automate away double-digit percentages of jobs. Some people working on the replication crisis seem to be at the union between the first two, trying to systematize away human error in analyzing data and reviewing evidence, though I think I’d be calling this one a decade too early.

This is an imperfect categorization, but it does help me wrap my head around the whole field. I think it’s also a good paradigm to take when thinking about what problems to work on, with whom, and what background will be necessary to do so.