The ‘Bitter Lesson’ is Wrong

There are serious problems with the idea of the ‘Bitter Lesson’ in AI. In most cases, techniques other than scale prove extremely useful for a time and are then promptly abandoned as soon as scaling catches up to them, when we could just as easily combine the two and get still better performance. Hybrid algorithms work well all over the real world.


For instance, in computer science, quicksort is easily the most common sorting algorithm, but who uses a pure quicksort? In practice, implementations bolt on another algorithm that changes the base case, handles lists with only a few entries, and so on. People could have drawn the lesson that quicksort is simply better than these small-list algorithms once the input reaches any significant size, but that lesson would have prevented quicksort from improving.
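
As a minimal sketch of that hybrid idea (the cutoff of 16 and the partition scheme below are illustrative choices, not anyone's production code), here is a quicksort that hands small sub-lists to insertion sort:

```python
INSERTION_SORT_CUTOFF = 16  # illustrative threshold, not a tuned value

def insertion_sort(items, lo, hi):
    """Sort items[lo..hi] in place; fast for short runs."""
    for i in range(lo + 1, hi + 1):
        key = items[i]
        j = i - 1
        while j >= lo and items[j] > key:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = key

def hybrid_quicksort(items, lo=0, hi=None):
    """Quicksort that hands small ranges to insertion sort instead of recursing."""
    if hi is None:
        hi = len(items) - 1
    if hi - lo + 1 <= INSERTION_SORT_CUTOFF:
        insertion_sort(items, lo, hi)   # the hybrid step: a different base case
        return
    pivot = items[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                       # Hoare-style partition around the pivot
        while items[i] < pivot:
            i += 1
        while items[j] > pivot:
            j -= 1
        if i <= j:
            items[i], items[j] = items[j], items[i]
            i += 1
            j -= 1
    hybrid_quicksort(items, lo, j)
    hybrid_quicksort(items, i, hi)

# data = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]
# hybrid_quicksort(data)   # data is now sorted in place
```

This is exactly the pattern real standard libraries use: C++'s std::sort, for example, is typically an introsort that falls back to insertion sort on small ranges.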


Another, unrelated personal example. When I started listening to Korean music, it bothered me that I couldn’t understand what was being sung, so for some of my favorite songs I searched quite hard for translations.


When I couldn’t find translations for some songs, I decided to translate them myself. I knew no more than a phrase or two of Korean at the time, so I gave them to an AI (in this case Google Translate, which had already transitioned to deep-learning methods by then).
Google Translate’s translations were of unacceptably low quality, of course, so I used it only as a reference for individual words and short phrases. Crucially, I didn’t know Korean grammar at the time either (and Korean word order differs from English). It took me a long time to translate those first songs, but nowhere near long enough to count as much training on the task.


So how did my translations of a language I didn’t know compare to Google’s deep-learning output? The difference was night and day, in favor of my translations. That’s not because I’m special, but because I used the methods people discarded in favor of ‘just scale’. What’s a noun? What’s a verb? How do concepts fit together? Crucially, what kind of thing would a person be trying to say in a song, and how would they say it? And then checking it backwards: if they were trying to say this, would they have phrased it that way? I admit to a bit of prompt engineering, even though I didn’t know the term at the time; I knew instinctively that searching over similar inputs would give a better result.


I have since improved massively: by learning how Korean grammar works (which I started about eight months later), by learning more about the concepts behind English grammar, by learning what the words mean, and through simple practice, of course. My progress leveled off after a while, then started improving again when I consulted an actual dictionary; why don’t we give these LLMs real dictionaries? But even before any of that, I could beat the model simply by using basic concepts and approaches we refuse to supply these AIs with.
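
To make the dictionary idea concrete, here is a hypothetical sketch of how a translation pass could be grounded in a real glossary. The glossary entries and every function name are invented for illustration; this is not any actual translation API.

```python
# Tiny illustrative glossary; real systems would use a full bilingual dictionary.
GLOSSARY = {
    "사랑": "love",
    "노래": "song",
    "밤": "night",
}

def relevant_entries(source_text, glossary):
    """Keep only the dictionary entries whose headword appears in the source."""
    return {word: gloss for word, gloss in glossary.items() if word in source_text}

def revision_prompt(source_text, draft, glossary):
    """Bundle the source, a model's draft, and grounded entries for a second pass."""
    entries = relevant_entries(source_text, glossary)
    lines = [f"Source: {source_text}", f"Draft: {draft}", "Dictionary:"]
    lines += [f"  {word} = {gloss}" for word, gloss in entries.items()]
    lines.append("Revise the draft so each dictionary word is translated consistently.")
    return "\n".join(lines)

# Example: feed a first-pass machine translation back in alongside real definitions.
# print(revision_prompt("밤의 노래", "a song of the evening", GLOSSARY))
```

The draft could come from whatever model you like; the point is only that the revision pass sees grounded word senses instead of relying on scale alone.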


Google Translate has since improved quite a lot, but it is still far behind even those first translations of mine, where all I added was something even simpler than grammar, the basic structure of language and thought, along with actual grounding in English. Notably, that’s despite the fact that Google Translate is far ahead of more general models like GPT-J. Sure, my own scale dwarfs theirs, obviously, but I was trained on almost no Korean data at all; because I had these concepts, and a reference, I could do much better.


Even an example like AlphaGo wasn’t so much a triumph of deep learning over everything else as a demonstration that if you combine insights like search algorithms (Monte Carlo tree search) with heuristics trained from countless games, the result is much better.
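
For a rough sense of what that combination looks like, here is a toy sketch of the general AlphaGo-style recipe, not the actual AlphaGo code: a learned policy prior biases which moves the search explores (PUCT selection), and a learned value estimate replaces random rollouts at the leaves. The two network functions are random stand-ins, the root is assumed non-terminal, and the sign flipping needed for two-player scoring is omitted for brevity.

```python
import math
import random

def policy_prior(state, actions):
    """Stand-in for a trained policy network: a uniform prior over legal moves."""
    return {a: 1.0 / len(actions) for a in actions}

def value_estimate(state):
    """Stand-in for a trained value network: a noisy guess in [-1, 1]."""
    return random.uniform(-1.0, 1.0)

class Node:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior        # P(s, a) from the policy network
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c_puct=1.5):
    """Learned prior drives exploration; search statistics drive exploitation."""
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + u

def mcts(root_state, legal_actions, apply_action, is_terminal, simulations=200):
    root = Node(root_state, prior=1.0)
    for _ in range(simulations):
        node, path = root, [root]
        # 1. Selection: walk down by PUCT score until an unexpanded node.
        while node.children:
            parent = node
            _, node = max(parent.children.items(),
                          key=lambda kv: puct_score(parent, kv[1]))
            path.append(node)
        # 2. Expansion: the policy network seeds priors for the children.
        if not is_terminal(node.state):
            actions = legal_actions(node.state)
            priors = policy_prior(node.state, actions)
            for a in actions:
                node.children[a] = Node(apply_action(node.state, a), priors[a])
        # 3. Evaluation: the value network replaces a random rollout.
        leaf_value = value_estimate(node.state)
        # 4. Backup: push the estimate up the visited path.
        for n in path:
            n.visits += 1
            n.value_sum += leaf_value
    # Play the most-visited move at the root, as AlphaGo-style agents do.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Drop in trained policy and value networks for the two stand-ins and you have the hybrid: search supplies the lookahead, learning supplies the judgment about which branches are worth searching.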


The real bitter lesson is that we give up on improvements out of some misguided purity.