When to assume neural networks can solve a problem

Link post

Note: the original article has been split into two since I think the two points were only vaguely related, I will leave it as is here, since I’d rather not re-post stuff and I think the audience on LW might see the “link” between the two separate ideas presented here.

A pragmatic guide

Let’s begin with a gentle introduction in to the field of AI risk - possibly unrelated to the broader topic, but it’s what motivated me to write about the matter; it’s also a worthwhile perspective to start the discussion from. I hope for this article to be part musing on what we should assume machine learning can do and why we’d make those assumptions, part reference guide for “when not to be amazed that a neural network can do something”.

The Various hues of AI risk

I’ve often had a bone to pick against “AI risk” or, as I’ve referred to it, “AI alarmism”. When evaluating AI risk, there are multiple views on the location of the threat and the perceived warning signs.

1. The Bostromian position

I would call one of these viewpoints the “Bostromian position”, which seems to be mainly promoted by MIRI, philosophers like Nick Bostrom and on forums such as AI Alignment.

It’s hard to summarize without apparently straw-man arguments, e.g. “AIX + Moore’s law means that all powerful superhuman intelligence is dangerous, inevitable and close.” That’s partly because I’ve never seen a consistent top-to-bottom reasoning for it. Its proponents always seem to start by assuming things which I wouldn’t hold as given about the ease of data collection, the cost of computing power, the usefulness of intelligence.

I’ve tried to argue against this position, the summary of my view can probably be found in “Artificial general intelligence is here, and it’s useless”. Whilst—for the reasons mentioned there—I don’t see it as particularly stable, I think it’s not fundamentally flawed; I could see myself arguing pro or con.

2. The Standard Position

Advocated by people ranging from my friends, to politicians, to respectable academics, to CEOs of large tech companies. It is perhaps best summarized in Stuart Russell’s book Human Compatible: Artificial Intelligence and the Problem of Control.

This viewpoint is mainly based around real-world use cases for AI (where AI can be understood as “machine learning”). People adopting this perspective are not wrong in being worried, but rather in being worried about the wrong thing.

It’s wrong to be upset by Facebook or Youtube using an algorithm to control and understand user preferences and blaming it on “AI”, rather than on people not being educated enough to use TOR, install a tracking blocker, use Ublock Origin and not center their entire life around conspiracy videos in their youtube feed or in anti-vaccination facebook groups.

It’s wrong to be alarmed by Amazon making people impulse-buy via a better understanding of their preferences , and thus getting them into inescapable debt, rather than by the legality of providing unethically leveraged debt so easily.

It’s wrong to fuss about automated trading being able to cause sudden large dips in the market, rather than about having markets so unstable and so focused on short-term trading as to make this the starting point of the whole argument.

It’s wrong to worry about NLP technology being used to implement preventive policing measures, rather than about governments being allowed to steal their citizens’ data, to request backdoors into devices and to use preventive policing to begin with.

It’s wrong to worry about the Chinese Communist Party using facial recognition and tracking technology to limit civil rights; Worry instead about CCP ruling via a brutal dictatorship that implements such measures without anybody doing something against it.

But I digress, though I ought to give a full rebuttal of this position at some point.

3. The misinformed position

A viewpoint distinct from the previous two. It stems from misunderstanding what machine learning systems can already do. It basically consists in panicking over “new developments” which actually have existed for decades.

This view is especially worth fighting against, since it’s based on misinformation. Whereas with categories 1 and 2 I can see valid arguments arising for regulating or better understanding machine learning systems (or AI systems in general), people in the third category just don’t understand what’s going on, so they are prone to adopt any view out of sheer fear or need of belonging, without truly understanding the matter.

Until recently I thought this kind of task was better left to PBS. In hindsight, I’ve seen otherwise smart individuals being amazed that “AI” can solve a problem which anyone that has actually worked with machine learning would have been able to tell you is obviously solvable and has been since forever.

Furthermore, I think addressing this viewpoint is relevant, as it’s actually challenging and interesting. The question of “What are the problems we should assume can be solved with machine learning?”, or even narrower and more focused on current developments “What are the problems we should assume a neural network should be able to solve?”, is one I haven’t seen addressed much.

There are theories like PAC learning and AIX which at a glance seem to revolve around this, as it pertains to machine learning in general, but if actually tried in practice won’t yield any meaningful answer.

How people misunderstand what neural networks can do

Let’s look at the general pattern of fear generated by misunderstanding the machine learning capabilities we’ve had for decades.

Show a smart but relatively uninformed person—a philosophy PhD or an older street-smart businessman—a deep learning party trick.
Give them the most convoluted and scary explanation of why it works. E.g. Explain Deep-Dream by using incomplete neurological data about the visual cortex and human image processing, rather than just saying it’s the outputs of a complex edge detector overfit to recognize dog faces.
Wait for them to write an article about it in VOX & co

An example that originally motivated this article is Scott Alexander’s article post about being amazed that GPT-2 is able to learn how to play chess, poorly.

It seems to imply that GPT-2 playing chess well enough not to lose very badly against a medicore opponent (the author) is impressive and surprising.

Actually, the fact that a 1,500,000,000-parameter model designed for sequential inputs can be trained to kind of play chess is rather unimpressive, to say the least. I would have been astonished if GPT-2 were unable to play chess. Fully connected models a hundred times smaller ( https://github.com/pbaer/neural-chess) could do that more than 2 years ago.

The successful training of GPT-2 is not a feat because if a problem like chess has been already solved using various machine learning models we can assume it can be done with a generic neural network architecture (e.g. any given FC net or a FC net with a few attention layers) hundreds or thousands of times larger in terms of parameters.

When to assume a neural network can solve a problem

In the GPT-2 example, transformers (i.e. the BERT-like models inspired by the “Attention is all you need” paper’s proposed design) are pretty generic as far as NN architectures go. Not as generic as a fully connected net, arguably; they seem to perform more efficiently (in terms of training time and model size) on many tasks, and they are much better on most sequential input tasks.

So when should we assume that such generic NN architectures can solve a problem?

The answer might ease uniformed awe and might be relevant to actua problems – the kind for which “machine learning” might have been considered, but with doubt whether it’s worth bothering.

Playing chess decently is also a problem already solved. It can be done using small (compared to GPT-2) decision trees and a few very simple heuristics (see for example: https://github.com/AdnanZahid/Chess-AI-TDD). If a much smaller model can learn how to play “decently”, we should assume that a fairly generic, exponentially larger neural network can do the same.

The rule of thumb is:

1.A neural network can almost certainly solve a problem if another ML algorithm has already succeeded.

Given a problem that can be solved by an existing ML technique, we can assume that a somewhat generic neural network, if allowed to be significantly larger, can also solve it.

This assumption doesn’t always hold because:

a) Depending on the architecture, a neural network could easily be unable to optimize a given problem. Playing chess might be impossible for a conv network with large windows and step size, even if it’s very big.
b) Certain ML techniques have a lot of built-in heuristics that might be hard to learn for a neural network. The existing ML technique mustn’t have any critical heuristics built into it, or at least you have to be able to include the same heuristics into your neural network model.

As we are focusing mainly on generalizable neural network architectures (e.g. a fully connected net, which is what most people think of initially when they hear “neural network”), point a) is pretty irrelevant.

Given that most heuristics are applied equally well to any model, even for something like chess, and that size can sometimes be enough for the network to be able to just learn the heuristic, this rule basically holds almost every time.

I can’t really think of a counter example here… Maybe some specific types of numeric projections?

This is a rather boring first rule, yet worth stating as a starting point to build up from.

2. A neural network can almost certainly solve a problem very similar to ones already solved

Let’s say you have a model for predicting the risk of a given creditor based on a few parameters, e.g. current balance, previous credit record, age, driver license status, criminal record, yearly income, length of employment, {various information about current economic climate}, marital status, number of children, porn websites visited in the last 60 days.

Let’s say this model “solves” your problem, i.e. it predicts risk better than 80% of your human analysts.

But GDPR rolls along and you can no longer legally spy on some of your customers’ internet history by buying that data. You need to build a new model for those customers.

Your inputs are now truncated after and the customer’s online porn history is no longer available (or rather admittedly usable).

Is it safe to assume you can still build a reasonable model to solve this problem ?

The answer is almost certainly “yes; given our knowledge of the world, we can safely assume someone’s porn browsing history is not that relevant to their credit rating as some of those other parameters.

Another example: assume you know someone else is using a model, but their data is slightly different from yours.

You know a US-based snake-focused pet shop that uses previous purchases to recommend products and they’ve told you it’s done quite well for their bottom line. You are a UK-based parrot-focused pet shop. Can you trust their model or a similar one to solve your problem, if trained on your data ?

Again, the right answer is probably “yes”, because the data is similar enough. That’s why building a product recommendation algorithm was a hot topic 20 years ago, but nowadays everyone and their mom can just get a wordpress plugin for it and get close to Amazon’s level.

Or, to get more serious, let’s say you have a given algorithm for detecting breast cancer that—if trained on 100,000 images with follow-up checks to confirm the true diagnostics—performs better than an average radiologist.

Can you assume that, given the ability to make it larger, you can build a model to detect cancer in other types of soft tissue, also better than a radiologist ?

Once again, the answer is yes. The argument here is longer, because we aren’t so certain, mainly because of the lack of data. I’ve spent more or less a whole article arguing that the answer would still be yes.

In NLP the exact same neural network architectures seem to be decently good at doing translation or text generation in any language, as long as it belongs to the Indo European family and there is a significant corpus of data for it (i.e. equivalent to that used for training the extant models for English).

Modern NLP techniques seem to be able to tackle all language families, and they are doing so with less and less data. To some extent, however, the similarity of the data and the amount of training examples are tightly linked to the ability of a model in quickly generalizing for many languages.

Or looking at image recognition and object detection/boxing models, the main bottleneck consists in large amounts of well-labeled data, not the contents of the image. Edge cases exist, but generally all types of objects and images can be recognized and classified if enough examples are fed into an architecture originally designed for a different image task (e.g. a conv residual network designed for imagenet).

Moreover, given a network trained on imagenet, we can keep the initial weights and biases (essentially what the network “has learned”) instead of starting from scratch, and it will be able to “learn” on different datasets much faster from that starting point.

3. A neural network can solve problems that a human can solve with small-sized datapoints and little to no context

Let’s say we have 20x20px black and white images of two objects never seen before; they are “obviously different”, but not known to us . It’s reasonable to assume that, given a bunch of training examples, humans would be reasonably good at distinguishing the two.

It is also reasonable to assume, given a bunch of examples (let’s say 100), that almost any neural network of millions of parameters would ace this problem like a human.

You can visualize this in terms of amounts of information to learn. In this case, we have 400 pixels of 255 values each, so it’s reasonable to assume every possible pattern could be accounted for with a few million parameters in our equation.

But what “small datapoints” means here is the crux of this definition.

In short, “small” is a function of:

The size of your model. The bigger a model, the more complex the patterns it can learn, the bigger your possible inputs/outputs.
The granularity of the answer (output). E.g 1,000 classes vs 10 classes, or an integer range from 0 to 1,000 vs one from 0 to 100,000. In this case 2.
The size of the input. In this case 400, since we have a 20x20 image.

Take a classic image classification task like MNIST. Although a few minor improvements have been made, the state-of-the-art for MNIST hasn’t progressed much. The last 8 years have yielded an improvement from ~98.5% to ~99.4%, both of which are well within the usual “human error range”.

Compare that to something much bigger in terms of input and output size, like ImageNet, where the last 8 years have seen a jump from 50% to almost 90%.

Indeed, even with pre-CNN techniques, MNIST is basically solveable.

But even having defined “small” as a function of the above, we don’t have the formula for the actual function. I think that is much harder, but we can come up with a “cheap” answer that works for most cases - indeed, it’s all we need:

A given task can be considered small when other tasks of equal or larger input and output size have already been solved via machine learning with more than one architecture on a single GPU

This might sound like a silly heuristic, but it holds surprisingly well for most “easy” machine learning problems. For instance, the reason many NLP tasks are now more advanced than most “video” tasks is size, despite the tremendous progress on images in terms of network architecture (which are much closer to the realm of video). The input & output size for meaningful tasks on videos is much larger; on the other hand, even though NLP is in a completely different domain, it’s much closer size-wise to image processing.

Then, what does “little to no context” mean ?

This is a harder one, but we can rely on examples with “large” and “small” amounts of context.

Predicting the stock market likely requires a large amount of context. One has to be able to dig deeper into the companies to invest in; check on market fundamentals, recent earning calls, the C-suite’s history; understand the company’s product; maybe get some information from it’s employees and customers, if possible, get insider info about upcoming sales and mergers, etc.

You can try to predict the stock market based purely on indicators about the stock market, but this is not the way most humans are solving the problem.

On the other hand, predicting the yield of a given printing machine based on temperature and humidity in the environment could be solved via context, at least to some extent. An engineer working on the machine might know that certain components behave differently in certain conditions. In practice, however, an engineer would basically let the printer run, change the conditions, look at the yield, then come up with an equation. So given that data, a machine learning algorithm can also probably come up with an equally good solution, or even a better one.

In that sense, an ML algorithm would likely produce results similar to a mathematician in solving the equation, since the context would be basically non-existent for the human.

There are certainly some limits. Unless we test our machine at 4,000 C the algorithm has no way of knowing that the yield will be 0 because the machine will melt; an engineer might suspect that.

So, I can formulate this 3rd principle as:

A generic neural network can probably solve a problem if:

A human can solve it
Tasks with similarly sized outputs and inputs have already been solved by an equally sized network
Most of the relevant contextual data a human would have are included in the input data of our algorithm.

Feel free to change my mind (with examples).

However, this still requires evaluating against human performance. But a lot of applications of machine learning are interesting precisely because they can solve problems humans can’t. Thus, I think we can go even deeper.

4. A neural network might solve a problem when we are reasonably sure it’s deterministic, we provide any relevant context as part of the input data, and the data is reasonably small

Here I’ll come back to one of my favorite examples—protein folding. One of the few problems in science where data is readily available, where interpretation and meaning are not confounded by large amounts of theoretical baggage, and where the size of a datapoint is small enough based on our previous definition. You can boil down the problem to:

Around 2,000 input features (amino acids in the tertiary structure), though this means our domain will only cover 99.x% of proteins rather than literally all of them.
Circa 18,000 corresponding output features (number of atom positions in the tertiary structure, aka the shape, needing to be predicted to have the structure).

This is one example. Like most NLP problems, where “size” becomes very subjective, we could easily argue one-hot-encoding is required for this type of inputs; then the size suddenly becomes 40,000 (there’s 20 proteinogenic amino acids that can be encoded by DNA) or 42,000 (if you care about selenoproteins and 44,000 if you care about niche proteins that don’t appear in eukaryotes).

It could also be argued that the input & output size is much smaller, since in most cases proteins are much smaller and we can mask & discard most of inputs & outputs for most cases.

Still, there are plenty of tasks that go from an, e.g. 255x255 pixel image to generate another 255x255 pixel image (style alternation, resolution enhancement, style transfer, contour mapping… etc). So based on this I’d posite the protein folding data is reasonably small and has been for the last few years.

Indeed, resolution enhancement via neural networks and protein folding via neural networks came about at around the same time (with every similar architecture, mind you). But I digress; I’m mistaking a correlation for the causal process that supposedly generated it. Then again, that’s the basis of most self-styled “science” nowadays, so what is one sin against the scientific method added to the pile ?

Based on my own fooling around with the problem, it seems that even a very simple model, simpler than something like VGG, can learn something ”meaningful” about protein folding. It can make guesses better than random and often enough come within 1% of the actual position of the atoms, if given enough (135 millions) parameters and half a day of training on an RTX2080. I can’t be sure about the exact accuracy, since apparently the exact evaluation criterion here is pretty hard to find and/or understand and/or implement for people that aren’t domain experts… or I am just daft, also a strong possibility.

To my knowledge the first widely successful protein folding network AlphaFold, whilst using some domain-specific heuristics, did most of the heavy lifting using a residual CNN, an architecture designed for categorizing images, something as widely unrelated with protein folding as one can think of.

That is not to say any architecture could have tackled this problem as well. It rather means we needn’t build a whole new technique to approach this type of problem. It’s the kind of problem a neural network can solve, even though it might require a bit of looking around for the exact network that can do it.

The other important thing here is that the problem seems to be deterministic. Namely:

a) We know peptides can be folded into proteins, in the kind of inert environment that most of our models assume, since that’s what we’ve always observed them to do.
b) We know that amino acids are one component which can fully describe a peptide
c) Since we assume the environment is always the same and we assume the folding process itself doesn’t much alter it, the problem is not a function of the environment (note, obviously in the case of in-vitro folding, in-vivo the problem becomes much harder)

The issue arises when thinking about b), that is to say, we know that the universe can deterministically fold peptides; we know amino acids are enough to accurately describe a peptide. However, the universe doesn’t work with “amino acids”, it works with trillions of interactions between much smaller particles.

So while the problem is deterministic and self-contained, there’s no guarantee that learning to fold proteins doesn’t entail learning a complete model of particle physics that is able to break down each amino acid into smaller functional components. A few million parameters wouldn’t be enough for that task.

This is what makes this 4th most generic definition the hardest to apply.

Some other examples here are things like predictive maintenance where machine learning models are being actively used to tackle problems human can’t, at any rate not without mathematical models. For these types of problems, there’s strong reasons to assume, based on the existing data, that the problems are partially (mostly?) deterministic.

There are simpler examples here, but I can’t think of any that, at the time of their inception, didn’t already fall into the previous 3 categories. At least, none that aren’t considered reinforcement learning.

The vast majority of examples fall within reinforcement learning, where one can solve an impressive amount of problems once they are able to simulate them.

People can find optimal aerodynamic shapes, design weird antennas to provide more efficient reception/coverage, beat video games like DOT and Starcraft which are exponentially more complex (in terms of degrees of freedom) than chess or Go.

The problem with RL is that designing the actual simulation is often much more complicated than using it to find a meaningful answer. RL is fun to do but doesn’t often yield useful results. However, edge cases do exist where designing the simulation does seem to be easier than extracting inferences out of it. Besides that, the more simulations advance based on our understanding of efficiently simulating physics (in itself helped by ML), the more such problems will become ripe for the picking.

In conclusion

I’ve attempted to provide a few simple heuristics for answering the question “When should we expect that a neural network can solve a problem ?”. That is to say, to what problems should you apply neural networks, in practice, right now. What problems should leave you “unimpressed” when solved by a neural network ? For which problems should our default hypothesis include their solvability, given enough architecture searching and current GPU capabilities.

I think this is fairly useful—not only for not getting impressed when someone shows us a party trick and tells us it’s AGI—but also for helping us quickly classify a problem as “likely solvable via ML” and “unlikely to be solved by ML”

To recap, neural networks can probably solve your problem:

[Almost certainty] If other ML models already solved the problem.
[Very high probability] If a similar problem has already been solved by an ML algorithm, and the differences between that and your problem don’t seem significant.
[High probability] If the inputs & outputs are small enough to be comparable in size to those of other working ML models AND if we know a human can solve the problem with little context besides the inputs and outputs.
[Reasonable probability] If the inputs & outputs are small enough to be comparable in size to those of other working ML models AND we have a high certainty about the deterministic nature of the problem (that is to say, about the inputs being sufficient to infer the outputs).

I am not certain about any of these rules, but this comes back to the problem of being able to say something meaningful. PACL can give us almost perfect certainty and is mathematically valid but it breaks down beyond simple classification problems.

Coming up with this kind of rules doesn’t provide an exact degree of certainty and they are derived from empirical observations. However, I think they can actually be applied to real world problems.

Indeed, these are to some extent the rules I do apply to real world problems, when a customer or friend asks me if a given problem is “doable”. These seem to be pretty close to the rules I’ve noticed other people using when thinking about what problems can be tackled.

So I’m hoping that this could serve as an actual practical guide for newcomers to the field, or for people that don’t want to get too involved in ML itself, but have some datasets they want to work on.