A short introduction to machine learning

Despite the current popularity of machine learning, I haven’t found any short introductions to it which quite match the way I prefer to introduce people to the field. So here’s my own. Compared with other introductions, I’ve focused less on explaining each concept in detail, and more on explaining how they relate to other important concepts in AI, especially in diagram form; I hope that this makes it useful for people who, like me, prefer to develop a top-down understanding of new fields. I’m aware that high-level taxonomies can be controversial, and also that it’s easy to fall into the illusion of transparency when trying to introduce a field; so suggestions for improvements are very welcome!

The key ideas are contained in this summary diagram:

First, some quick clarifications:

  • None of the boxes are meant to be comprehensive; we could add more items to any of them. So you should picture each list ending with “and others”.

  • The distinction between tasks and techniques is not a firm or standard categorisation; it’s just the best way I’ve found so far to lay things out.

  • The summary is explicitly from an AI-centric perspective. For example, statistical modelling and optimisation are fields in their own right; but for our current purposes we can think of them as machine learning techniques.

Let’s dig into each part of the diagram now, starting from the top.

Paradigms of artificial intelligence

The field of artificial intelligence attempts to develop computer programs that possess the capabilities associated with intelligence in humans: language skills, visual perception, motor control, and so on. It got started around the 1950s. Historically, there have been several different approaches to AI. In the first few decades, the dominant paradigm was symbolic AI, which focused on representing problems using high-level mathematical equations, then solving them using search and logic. One highlight was Deep Blue, the chess AI that beat Kasparov in 1997. However, the symbolic representations designed by AI researchers turned out to be far too simple to allow symbolic AIs to handle complex real-world phenomena.

Since the 1990s, the dominant paradigm in AI has been machine learning, which allows AIs to improve their performance based on experience and feedback (known as the learning, training or optimisation process).* The most basic machine learning techniques are statistical models, such as linear regression—which in its simplest form only learns the values of two parameters to represent the training data. Although most people don’t think of linear regression as a machine learning technique, it’s hard to draw a clear boundary between statistical models and more central examples of machine learning techniques; hence I’ve included statistical modelling in the diagram above. However, the biggest successes of machine learning have come from applying techniques at a much larger scale than standard statistical modelling—in particular by training large neural networks with many layers, using powerful optimisation techniques like backpropagation. This is known as deep learning. Neural networks have been around since the beginning of AI, but they only became the dominant paradigm in the early 2010s, after increases in compute availability allowed us to train much bigger networks. Let’s explore the components of deep learning in more detail now.
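Before moving on, here is the two-parameter linear regression mentioned above as a minimal sketch. The function name and the toy data are made up for illustration; the point is only that "learning" here means choosing a slope and an intercept that best fit the training data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (the two learned parameters)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Real datasets are noisy, so the fitted line will usually pass near the points rather than through them.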

Deep learning: neural networks and optimisation

Neural networks are a type of machine learning model inspired by the brain. They consist of multiple connected layers of artificial neurons, represented by circles in the diagram below. Note that networks with more than one layer between the input and the output layers are known as deep neural networks; these days, almost all neural networks are deep.

Like biological neurons, each artificial neuron receives signals from other neurons, combines them together into a single value (known as its activation), and then passes that value on to other neurons. As in biological brains, the signal that is passed between a pair of artificial neurons is affected by the strength of the connection between them, so for each of the lines in the diagram we need to store a single number representing the strength of the connection, known as a weight. Unlike in brains, though, the neurons in an artificial neural network are organised into layers, where each neuron only receives signals from the previous layer, and only passes on signals to the next layer. The weights of a neuron’s connections to the previous layer determine how strongly it activates for any given input.
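The description above can be sketched in a few lines of code. This is an illustration under assumptions rather than a reference implementation: the bias term and the sigmoid squashing function are common choices that the text does not mention, and all names and numbers are invented.

```python
import math


def neuron_activation(inputs, weights, bias=0.0):
    """Combine incoming signals into one value: weighted sum, then a squashing function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid (an assumed choice of nonlinearity)


def layer_forward(inputs, weight_rows, biases):
    """Each neuron in a layer sees only the previous layer's activations."""
    return [neuron_activation(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical tiny network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
inputs = [1.0, 2.0]
hidden = layer_forward(inputs, [[0.5, -0.25], [0.0, 0.0]], [0.0, 0.0])
output = layer_forward(hidden, [[1.0, -1.0]], [0.0])
print(hidden, output)
```

Each inner list of weights corresponds to the lines entering one neuron in the diagram.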

These weights are not manually specified; instead they are learned via a process of optimisation, which finds weights that make the network score highly on whatever metric we’re using. (This metric is known as an objective function or loss function; it’s evaluated over whatever dataset we’re using during training.) By far the most common optimisation algorithm is gradient descent, which initially sets weights to arbitrary values, and then at each step changes them so that the network does slightly better on its objective function (in more technical terms, it computes the gradient of the objective function with respect to each weight, then nudges the weight in whichever direction improves the objective: down the gradient if the objective is a loss to be minimised). Gradient descent is a very general optimisation algorithm, but it’s particularly efficient when applied to neural networks because at each step the gradients with respect to the weights can be calculated layer-by-layer, starting from the last layer and working backwards, using the backpropagation algorithm. This allows us to train networks which contain billions of weights, each of which is updated many times over the course of training.
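To make the update rule concrete, here is a minimal sketch of gradient descent on a model with just two weights. The mean-squared-error loss, the learning rate, the step count and the toy data are all assumptions chosen for illustration, and the gradients are written out by hand rather than computed by backpropagation.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the model y = w * x + b on the dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, xs, ys, learning_rate):
    """One step of gradient descent: move each weight against its loss gradient."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Hypothetical data generated by y = 2x + 1; weights start at arbitrary values.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = 0.0, 0.0
initial_loss = mse_loss(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.05)
final_loss = mse_loss(w, b, xs, ys)
print(w, b, initial_loss, final_loss)
```

In a deep network the same loop applies, except that there are far more weights and backpropagation supplies the gradients.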

As a result of optimisation, the weights end up storing information which allows different neurons to recognise different features of the input. As an example, consider a neural network known as Inception, which was trained to classify images. Each neuron in Inception’s input layer was assigned to a single pixel of the input image. Neurons in each successive layer then learned to activate in response to increasingly high-level features of the input image. The diagram shows some of the patterns recognised by neurons in five consecutive layers from the Inception model, in each case by combining patterns from the previous layer—from colours to (Gabor filters for) textures to lines to angles to curves. This goes on until the last layer, which represents the network’s final output—in this case the probabilities of the input image containing cats, dogs, and various other types of object.

One last point about neural networks: in our earlier neural network diagram, every neuron in a given layer was connected to every neuron in the layers next to it. This is known as a fully-connected network, the most basic type of neural network. In practice, fully-connected networks are seldom used; instead there is a whole range of different neural network architectures which connect neurons in different ways. Three of the most prominent (convolutional networks, recurrent networks, and transformers) are listed in the original summary diagram; however, I won’t cover any of the details here.

Machine learning tasks

I’ve described how neural networks (and other machine learning models) can be trained to perform different tasks. These tasks are often divided into three categories (supervised, unsupervised, and reinforcement learning) based on what type of data and what type of objective function they use. The first is supervised learning, which requires a dataset where each datapoint has a corresponding label. The objective in supervised learning is for a model to predict the label which corresponds to each datapoint. For example, the image classification network we discussed above was trained on a dataset of images, each labelled with the type of object it contained. Alternatively, if the labels had been ratings of how beautiful each image was, we could have used supervised learning to produce a network that rated image beauty. These two examples showcase different types of supervised learning: the former is a classification problem (requiring the prediction of discrete categories) and the latter is a regression problem (requiring the prediction of continuous values). Historically, supervised learning has been the most studied task in machine learning, and techniques devised to solve it have been extensively used as parts of the solutions to the other two.
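One way to see the difference between regression and classification is in how a prediction gets scored against its label. The sketch below uses squared error and cross-entropy, which are common choices but are my assumption here, since the text doesn't name specific objective functions; the labels and numbers are invented.

```python
import math


def squared_error(predicted_value, true_value):
    """Regression: the label is a continuous value, so measure distance from it."""
    return (predicted_value - true_value) ** 2


def cross_entropy(predicted_probs, true_label):
    """Classification: the label is a category, so penalise low probability on it."""
    return -math.log(predicted_probs[true_label])


# Hypothetical beauty rating (regression) and object label (classification).
rating_loss = squared_error(7.5, 8.0)
label_loss = cross_entropy({"cat": 0.5, "dog": 0.5}, "cat")
print(rating_loss, label_loss)
```

In both cases a lower value means the prediction matched the label more closely.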

One downside of supervised learning is that labelling a dataset usually needs to be done manually by humans, which is expensive and time-consuming. Learning from an unlabelled dataset is known as unsupervised learning. The standard objective in unsupervised learning is to try to predict or compress the dataset itself, which often allows us to generate more data that is similar to the training data. Some impressive applications of unsupervised learning are GPT-2 and GPT-3 for language, and DALL-E for images.**
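As a toy illustration of "predicting the dataset itself", the sketch below learns which character tends to follow which in an unlabelled string, simply by counting. This is a deliberately crude stand-in that I've chosen for brevity; it is not how the models named above are built, and the corpus and function names are made up.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every character, which characters follow it in the raw text."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts


def predict_next(counts, current_char):
    """Predict the most frequently observed successor of a character."""
    return counts[current_char].most_common(1)[0][0]


# Hypothetical unlabelled "dataset": the text supplies its own prediction targets.
corpus = "abababac"
model = train_bigram_model(corpus)
print(predict_next(model, "a"), predict_next(model, "b"))
```

No human labelling was needed: every character serves as the target for the one before it.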

Finally, in reinforcement learning, the data source is not a fixed dataset, but rather an environment in which the AI takes actions and receives observations—essentially as if it’s playing a video game. After each action, the agent also receives a reward (similar to the score in a video game), which is used to reinforce the behaviour that leads to high rewards, and reduce the behaviour that leads to low rewards. Since actions can have long-lasting consequences, the key difficulty in reinforcement learning is determining which actions are responsible for which rewards—a problem known as credit assignment. So far the most impressive demonstrations of reinforcement learning have been in training agents to play board games and esports—most notably AlphaGo, AlphaStar and OpenAI Five.
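One simple and widely used heuristic for credit assignment is to credit each action with all the reward that arrives after it, counting later rewards for less. The sketch below assumes this "discounted return" approach and a made-up discount factor; neither is specified in the text above, and real systems layer much more on top.

```python
def discounted_returns(rewards, discount=0.9):
    """For each time step, sum the rewards from then on, shrinking later ones."""
    returns = []
    running = 0.0
    # Work backwards so each step can reuse the return of the step after it.
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: the only reward arrives after the third action.
episode_rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, discount=0.5)
print(returns)
```

Earlier actions receive some credit for the final reward, but less than the action that immediately preceded it.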

Here’s a more detailed breakdown of some of the tasks and techniques corresponding to these three types of learning. I’ve only mentioned a few of these terms so far; I’ve included the others to help you classify them in case you’ve seen them before, but don’t worry if many of them are unfamiliar.

Solving real-world tasks

We’re almost done! But I don’t think that even a brief summary of AI and machine learning can be complete without adding three more concepts. They don’t quite fit into the taxonomy I’ve been using so far, so I’ve modified the original summary diagram to fit them in:

Let’s think of these three dotted lines I’ve added as ways to connect the different levels. The ultimate goal of the field of AI is to create systems that can perform valuable tasks in the real world. In order to apply machine learning techniques to achieve this, we need to set up supervised/unsupervised/reinforcement learning tasks which correspond closely to the abilities we’d like our systems to have. I’m calling this step task design. A key aspect is designing datasets or environments which are as similar as possible to the real-world task. In reinforcement learning, task design also includes the problem of designing a reward function to specify the desired behaviour, which is often more difficult than we expect.

But no matter how well we design a machine learning task, we will face two problems. Firstly, we can only ever train our models on a finite amount of data from that task. For example, when training an AI to play chess, there are many possible board positions that it will never experience. So our optimisation algorithms could in theory produce chess AIs that can only play well on positions that they already experienced during training. In practice this doesn’t happen: instead deep learning tends to generalise incredibly well to examples it hasn’t seen already. How and why it does so is, however, still poorly understood.
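The contrast between memorising the training data and generalising from it can be shown with a toy example. Both "models" below are hypothetical constructions for illustration: one stores the training examples in a lookup table, the other fits a single weight to them.

```python
def make_memoriser(train_pairs):
    """Stores the training examples verbatim; has no answer for unseen inputs."""
    table = dict(train_pairs)
    return lambda x: table.get(x)


def make_fitted_rule(train_pairs):
    """Fits y = w * x by least squares, so it can answer for any input."""
    w = sum(x * y for x, y in train_pairs) / sum(x * x for x, _ in train_pairs)
    return lambda x: w * x


# Hypothetical training data following y = 2x, plus an input never seen in training.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
unseen_x = 10.0
memoriser = make_memoriser(train)
fitted = make_fitted_rule(train)
print(memoriser(2.0), fitted(2.0))            # both are right on training data
print(memoriser(unseen_x), fitted(unseen_x))  # only the fitted rule generalises
```

Both do equally well on the training data, which is why performance is normally measured on held-out examples instead.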

Secondly, due to the immense complexity of the real world, there will be ways in which our machine learning tasks are incomplete or biased representations of the real-world tasks we really care about. For example, consider an AI which has been trained to play chess against itself, and which now starts to play against a human who has very different strengths and weaknesses. In some sense playing well against the human requires it to transfer its original experience to this new task (although the line between generalisation to different examples of “the same task” versus transfer to “a new task” is very blurry). We’re also beginning to see neural networks whose skills transfer to new tasks which differ significantly from the ones on which they were trained—most notably the GPT-3 language model, which can perform a very wide range of tasks. As we develop increasingly powerful AIs that perform increasingly important real-world tasks, ensuring their safe behaviour will require a much better understanding of how their skills and motivations will transfer from their training environments to the wider world.


* Learning, training and optimisation have slightly different meanings, but they all refer to the process by which a machine learning system gains skills based on data.

** Note that the training process used for these is more specifically known as self-supervised learning, which can be thought of as halfway between unsupervised and supervised learning. For simplicity, though, I’ve classified it here as unsupervised learning.
