Reinforcement Learning: A Non-Standard Introduction (Part 1)

Imagine that the world is divided into two parts: one we shall call the agent and the rest—its environment. Imagine you could describe in full detail the state of both the agent and the environment. The state of the agent is denoted M: it could be a Mind if you’re a philosopher, a Machine if you’re researching machine learning, or a Monkey if you’re a neuroscientist. Anyway, it’s just the Memory of the agent. The state of the rest of the World (or just World, for short) is denoted W.

These states change over time. In general, when describing the dynamics of a system, we specify how each state is determined by the previous states. So we have probability distributions for the states $W_t$ and $M_t$ of the world and the agent at time $t$:

$$p(W_t \mid W_{t-1}, M_{t-1})$$

$$q(M_t \mid W_{t-1}, M_{t-1})$$

These give us the probability that the world is currently in state $W_t$, and the agent in state $M_t$, given that they were previously in states $W_{t-1}$ and $M_{t-1}$. This can be illustrated in the following Bayesian network:

Bayesian networks look like they represent causation: that the current state is “caused” by the immediately previous state. But what they really represent is statistical independence: that the current joint state $(W_t, M_t)$ depends only on the immediately previous joint state $(W_{t-1}, M_{t-1})$, and not on any earlier state. So the power of Bayesian networks is in what they don't show: in this case, there is no arrow from, say, $W_{t-2}$ to $W_t$.

The current joint state of the world and the agent represents everything we need to know in order to continue the dynamics forward. Given this state, the past is independent of the future. This property is so important that it has a name, borrowed from one of its earliest researchers, Markov.
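To make this concrete, here is a minimal Python sketch of the general dynamics, using entirely made-up toy distributions: the states are plain labels, and the kernels p and q are dictionaries mapping the previous joint state to a distribution over the next state.

```python
import random

# A toy sketch of the general dynamics above. Everything here is
# hypothetical and for illustration only: states are plain labels,
# and the kernels p and q are dictionaries mapping the previous
# joint state (W_{t-1}, M_{t-1}) to a distribution over the next state.

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights, k=1)[0]

# p(W_t | W_{t-1}, M_{t-1}): how the world changes
p = {
    ("sunny", "happy"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "grumpy"): {"sunny": 0.5, "rainy": 0.5},
    ("rainy", "happy"):  {"sunny": 0.4, "rainy": 0.6},
    ("rainy", "grumpy"): {"sunny": 0.3, "rainy": 0.7},
}

# q(M_t | W_{t-1}, M_{t-1}): how the agent changes
q = {
    ("sunny", "happy"):  {"happy": 0.9, "grumpy": 0.1},
    ("sunny", "grumpy"): {"happy": 0.6, "grumpy": 0.4},
    ("rainy", "happy"):  {"happy": 0.5, "grumpy": 0.5},
    ("rainy", "grumpy"): {"happy": 0.2, "grumpy": 0.8},
}

def step(w_prev, m_prev):
    """One tick of the general model: both new states depend only on
    the previous joint state -- that is the Markov property."""
    return sample(p[(w_prev, m_prev)]), sample(q[(w_prev, m_prev)])

w, m = "sunny", "happy"
for t in range(1, 6):
    w, m = step(w, m)
    print(t, w, m)
```

Note that in this general form both updates read the same previous pair $(W_{t-1}, M_{t-1})$; nothing yet forces the world and the agent to take turns, which is exactly the extra assumption we add next.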

The Markov property is not enough for our purposes. We are going to make a further assumption, which is that the states of the world and the agent don’t both change together. Rather, they take turns changing, and while one does the other remains the same. This gives us the dynamics:

$$p(W_t \mid W_{t-1}, M_{t-1})$$

$$q(M_t \mid M_{t-1}, W_t)$$

and the Bayesian network:

Sometimes this assumption can be readily justified. For example, let’s use this model to describe a chess player.

Suppose that at time $t$ the game has reached state $W_t$, where it is our agent's turn to play. Our agent has also reached a decision about what to do next, and its mind is now in state $M_t$, including memory, plan, general knowledge of chess, and all.

Our agent takes its turn, and then enters stasis: we are going to assume that it's not thinking off-turn. This is true of most existing artificial chess players, and, disregarding time constraints, their play is not worse off for it. They are not missing out on anything other than time to think. So the agent keeps its state until the opponent has taken its turn. This completes the change of the state of the game from $W_t$ to $W_{t+1}$.

Now the agent takes a look at the board, and starts thinking up a new strategy to counter the last move of the opponent. It reaches a decision and commits to its next action. This completes the change of the agent's state from $M_t$ to $M_{t+1}$.
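The chess story can be written as the same kind of loop. The following is only a sketch of the turn-taking structure, not a chess engine: play, opponent_moves and agent_thinks_and_moves are hypothetical names standing in for the kernels p and q, and the dummy lambdas at the end exist only to show the call shape.

```python
# A sketch of the turn-based dynamics in pseudo-chess terms. The two
# callables are hypothetical stand-ins for the kernels p and q; only
# the order of the updates matters here.

def play(board, mind, n_turns, opponent_moves, agent_thinks_and_moves):
    for _ in range(n_turns):
        # World's turn: the agent's committed move is applied and the
        # opponent replies, while the mind stays fixed.
        board = opponent_moves(board, mind)          # W_t ~ p(. | W_{t-1}, M_{t-1})
        # Agent's turn: it looks at the new board, thinks, and commits
        # to its next move, while the board stays fixed.
        mind = agent_thinks_and_moves(mind, board)   # M_t ~ q(. | M_{t-1}, W_t)
    return board, mind

# Dummy stand-ins, just to show the call shape:
final_board, final_mind = play(
    board="initial position",
    mind="opening plan",
    n_turns=3,
    opponent_moves=lambda w, m: w + " -> moved",
    agent_thinks_and_moves=lambda m, w: m + " -> replanned",
)
print(final_board)
print(final_mind)
```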

Chess is a turn-based game. But even in other scenarios, where such a division of the dynamics into turns is not a good approximation of the process, our assumption can still be justified. As the length of each time step is taken to be smaller and smaller, the state of each party remains the same during each step with increasing probability and accuracy. In the limit, where we describe a continuous change of state over time, the turn-based assumption disappears, and we are back to the general model.


This is the first part of an intuitive and highly non-standard introduction to reinforcement learning. The perspective here is more typical of what neuroscientists mean when they use the term; as we move forward, we will get closer to its meaning in machine learning (but not too close).

In the following posts, we will continue to assume the Markov property in its turn-based variant. We will describe the model in further detail and explore its decision-making aspect.

Continue reading: Part 2