Reinforcement learning I – Temporal difference learning


After I started working with reward-modulated STDP in spiking neural networks, I got curious about the research it is based on. This led me to the book “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto. The book is from 1998 and it’s freely readable on the internet! In the book’s introduction they cover the example of an agent learning to beat a given (imperfect) opponent in the game of Tic Tac Toe. Two remarks have to be made: 1. The opponent has to be imperfect because a perfect Tic Tac Toe player can never be beaten; at best, you draw against it. 2. The agent does not learn the rules of Tic Tac Toe, those are assumed; it learns a value map for its policy.

Since I liked the example and wanted to try it out myself, I decided to write this blog post about it. By the way, the code can be found on github.


A few quotes taken from [Sutton & Barto 2005]:

Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. Supervised learning is learning from examples provided by a knowledgeable external supervisor.

Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. The formulation is intended to include just these three aspects–sensation, action, and goal–in their simplest possible forms without trivializing any of them.

Figure taken from [Sutton & Barto 2005]

One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.

The four elements of reinforcement learning

Quotes taken from [Sutton & Barto 2005]

  • A policy is a mapping from perceived states of the environment to actions to be taken when in those states. […] The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

  • A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. A reinforcement learning agent’s sole objective is to maximize the total reward it receives in the long run.

  • Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. […]
    We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. […]
    Rewards are basically given directly by the environment, but values must be estimated and reestimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.

  • The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward.

Learning a value function for Tic Tac Toe

The learning process comprises N games. Each game consists of K ≥ 3 turns. At the beginning of every game, the board gets reset to a state in which all nine fields are empty. Our agent plays Os and therefore plays the first turn in every game. At the beginning of the learning process the value map is initialized to 0.5 for every board state, except for board states that show winning configurations for O, which have value 1.0, and board states that show draws or winning configurations for X, which have value 0.0.
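As a minimal sketch of this initialization (my own naming, not the author’s code; a board state is assumed to be a tuple of nine cells, each ' ', 'O' or 'X'):

```python
# All eight winning lines on a 3x3 board, as index triples.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(state):
    """Return 'O' or 'X' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if state[a] != ' ' and state[a] == state[b] == state[c]:
            return state[a]
    return None

def initial_value(state):
    """Initial value of a board state as described above."""
    w = winner(state)
    if w == 'O':
        return 1.0                    # winning configuration for O
    if w == 'X' or ' ' not in state:
        return 0.0                    # loss for O, or a draw (full board)
    return 0.5                        # every other state starts at 0.5
```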

Our agent can make two types of moves (actions): 1. Exploratory moves, which ignore the value of the resulting board configuration, and 2. Greedy moves, which seek to maximize the value of the resulting board configuration. The probability of picking an exploratory move is p_ex.
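This move selection can be sketched as follows (a sketch under my own assumptions, not the author’s implementation: the board is a tuple of nine cells, values is a dict mapping board tuples to values, unseen states default to 0.5):

```python
import random

def choose_move(board, values, p_ex=0.2):
    """Pick the index of an empty field, exploratory with probability p_ex."""
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if random.random() < p_ex:
        return random.choice(moves)            # exploratory move
    # Greedy move: maximize the value of the resulting board configuration.
    def value_after(i):
        nxt = board[:i] + ('O',) + board[i + 1:]
        return values.get(nxt, 0.5)            # unseen states default to 0.5
    return max(moves, key=value_after)
```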

After every greedy move, we update the value of the state s before our opponent’s last move, moving it a bit closer to the value of the state s' after our last move: V(s) ← V(s) + α [V(s') − V(s)]. This is detailed in the following picture:

Figure taken from [Sutton & Barto 2005]

The idea behind this update rule is to move the value of the state that resulted from our previous move closer to the value of the state after our current move. This way, value propagates back from the goal state along our chain of actions towards the start state.

According to [Sutton & Barto 2005],

This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, V(s') - V(s), between estimates at two different times.
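The update rule itself is a one-liner. A sketch, assuming the value map is a plain dict with unseen states defaulting to 0.5 (the function name is mine):

```python
def td_update(values, s, s_next, alpha=0.1):
    """Temporal-difference update: V(s) <- V(s) + alpha * (V(s') - V(s))."""
    v_s = values.get(s, 0.5)
    v_next = values.get(s_next, 0.5)
    values[s] = v_s + alpha * (v_next - v_s)
```

With alpha = 0.1 and a successor of value 1.0, a state at the default 0.5 moves to 0.55; repeated visits push it further towards 1.0.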

Each game is a series of board configurations. For greedy turns, the agent has to calculate the board configurations for all possible actions and pick the one with the highest value. Since the value map is initialized to 0.5 for all states except winning and losing ones, and since there are 19,683 board states (naively counted as 3^9), it makes sense to evaluate the value map lazily (although I must admit that I have not optimized my code and have not verified this assumption).

An implementation of TD learning for TTT

I’ve implemented a version of this in Python. The code can be found on github. This version chooses exploratory steps with a probability of p_ex = 0.2.

Further questions

  • What is dynamic programming?
  • What are the combinatorics of TTT? (cf. the Wikipedia article)


Reed-Solomon Codes

While trying to understand QR codes, I came across many explanations of Reed-Solomon codes. These codes are used to encode the data in a QR code and make it more robust against wrong pixels. In particular, they allow for recovery from a certain number of errors. The best intro to Reed-Solomon codes I have found so far is this one.

Making a mess is part of being creative

Contrary to the Telegraph headline Having a messy desk makes you ‘more creative’, a messy desk probably does not make you more creative; rather, creative people produce more mess. Or, in other words, making a mess is definitely part of the creative process.

What do I mean by that? If you want to be creative, you will first have to play around with the tools you have at hand. During this process it is important to just try something without thinking too much about the ultimate goal. Part of the process is just to do something, see what happens, and then start over until, eventually, you arrive at something you like.

The part of making a mess might involve writing fragments of sentences onto a sheet of paper or typing them into a text editor. It might also involve drawing all sorts of funny ideas onto many sheets of paper, or even drawing the same thing over and over again while refining it in the process. Or, when writing software, it involves creating many different versions of the same software while refining the structure of the code.

Of course, at some point you might end up having to prepare a final version that you want to deploy. This is the least pleasant part: it’s when you need to file through all the ideas you’ve produced and pick the gems out of the dusty ruins. At this point you need to be able to remember what you did and roughly where you did it.

I think Christoph Niemann summarized this process quite nicely in Abstract: The Art of Design: you want to “be a much more ruthless editor and at the same time be a much more careless artist.” He also described it in this article for the New Yorker.

Thus, you want to iterate between being a careless, almost child-like artist on the one hand and, on the other, a ruthless editor who drives the iterative process in the right direction.

My favourite quotes by Isaac Newton

  1. If I have ever made any valuable discoveries, it has been owing more to patient attention, than to any other talent.
  2. If I have seen further it is by standing on the shoulders of giants.

  3. I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.

When your program takes too long, you get distracted

You should instead use the time to optimize your program or tools until these distractions cease.

Ultimately, you want to test your software until it yields the expected results, either by passing all tests or by returning the results of a measurement, for example. But sometimes your initially conceived software architecture, test set, or tools don’t live up to the challenge. By this time your program has become complex enough that it takes a while to run all the tests or to measure all your data.

And during this time the distraction devil sneaks in and wastes your time, when what you actually want to do is use that time to improve your measurement, or to evaluate the data and think about what else you would want to measure.

Therefore, you should instead work on refactoring your software and your tests, or on improving your tools. I have the impression that this is the part of a programmer’s work from which interesting open source tools emerge (e.g. editor plug-ins).

Whenever you catch yourself idling (i.e. reading random stuff on the Internet in time slots not designated for it), think about (and feel) where your own software or tools are limiting you. Start from scratch, improve your tools, or refactor your software until it does exactly what you need it to do.

If you’re too tired to do any of this, then take a break or do something that distracts your mind from the problem for a while. But do not sit in front of the computer pretending you’re working on a problem when actually that problem has defeated you and you’re escaping the truth by turning to some random distraction instead of returning to the battlefield. Release the pause button, get back in and fight 😉