Introduction to Reinforcement Learning

December 3, 2022


An example of reinforcement learning: the agent learns to perform a backflip optimally.
Source: Composing Value Functions in Reinforcement Learning

In this blogpost we will introduce the most important concepts in Reinforcement Learning and provide some real-world examples to show you how and when it is currently being used.

This series is based on the lectures that DeepMind held for UCL in 2020, which are freely available at this YouTube link.
I suggest you watch these videos, because in my opinion they are extremely clear. I chose to write this series of blogposts to offer a readable format for the content of DeepMind’s lectures.

Introduction

Reinforcement Learning is the branch of Computer Science that studies how to learn to perform tasks by modeling them as the maximization of a reward obtained by interacting with an environment.

This interaction can be summarized in the following diagram:


Reinforcement Learning entities diagram

Let’s describe it, introducing some terminology right away. In short, at each timestep \(t\) the agent

  1. receives an observation of the environment state \(s_t\), together with a reward \(r_t\) for its previous action, and
  2. chooses an action \(a_t\) to perform,

after which the environment transitions to a new state \(s_{t+1}\) and emits the next reward \(r_{t+1}\).
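
To make this loop concrete, here is a minimal Python sketch of the interaction. The `RandomAgent` class and the `env.reset()` / `env.step()` interface are hypothetical placeholders chosen for illustration, not part of any specific library:

```python
import random

class RandomAgent:
    """A placeholder agent that picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def select_action(self, observation):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """Run one episode of the agent-environment loop and return the cumulative reward."""
    observation = env.reset()                         # initial observation of the environment state
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.select_action(observation)     # the agent chooses an action
        observation, reward, done = env.step(action)  # the environment returns a new observation and a reward
        total_reward += reward
        if done:                                      # stop if the episode has ended
            break
    return total_reward
```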

This was a very high-level description of the core entities involved in the reinforcement learning problem. But what is it used for? Typical examples range from playing board and video games to controlling robots and optimizing industrial systems.

Now let’s delve into a lower level description, where I will introduce the mathematical terminology that will be used throughout the next episodes.

The Environment

It is a system in which you can perform actions; in response, it will change its state and return an observation and a reward.

This type of system is modeled as a Markov decision process, namely a mathematical framework that models decision making problems.
Look at the image below. It represents a graph with three nodes \((S_0, S_1, S_2)\).
From each state you can perform some actions that will lead you to another state with a certain probability. We associate a reward (the orange arrows) with some of these actions.

An example of a Markov Decision Process. Source: Wikipedia.

Rigorously, a Markov Decision Process is defined as a tuple \((S, A, P, R)\), where \(S\) is the set of states, \(A\) is the set of actions, \(P\) is the transition probability distribution, and \(R\) is the reward function (a discount factor \(\gamma\) is often included in the definition as well).
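
As a concrete illustration, a small MDP can be written down directly as Python dictionaries. The three states mirror the figure above, but the transition probabilities and rewards below are made up for this sketch; they are not the values from the Wikipedia example:

```python
# States and actions of a toy MDP.
states = ["S0", "S1", "S2"]
actions = ["a0", "a1"]

# P[s][a] is a list of (next_state, probability) pairs.
P = {
    "S0": {"a0": [("S0", 0.5), ("S2", 0.5)], "a1": [("S2", 1.0)]},
    "S1": {"a0": [("S0", 0.7), ("S1", 0.3)], "a1": [("S1", 0.95), ("S2", 0.05)]},
    "S2": {"a0": [("S2", 0.6), ("S0", 0.4)], "a1": [("S0", 0.3), ("S1", 0.3), ("S2", 0.4)]},
}

# R[(s, a, s')] is the reward for that transition; unlisted transitions give 0.
R = {("S1", "a0", "S0"): 5.0, ("S2", "a1", "S0"): -1.0}
```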

Moreover, the transition probability distribution \(P\) and the reward function \(R\) must satisfy the Markov property:

\[\begin{equation} \begin{split} P(s'|s_t, a_t) &= P(s'|\mathcal{H}_t, a_t) \\ R(s'|s_t, a_t) &= R(s'|\mathcal{H}_t, a_t) \end{split} \end{equation}\]

This means that

  1. the state \(s'\) into which we transition after performing action \(a_t\) from state \(s_t\)
  2. the obtained reward \(r\)

are independent of the history of the interaction (namely, the sequence of actions, observations, and rewards obtained so far), once the current state \(s_t\) and action \(a_t\) are known.

In other words, \(s_t\) contains all the information we need to choose what action to perform next.
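
We can see this concretely: a step function for the toy MDP above only needs the current state and action to sample the next state and reward, with no reference to the history. This sketch reuses the hypothetical `P` and `R` dictionaries from the earlier example:

```python
import random

def step(state, action):
    """Sample the next state and reward given only (state, action).

    Because the process is Markovian, the dictionaries P and R fully
    determine the dynamics; no history is needed.
    """
    next_states, probs = zip(*P[state][action])
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    reward = R.get((state, action, next_state), 0.0)
    return next_state, reward
```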

We have described extensively how we model an environment in Reinforcement Learning. It is important to note, though, that the agent may or may not have full knowledge of the environment state.

An example of a Partially Observable Environment

Imagine you are looking at a maze from above.

On the left, the entire labyrinth seen from above: the full environment state.
On the right, two potential state observations. The two observations are indistinguishable, but they refer to two different locations in the labyrinth.

These two agent states are not Markovian, because the only way to tell whether we are on the left side or the right side of the labyrinth is to look at the agent’s history and check what the observations were at previous timesteps.

Using the current observation alone, you could not construct a Markov agent state in this maze.

To deal with partial observability, an agent can construct a suitable state representation (for example, from the history of its observations) \(\to\) this is where the concepts of value function and action value function come in handy.
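
A small sketch can make both the problem and one possible workaround explicit. The encoding of the observations and the fixed-length history buffer below are illustrative assumptions, not a prescribed recipe:

```python
from collections import deque

# Two different maze locations can emit the same local observation
# (say, "walls above and below"), so the observation alone is not Markov.
obs_left_corridor = (1, 0, 1, 0)    # walls: up, right, down, left
obs_right_corridor = (1, 0, 1, 0)   # identical observation, different location
assert obs_left_corridor == obs_right_corridor

class HistoryState:
    """Build an agent state from the last k observations."""
    def __init__(self, k=4):
        self.buffer = deque(maxlen=k)

    def update(self, observation):
        self.buffer.append(observation)
        return tuple(self.buffer)   # the state representation used by the agent
```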

The Value Function

The value function is a function \(v: S \to \mathbb{R}\) that returns, for each state \(s \in S\), the expected cumulative reward the agent will obtain starting from that state.

\[\begin{equation} \begin{split} v(s) &= \mathbb{E}[G_t | S_t = s] \\ &= \mathbb{E}[R_{t+1} + R_{t+2} + R_{t+3} + \dots |S_t = s] \end{split} \end{equation}\]
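
Since this expectation is rarely available in closed form, a common way to approximate it is to average sampled returns. Here is a simple Monte Carlo sketch; the `env.reset(start_state=...)` / `env.step()` interface is an assumption of this example, and the return is undiscounted to match the formula above:

```python
def monte_carlo_value(env, policy, state, num_episodes=1000, max_steps=100):
    """Estimate v(state) by averaging the returns of episodes started in `state`."""
    returns = []
    for _ in range(num_episodes):
        observation = env.reset(start_state=state)    # start every episode in the state of interest
        episode_return = 0.0
        for t in range(max_steps):
            action = policy(observation)
            observation, reward, done = env.step(action)
            episode_return += reward                  # undiscounted sum, as in the formula above
            if done:
                break
        returns.append(episode_return)
    return sum(returns) / len(returns)
```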

The Policy

A policy is a function \(\pi: S \to A\) that, given a state \(s\), tells the agent which action \(a\) to perform; an optimal policy selects the action that leads to the highest expected cumulative reward.
A policy can be deterministic (each state is mapped to exactly one action) or stochastic. In the latter case, the output is a probability distribution over the action set \(A\).
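
In code, the two cases can be as simple as a plain lookup table versus a table of distributions. The action names and probabilities below are arbitrary illustration values for the toy MDP used earlier:

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"S0": "a1", "S1": "a0", "S2": "a0"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "S0": {"a0": 0.2, "a1": 0.8},
    "S1": {"a0": 0.9, "a1": 0.1},
    "S2": {"a0": 0.5, "a1": 0.5},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```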

The Q value

The action value function (or Q function) is a function \(q: S \times A \to \mathbb{R}\) that maps a state-action pair \((s, a)\) to the expected cumulative reward obtained by performing action \(a\) in state \(s\).

\[\begin{equation} \begin{split} q(s,a) &= \mathbb{E}[G_t|S_t=s, A_t=a] \\ &=\mathbb{E}[R_{t+1}+ R_{t+2}+ R_{t+3}+\dots | S_t=s, A_t=a] \end{split} \end{equation}\]
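
Once the Q values are known (or estimated), acting greedily means picking, in each state, the action with the highest value. The Q table below is made up for the toy MDP used earlier:

```python
# Made-up Q values for the toy MDP.
Q = {
    ("S0", "a0"): 1.2, ("S0", "a1"): 3.4,
    ("S1", "a0"): 5.1, ("S1", "a1"): 0.7,
    ("S2", "a0"): 2.0, ("S2", "a1"): 2.3,
}

def greedy_action(Q, state, actions):
    """Return the action with the highest estimated Q value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, "S1", ["a0", "a1"]))   # prints "a0"
```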

That’s all for this introductory blogpost! See you in the next one.
