Faculty of Mathematics and Physics

In recent years, **reinforcement learning** has been combined with **deep neural
networks**, giving rise to game agents with **super-human performance** (for example
for Go, chess, or 1v1 Dota2, capable of being trained solely by self-play),
datacenter cooling algorithms 50% more efficient than trained human
operators, and improved machine translation. The goal of the course is to
**introduce reinforcement learning employing deep neural networks**, focusing both
on the **theory** and on **practical implementations**.

**Python** programming skills and **TensorFlow** skills (or any other deep
learning framework) are required, to the extent of the
NPFL114 course. No previous
knowledge of reinforcement learning is necessary.

SIS code: NPFL122

Semester: winter

E-credits: 6

Examination: 2/2 C+Ex

Guarantor: Milan Straka

- **lecture**: the lecture is held on Monday 15:40 in S4; the first lecture is on **Oct 07**
- **practicals**: there are two parallel practicals, on Monday 17:20 in SU1 and on Tuesday 10:40 in SU2; the first practicals are on **Oct {07,08}**

1. Introduction to Reinforcement Learning: Slides, PDF Slides, `multiarmed_bandits`

2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods: Slides, PDF Slides, `policy_iteration`, `monte_carlo`

To pass the practicals, you need to obtain at least **80** points (excluding
the bonus points), which are awarded for home assignments. Note that up to
**40** points above 80 will be transferred to the exam.

To pass the exam, you need to obtain at least 60, 75, or 90 out of 100 points on the written exam (plus up to 40 points transferred from the practicals) to obtain grades 3, 2, and 1, respectively.

The lecture content, including references to study materials.

The main study material is Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.

References to study materials cover **all theory required** at the exam,
and sometimes even more – the references in *italics* cover topics
**not required** for the exam.

Oct 07: Slides, PDF Slides, `multiarmed_bandits`

- *History of RL [Chapter 1 of RLB]*
- Multi-armed bandits [Chapter 2 of RLB]

Oct 14: Slides, PDF Slides, `policy_iteration`, `monte_carlo`

- Markov Decision Process [Sections 3-3.3 of RLB]
- Policies and Value Functions [Sections 3.5-3.6 of RLB]
- Value Iteration [Sections 4 and 4.4 of RLB]
  - Proof of convergence only in slides
- Policy Iteration [Sections 4.1-4.3 of RLB]
- Generalized Policy Iteration [Section 4.6 of RLB]
- Monte Carlo Methods [Sections 5-5.4 of RLB]

The tasks are evaluated automatically using the ReCodEx Code Examiner. The evaluation is performed using Python 3.6, TensorFlow 2.0.0, NumPy 1.17.2 and OpenAI Gym 0.14.0. For those using PyTorch, CPU version 1.2.0 is available.

You can install TensorFlow and Gym either to user packages using
`pip3 install --user tensorflow==2.0.0 gym==0.14.0 scipy box2d-py atari-py`
(with the last three packages being optional dependencies of `gym`),
or create a virtual environment using `python3 -m venv VENV_DIR` and then install
the packages inside it by running `VENV_DIR/bin/pip3 install ...`.
On Windows, you can use third-party precompiled versions of box2d-py.

Working in teams of size 2 (or at most 3) is encouraged. All members of the team
must submit in ReCodEx individually, but can have exactly the same
sources/models/results. **However, each such solution must explicitly list all
members of the team to allow plagiarism detection using
this template.**

Deadline: Oct 20, 23:59 8 points

Perform a parameter study of various approaches to solving multiarmed bandits. For every hyperparameter choice, perform 100 episodes, each consisting of 1000 trials, and report the average and standard deviation of the 100 episode returns.

Start with the multiarmed_bandits.py template, which defines the
`MultiArmedBandits` environment. We use an API based on the OpenAI Gym
`Environment` class, notably the following two methods:

- `reset() → new_state`: starts a new episode
- `step(action) → new_state, reward, done, info`: performs the chosen action in the environment, returning the new state, the obtained reward, a boolean flag indicating the end of the episode, and additional environment-specific information

Of course, the states are not used by the multi-armed bandits (`None` is returned).
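The interaction pattern above can be sketched with a minimal stand-in environment. The `ToyBandits` class below is a made-up illustration of the same `reset`/`step` API (the real `MultiArmedBandits` class comes from the template):

```python
import numpy as np

class ToyBandits:
    """Illustrative stand-in for the template's MultiArmedBandits environment."""
    def __init__(self, means, episode_length=1000, seed=42):
        self._means = means
        self._episode_length = episode_length
        self._rng = np.random.RandomState(seed)
        self._trials = 0

    def reset(self):
        self._trials = 0
        return None  # bandits have no state

    def step(self, action):
        self._trials += 1
        reward = self._rng.normal(self._means[action], 1.0)
        done = self._trials >= self._episode_length
        return None, reward, done, {}

env = ToyBandits([0.1, 0.5, 0.9])
state, done, total = env.reset(), False, 0.0
while not done:
    action = 2  # a fixed action, just to exercise the API
    state, reward, done, info = env.step(action)
    total += reward
print(total / 1000)
```

The episode ends when `step` returns `done=True`; the agent only sees `reward`, never the true arm means.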

Your goal is to implement the following modes of calculation. In addition to submitting the solution to ReCodEx, you should use multiarmed_bandits_draw.py to plot the results in a graph.

- `greedy` *[2 points]*: perform $ε$-greedy search with parameter `epsilon`, computing the value function using averaging. (Results for $ε ∈ \{1/64, 1/32, 1/16, 1/8, 1/4\}$ are plotted.)
- `greedy` and `alpha` $≠0$ *[1 point]*: perform $ε$-greedy search with parameter `epsilon` and an initial function estimate of 0, using fixed learning rate `alpha`. (Results for $α=0.15$ and $ε ∈ \{1/64, 1/32, 1/16, 1/8, 1/4\}$ are plotted.)
- `greedy`, `alpha` $≠0$ and `initial` $≠0$ *[1 point]*: perform $ε$-greedy search with parameter `epsilon`, given the `initial` value as the starting value function and fixed learning rate `alpha`. (Results for `initial` $=1$, $α=0.15$ and $ε ∈ \{1/128, 1/64, 1/32, 1/16\}$ are plotted.)
- `ucb` *[2 points]*: perform UCB search with confidence level `c`, computing the value function using averaging. (Results for $c ∈ \{1/4, 1/2, 1, 2, 4\}$ are plotted.)
- `gradient` *[2 points]*: choose actions according to a softmax distribution, updating the parameters using SGD to maximize the expected reward. (Results for $α ∈ \{1/16, 1/8, 1/4, 1/2\}$ are plotted.)
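As one concrete example, the `greedy` mode with sample averaging might look roughly like the sketch below. It runs against a toy Gaussian bandit rather than the template's exact interface, and the function name and arm means are made up for illustration:

```python
import numpy as np

def epsilon_greedy_episode(true_means, epsilon, trials=1000, seed=0):
    """One episode of epsilon-greedy with sample-average value estimates."""
    rng = np.random.RandomState(seed)
    n_actions = len(true_means)
    Q = np.zeros(n_actions)   # value estimates
    N = np.zeros(n_actions)   # action counts
    total = 0.0
    for _ in range(trials):
        if rng.uniform() < epsilon:
            action = rng.randint(n_actions)   # explore
        else:
            action = int(np.argmax(Q))        # exploit
        reward = rng.normal(true_means[action], 1.0)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # incremental average
        total += reward
    return total

# 100 episodes of 1000 trials each, as the parameter study requires;
# report the average and standard deviation of the episode returns.
returns = [epsilon_greedy_episode([0.0, 0.3, 1.0], epsilon=1/8, seed=s)
           for s in range(100)]
print(np.mean(returns), np.std(returns))
```

The other modes change only the action selection (UCB bonus, softmax sampling) or the update rule (fixed `alpha` instead of the `1/N` average).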

Deadline: Nov 03, 23:59 5 points

Consider the following gridworld:

Start with policy_iteration.py, which implements the gridworld mechanics by providing the following methods:

- `GridWorld.states`: returns the number of states (`11`)
- `GridWorld.actions`: returns a list with the labels of the actions (`["↑", "→", "↓", "←"]`)
- `GridWorld.step(state, action)`: returns the possible outcomes of performing the `action` in a given `state`, as a list of triples containing
  - `probability`: probability of the outcome
  - `reward`: reward of the outcome
  - `new_state`: new state of the outcome

Implement the policy iteration algorithm with `--steps` steps of policy
evaluation/policy improvement. During policy evaluation, use the current value
function and perform `--iterations` applications of the Bellman equation.
Perform the policy evaluation synchronously (i.e., do not overwrite the current
value function when computing its improvement). Assume the initial policy is
“go North” and the initial value function is zero.

After the given number of steps and iterations, print the resulting value function and the resulting policy. For example, the output after 4 steps and 4 iterations should be:

```
9.15→ 10.30→ 11.32→ 12.33↑
8.12↑ 3.35← 2.58←
6.95↑ 5.90← 4.66← -4.93↓
```
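The synchronous evaluation/improvement loop could be sketched as follows. The two-state MDP here is made up purely for illustration; only the `step(state, action) → [(probability, reward, new_state), ...]` triple format mimics the `GridWorld.step` interface:

```python
import numpy as np

# A made-up 2-state, 2-action MDP in the GridWorld.step triple format.
def step(state, action):
    if action == 0:   # "stay": deterministic, reward 1 only in state 1
        return [(1.0, 1.0 if state == 1 else 0.0, state)]
    else:             # "move": switch states, no reward
        return [(1.0, 0.0, 1 - state)]

def policy_iteration(states=2, actions=2, gamma=0.9, steps=4, iterations=4):
    value = np.zeros(states)
    policy = np.zeros(states, dtype=int)   # initial policy: always action 0
    for _ in range(steps):
        # Policy evaluation: synchronous Bellman expectation backups,
        # i.e. the new values are computed from the old ones only.
        for _ in range(iterations):
            new_value = np.zeros(states)
            for s in range(states):
                new_value[s] = sum(p * (r + gamma * value[ns])
                                   for p, r, ns in step(s, policy[s]))
            value = new_value
        # Policy improvement: act greedily with respect to the current values.
        for s in range(states):
            returns = [sum(p * (r + gamma * value[ns])
                           for p, r, ns in step(s, a)) for a in range(actions)]
            policy[s] = int(np.argmax(returns))
    return value, policy

value, policy = policy_iteration()
print(value, policy)
```

On this toy MDP the loop settles on the optimal policy (move to state 1, then stay there); the real assignment applies the same structure to the 11-state gridworld.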

Deadline: Nov 03, 23:59 6 points

Solve the CartPole-v1 environment from the OpenAI Gym using the Monte Carlo reinforcement learning algorithm.

Use the supplied cart_pole_evaluator.py module (depending on gym_evaluator.py) to interact with the discretized environment. The environment has the following methods and properties:

- `states`: number of states of the environment
- `actions`: number of actions of the environment
- `episode`: number of the current episode (zero-based)
- `reset(start_evaluate=False) → new_state`: starts a new episode
- `step(action) → new_state, reward, done, info`: performs the chosen action in the environment, returning the new state, the obtained reward, a boolean flag indicating the end of the episode, and additional environment-specific information
- `render()`: renders the current environment state

Once you finish training (which you indicate by passing `start_evaluate=True`
to `reset`), your goal is to reach an average return of 490 during 100
evaluation episodes. Note that the environment prints your 100-episode
average return every 10 episodes even during training.

You can start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
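The shape of an every-visit Monte Carlo control loop with an $ε$-greedy policy might look like the sketch below. The one-state, one-step `TwoArmEnv` is invented purely so the sketch is runnable; the real solution plugs the same loop into the discretized CartPole environment:

```python
import numpy as np

class TwoArmEnv:
    """Made-up one-state episodic environment: one step, then done."""
    def __init__(self, seed=0):
        self._rng = np.random.RandomState(seed)
    def reset(self):
        return 0
    def step(self, action):
        reward = self._rng.normal([0.0, 1.0][action], 0.1)
        return 0, reward, True, {}

def monte_carlo_control(env, states=1, actions=2, episodes=500,
                        epsilon=0.1, gamma=1.0, seed=0):
    rng = np.random.RandomState(seed)
    Q = np.zeros((states, actions))
    counts = np.zeros((states, actions))
    for _ in range(episodes):
        # Generate an episode with the epsilon-greedy policy derived from Q.
        state, done, trajectory = env.reset(), False, []
        while not done:
            if rng.uniform() < epsilon:
                action = rng.randint(actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # Every-visit Monte Carlo update with sample-average returns,
        # walking the trajectory backwards to accumulate returns G.
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            counts[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / counts[state, action]
    return Q

Q = monte_carlo_control(TwoArmEnv())
print(Q)
```

For CartPole, the episodes are longer and `states` comes from the discretization, but the generate-episode-then-update structure is the same.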

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.