# Deep Reinforcement Learning – Winter 2019/20

In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example in Go, chess, or 1v1 Dota2, trained solely by self-play), datacenter cooling algorithms 50% more efficient than trained human operators, and improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.

Python programming skills and TensorFlow skills (or skills in any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.

SIS code: NPFL122
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Guarantor: Milan Straka

### Timespace Coordinates

• lecture: the lecture is held on Monday 15:40 in S4; first lecture is on Oct 07
• practicals: there are two parallel practicals, on Monday 17:20 in SU1 and on Tuesday 10:40 in SU2; first practicals are on Oct {07,08}

### Requirements

To pass the practicals, you need to obtain at least 80 points (excluding the bonus points), which are awarded for home assignments. Note that up to 40 points above 80 will be transferred to the exam.

To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals), to obtain grades 3, 2 and 1, respectively.

The lecture content below includes references to study materials.

The main study material is Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.

References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.

### 1. Introduction to Reinforcement Learning

• History of RL [Chapter 1 of RLB]
• Multi-armed bandits [Chapter 2 of RLB]

### 2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods

• Markov Decision Process [Sections 3-3.3 of RLB]
• Policies and Value Functions [Sections 3.5-3.6 of RLB]
• Value Iteration [Sections 4 and 4.4 of RLB]
• Proof of convergence only in slides
• Policy Iteration [Sections 4.1-4.3 of RLB]
• Generalized Policy Iteration [Section 4.6 of RLB]
• Monte Carlo Methods [Sections 5-5.4 of RLB]
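As a quick reminder of the material referenced above, the value iteration backup (RLB Section 4.4) repeatedly applies the Bellman optimality equation:

```latex
V_{k+1}(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_k(s')\bigr]
```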

The tasks are evaluated automatically using the ReCodEx Code Examiner. The evaluation is performed using Python 3.6, TensorFlow 2.0.0, NumPy 1.17.2 and OpenAI Gym 0.14.0. For those using PyTorch, CPU version 1.2.0 is available.

You can install TensorFlow and Gym either into user packages using `pip3 install --user tensorflow==2.0.0 gym==0.14.0 scipy box2d-py atari-py` (with the last three packages being optional dependencies of gym), or create a virtual environment using `python3 -m venv VENV_DIR` and then install the packages inside it by running `VENV_DIR/bin/pip3 install ...`. On Windows, you can use third-party precompiled versions of box2d-py.

### Teamwork

Working in teams of size 2 (or at most 3) is encouraged. All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. However, each such solution must explicitly list all members of the team to allow plagiarism detection using this template.

### multiarmed_bandits

Perform a parameter study of various approaches to solving multiarmed bandits. For every hyperparameter choice, perform 100 episodes, each consisting of 1000 trials, and report the average and standard deviation of the 100 episode returns.

Start with the multiarmed_bandits.py template, which defines the MultiArmedBandits environment. We use an API based on the OpenAI Gym Environment class, notably the following two methods:

• reset() → new_state: starts a new episode
• step(action) → new_state, reward, done, info: performs the chosen action in the environment, returning the new state, the obtained reward, a boolean flag indicating the end of an episode, and additional environment-specific information. Of course, the states are not used by the multiarmed bandits (None is returned).
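The interaction loop implied by these two methods can be sketched as follows. The `ToyBandits` class is a hypothetical stand-in for the template's MultiArmedBandits environment (its names and Gaussian reward model are assumptions, not the template's actual code):

```python
import random

class ToyBandits:
    """Hypothetical stand-in mimicking the MultiArmedBandits API."""
    def __init__(self, bandits=10, episode_length=1000, seed=42):
        self._rng = random.Random(seed)
        self._bandits = bandits
        self._episode_length = episode_length

    def reset(self):
        # Draw fresh true arm means and reset the trial counter.
        self._means = [self._rng.gauss(0, 1) for _ in range(self._bandits)]
        self._trials = 0
        return None  # bandits have no state

    def step(self, action):
        self._trials += 1
        reward = self._rng.gauss(self._means[action], 1)
        done = self._trials >= self._episode_length
        return None, reward, done, {}

env = ToyBandits()
state, done = env.reset(), False
total = 0.0
while not done:
    action = 0  # a real agent would choose according to its policy
    state, reward, done, _ = env.step(action)
    total += reward
```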

Your goal is to implement the following modes of calculation. In addition to submitting the solution to ReCodEx, you should use multiarmed_bandits_draw.py to plot the results in a graph.

• greedy [2 points]: perform $ε$-greedy search with parameter epsilon, computing the value function using averaging. (Results for $ε ∈ \{1/64, 1/32, 1/16, 1/8, 1/4\}$ are plotted.)
• greedy and alpha$≠0$ [1 point]: perform $ε$-greedy search with parameter epsilon and initial function estimate of 0, using fixed learning rate alpha. (Results for $α=0.15$ and $ε ∈ \{1/64, 1/32, 1/16, 1/8, 1/4\}$ are plotted.)
• greedy, alpha$≠0$ and initial$≠0$ [1 point]: perform $ε$-greedy search with parameter epsilon, given initial value as starting value function and fixed learning rate alpha. (Results for initial$=1$, $α=0.15$ and $ε ∈ \{1/128, 1/64, 1/32, 1/16\}$ are plotted.)
• ucb [2 points]: perform UCB search with confidence level c and computing the value function using averaging. (Results for $c ∈ \{1/4, 1/2, 1, 2, 4\}$ are plotted.)
• gradient [2 points]: choose actions according to softmax distribution, updating the parameters using SGD to maximize expected reward. (Results for $α ∈ \{1/16, 1/8, 1/4, 1/2\}$ are plotted.)
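To illustrate the first mode, here is a minimal sketch of ε-greedy search with sample averaging, written against the reset/step interface described above. The function and the `_ConstantBandits` demo environment are illustrative assumptions, not the template's code:

```python
import random

def epsilon_greedy_bandit(env, actions, epsilon, episode_length=1000, seed=0):
    """One episode of ε-greedy action selection with sample averaging."""
    rng = random.Random(seed)
    Q = [0.0] * actions   # action-value estimates
    N = [0] * actions     # per-action visit counts
    env.reset()
    episode_return = 0.0
    for _ in range(episode_length):
        if rng.random() < epsilon:
            action = rng.randrange(actions)                   # explore
        else:
            action = max(range(actions), key=lambda a: Q[a])  # exploit
        _, reward, done, _ = env.step(action)
        episode_return += reward
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # incremental average
        if done:
            break
    return episode_return

class _ConstantBandits:
    """Tiny deterministic environment used only to exercise the function."""
    def __init__(self, rewards): self._rewards = rewards
    def reset(self): return None
    def step(self, action): return None, self._rewards[action], False, {}

ret = epsilon_greedy_bandit(_ConstantBandits([0.0, 1.0]), actions=2, epsilon=0.1)
```

The fixed-learning-rate variant only changes the update to `Q[action] += alpha * (reward - Q[action])`.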

### policy_iteration

Deadline: Nov 03, 23:59  5 points

Consider the following gridworld:

Start with policy_iteration.py, which implements the gridworld mechanics, by providing the following methods:

• GridWorld.states: returns the number of states (11)
• GridWorld.actions: returns a list with the labels of the actions (["↑", "→", "↓", "←"])
• GridWorld.step(state, action): returns the possible outcomes of performing the action in the given state, as a list of triples containing
  • probability: probability of the outcome
  • reward: reward of the outcome
  • new_state: new state of the outcome

Implement the policy iteration algorithm, with --steps steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform --iterations applications of the Bellman equation. Perform the policy evaluation synchronously (i.e., do not overwrite the current value function when computing its improvement). Assume the initial policy is “go North” and the initial value function is zero.
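The algorithm can be sketched as follows, written against the `step(state, action) → [(probability, reward, new_state), …]` interface described above. This is a minimal illustration, not the assignment solution; the `_toy_step` two-state MDP is a made-up example:

```python
def policy_iteration(states, actions, step, gamma=1.0, steps=4, iterations=4):
    """Synchronous policy iteration sketch.

    `step(state, action)` returns a list of (probability, reward, new_state)
    triples, mirroring the GridWorld.step interface described above.
    """
    value = [0.0] * states
    policy = [0] * states  # initial policy: always the first action
    for _ in range(steps):
        # Policy evaluation: `iterations` synchronous Bellman backups.
        for _ in range(iterations):
            value = [
                sum(p * (r + gamma * value[s2]) for p, r, s2 in step(s, policy[s]))
                for s in range(states)
            ]
        # Policy improvement: act greedily with respect to the current values.
        policy = [
            max(range(actions),
                key=lambda a: sum(p * (r + gamma * value[s2])
                                  for p, r, s2 in step(s, a)))
            for s in range(states)
        ]
    return value, policy

def _toy_step(state, action):
    # Made-up 2-state MDP: in state 0, action 1 moves to the absorbing
    # state 1 with reward 1; everything else yields reward 0.
    if state == 1 or action == 0:
        return [(1.0, 0.0, state)]
    return [(1.0, 1.0, 1)]

value, policy = policy_iteration(states=2, actions=2, step=_toy_step, gamma=0.9)
```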

After given number of steps and iterations, print the resulting value function and resulting policy. For example, the output after 4 steps and 4 iterations should be:

    9.15→   10.30→   11.32→   12.33↑
    8.12↑            3.35←    2.58←
    6.95↑    5.90←   4.66←   -4.93↓


### monte_carlo

Deadline: Nov 03, 23:59  6 points

Solve the CartPole-v1 environment from the OpenAI Gym using the Monte Carlo reinforcement learning algorithm.

Use the supplied cart_pole_evaluator.py module (depending on gym_evaluator.py) to interact with the discretized environment. The environment has the following methods and properties:

• states: number of states of the environment
• actions: number of actions of the environment
• episode: number of the current episode (zero-based)
• reset(start_evaluate=False) → new_state: starts a new episode
• step(action) → new_state, reward, done, info: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating an end of episode, and additional environment-specific information
• render(): render current environment state

Once you finish training (which you indicate by passing start_evaluate=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes even during training.

You can start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
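The core of a possible solution is every-visit Monte Carlo control with an ε-greedy policy, sketched below against the reset/step interface described above. The function and the `_OneShot` demo environment are illustrative assumptions, not the supplied evaluator or template code:

```python
import random

def monte_carlo(env, states, actions, episodes=100, epsilon=0.1, gamma=1.0, seed=0):
    """Every-visit Monte Carlo control with an ε-greedy policy (a sketch)."""
    rng = random.Random(seed)
    Q = [[0.0] * actions for _ in range(states)]
    N = [[0] * actions for _ in range(states)]
    for _ in range(episodes):
        # Generate one episode with the current ε-greedy policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(actions)
            else:
                action = max(range(actions), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # Update Q by averaging returns computed backwards along the episode.
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q

class _OneShot:
    """Made-up one-step environment: action 1 yields reward 1, action 0 yields 0."""
    def reset(self): return 0
    def step(self, action): return 0, float(action), True, {}

Q = monte_carlo(_OneShot(), states=1, actions=2, episodes=200, epsilon=0.5)
```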

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.