In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example in Go, chess, or 1v1 Dota 2, capable of being trained solely by self-play), datacenter cooling algorithms that are 50% more efficient than trained human operators, and improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.
Python programming skills and TensorFlow skills (or any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.
To pass the practicals, you need to obtain at least 80 points (excluding the bonus points), which are awarded for home assignments. Note that up to 40 points above 80 will be transferred to the exam.
To pass the exam, you need to obtain at least 60, 75, or 90 out of 100 points for the written exam (plus up to 40 points transferred from the practicals) to obtain grades 3, 2, or 1, respectively.
The lecture content, including references to study materials.
The main study material is Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available both online and as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
The tasks are evaluated automatically using the ReCodEx Code Examiner. The evaluation is performed using Python 3.6, TensorFlow 2.0.0, NumPy 1.17.2 and OpenAI Gym 0.14.0. For those using PyTorch, CPU version 1.2.0 is available.
You can install TensorFlow and Gym either into user packages using
pip3 install --user tensorflow==2.0.0 gym==0.14.0 scipy box2d-py atari-py
(with the last three packages being optional dependencies of gym), or create a virtual environment using python3 -m venv VENV_DIR and then install the packages inside it by running VENV_DIR/bin/pip3 install .... On Windows, you can use third-party precompiled versions of these packages.
Working in teams of size 2 (or at most 3) is encouraged. All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. However, each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
Deadline: Oct 20, 23:59 8 points
Perform a parameter study of various approaches to solving multiarmed bandits. For every hyperparameter choice, perform 100 episodes, each consisting of 1000 trials, and report the average and standard deviation of the 100 episode returns.
reset() → new_state: starts a new episode
step(action) → new_state, reward, done, info: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating the end of the episode, and additional environment-specific information. Of course, the states are not used by the multi-armed bandits.
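The interaction loop implied by this interface can be sketched as follows. The MultiArmedBandits class below is a hypothetical stand-in for the course environment (its constructor arguments and reward distribution are assumptions for the example), implementing only the reset()/step() interface described above:

```python
import numpy as np

class MultiArmedBandits:
    """Hypothetical stand-in for the course environment, implementing
    only the reset()/step() interface described above."""
    def __init__(self, bandits=10, seed=42):
        self._generator = np.random.RandomState(seed)
        # Each arm has a fixed mean drawn from a standard normal.
        self._means = self._generator.normal(0.0, 1.0, size=bandits)

    def reset(self):
        # Starts a new episode; states are not used by the bandits.
        return None

    def step(self, action):
        # Reward is drawn from a unit-variance normal around the arm's mean.
        reward = self._generator.normal(self._means[action], 1.0)
        return None, reward, False, {}

env = MultiArmedBandits()
state = env.reset()
state, reward, done, info = env.step(3)
```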
Your goal is to implement the following modes of calculation. In addition to submitting the solution to ReCodEx, you should use multiarmed_bandits_draw.py to plot the results in a graph.
greedy [2 points]: perform ε-greedy search with parameter epsilon, computing the value function by averaging the observed returns. (Results for several values of epsilon are plotted.)
alpha [1 point]: perform ε-greedy search with parameter epsilon and an initial value-function estimate of 0, using a fixed learning rate alpha. (Results for several combinations of epsilon and alpha are plotted.)
initial [1 point]: perform ε-greedy search with parameter epsilon, using the initial value as the starting value function and a fixed learning rate alpha. (Results for several combinations of initial, epsilon, and alpha are plotted.)
ucb [2 points]: perform UCB search with confidence level c, computing the value function by averaging the observed returns. (Results for several values of c are plotted.)
gradient [2 points]: choose actions according to a softmax distribution, updating the parameters using SGD to maximize the expected reward. (Results for several learning rates are plotted.)
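As an illustration of the first mode, here is a sketch of ε-greedy action selection with sample-average value estimates. The env_step callable, the arm means, and all parameter values below are assumptions for the example, not part of the assignment API:

```python
import numpy as np

def epsilon_greedy_bandit(env_step, n_actions, epsilon, trials, rng):
    """One episode of ε-greedy search with sample-average value estimates.

    env_step(action) -> reward is a hypothetical callable standing in
    for env.step; only the reward is needed by the bandit."""
    Q = np.zeros(n_actions)   # value estimates
    N = np.zeros(n_actions)   # action counts
    total = 0.0
    for _ in range(trials):
        if rng.uniform() < epsilon:
            action = rng.randint(n_actions)      # explore
        else:
            action = int(np.argmax(Q))           # exploit
        reward = env_step(action)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # incremental average
        total += reward
    return total / trials

# Hypothetical usage with three arms of known means.
rng = np.random.RandomState(42)
arm_means = np.array([0.1, 0.9, 0.4])
avg_return = epsilon_greedy_bandit(
    lambda a: rng.normal(arm_means[a], 1.0), 3, 0.1, 1000, rng)
```

The incremental update keeps Q equal to the running average of rewards per arm without storing the full reward history.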
Deadline: Nov 03, 23:59 5 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics by providing the following methods:
GridWorld.states: return the number of states
GridWorld.actions: return a list with the labels of the actions (["↑", "→", "↓", "←"])
GridWorld.step(state, action): return the possible outcomes of performing the action in a given state, as a list of triples containing
probability: probability of the outcome
reward: reward of the outcome
new_state: new state of the outcome
Implement the policy iteration algorithm, with --steps steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform --iterations applications of the Bellman equation. Perform the policy evaluation synchronously (i.e., do not overwrite the current value function when computing its improvement). Assume the initial policy is “go North” and the initial value function is zero.
After the given number of steps and iterations, print the resulting value function and the resulting policy. For example, the output after 4 steps and 4 iterations should be:
    9.15→   10.30→   11.32→   12.33↑
    8.12↑            3.35←    2.58←
    6.95↑    5.90←   4.66←   -4.93↓
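The synchronous policy evaluation step can be sketched as follows. The tiny_step function and the two-state example are hypothetical, standing in for GridWorld.step:

```python
import numpy as np

def evaluate_policy(V, policy, step, states, gamma, iterations):
    """Synchronous policy evaluation: each sweep computes a fresh value
    function from the previous one instead of updating in place.

    step(state, action) is assumed to return a list of
    (probability, reward, new_state) triples, as described above."""
    for _ in range(iterations):
        new_V = np.zeros_like(V)  # do not overwrite V mid-sweep
        for s in range(states):
            for probability, reward, new_state in step(s, policy[s]):
                new_V[s] += probability * (reward + gamma * V[new_state])
        V = new_V
    return V

# Hypothetical two-state example: state 0 yields reward 1 and moves to
# state 1; state 1 is absorbing with zero reward.
def tiny_step(state, action):
    return [(1.0, 1.0, 1)] if state == 0 else [(1.0, 0.0, 1)]

V = evaluate_policy(np.zeros(2), [0, 0], tiny_step, 2,
                    gamma=0.5, iterations=20)
```

Because the sweep reads only the previous value function, the order in which states are visited does not affect the result.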
Deadline: Nov 03, 23:59 6 points
states: number of states of the environment
actions: number of actions of the environment
episode: number of the current episode (zero-based)
reset(start_evaluate=False) → new_state: starts a new episode
step(action) → new_state, reward, done, info: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating the end of the episode, and additional environment-specific information
render(): render current environment state
Once you finish training (which you indicate by passing start_evaluate=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes, even during training.
You can start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
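Under these interface assumptions, the training loop could look like the following sketch of every-visit Monte Carlo control with ε-greedy exploration. TinyEnv and all hyperparameter values are hypothetical illustrations, not the environment created by the template:

```python
import numpy as np

def monte_carlo_control(env, states, actions, episodes, epsilon, gamma, rng):
    """Sketch of every-visit Monte Carlo control with ε-greedy exploration,
    assuming a tabular env with the reset()/step() interface above."""
    Q = np.zeros((states, actions))  # action-value estimates
    C = np.zeros((states, actions))  # visit counts for averaging
    for _ in range(episodes):
        # Generate one episode with the current ε-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.uniform() < epsilon:
                action = rng.randint(actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Update Q with the observed returns, averaging over all visits.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            C[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / C[state, action]
    return Q

# Hypothetical one-state environment: action 1 pays 1, action 0 pays 0,
# and every episode ends after a single step.
class TinyEnv:
    def reset(self):
        return 0
    def step(self, action):
        return 0, float(action), True, {}

Q = monte_carlo_control(TinyEnv(), states=1, actions=2, episodes=200,
                        epsilon=0.3, gamma=1.0, rng=np.random.RandomState(1))
```

For the actual assignment, the continuous observations would additionally need to be discretized into the tabular states the template provides.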