# Deep Reinforcement Learning – Winter 2020/21

In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example for Go, chess, or 1v1 Dota2, capable of being trained solely by self-play), datacenter cooling algorithms being 50% more efficient than trained human operators, or improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.

Python programming skills and TensorFlow skills (or any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.

SIS code: NPFL122
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Guarantor: Milan Straka

### Timespace Coordinates

• lecture: the lecture is held on Monday 15:40; first lecture is on Oct 05
• practicals: the practicals take place on Wednesday 9:00; first practicals are on Oct 07

### Lectures

The lecture content, including references to study materials.

The main study material is the Reinforcement Learning: An Introduction; second edition by Richard S. Sutton and Andrew G. Barto (reffered to as RLB). It is available online and also as a hardcopy.

References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.

### 1. Introduction to Reinforcement Learning

• History of RL [Chapter 1 of RLB]
• Multi-armed bandits [Sections 2-2.6 of RLB]
• Markov Decision Process [Sections 3-3.3 of RLB]

### 2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods

• Policies and Value Functions [Sections 3.5-3.6 of RLB]
• Value Iteration [Sections 4 and 4.4 of RLB]
• Proof of convergence only in slides
• Policy Iteration [Sections 4.1-4.3 of RLB]
• Generalized Policy Iteration [Section 4.6 or RLB]
• Monte Carlo Methods [Sections 5-5.4 of RLB]

### 3. Temporal Difference Methods, Off-Policy Methods

• Model-free and model-based methods, using state-value or action-value functions [Chapter 8 before Section 8.1, and Section 6.8 of RLB]
• Temporal-difference methods [Sections 6-6.3 of RLB]
• Sarsa [Section 6.4 of RLB]
• Q-learning [Section 6.5 of RLB]
• Off-policy Monte Carlo Methods [Sections 5.5-5.7 of RLB]
• Expected Sarsa [Section 6.6 of RLB]
• N-step TD policy evaluation [Section 7.1 of RLB]
• N-step Sarsa [Section 7.2 of RLB]
• Off-policy n-step Sarsa [Section 7.3 of RLB]
• Tree backup algorithm [Section 7.5 of RLB]

### Requirements

To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 (including the bonus points) will be transfered to the exam.

### Environment

The tasks are evaluated automatically using the ReCodEx Code Examiner.

The evaluation is performed using Python 3.8, TensorFlow 2.3.1, TensorFlow Probability 0.11.1, NumPy 1.18.5, OpenAI Gym 0.17.2 and Box2D 2.3.10. You should install the exact version of these packages yourselves. For those using PyTorch, 1.6.0 is also available.

### Teamwork

Solving assignments in teams of size 2 or 3 is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.

### bandits

Deadline: Oct 20, 23:59  4 points

Implement the $ε$-greedy strategy for solving multi-armed bandits.

Start with the bandits.py template, which defines MultiArmedBandits environment, which has the following two methods:

• reset(): reset the environment
• step(action) → reward: perform the chosen action in the environment, obtaining a reward
• greedy(epsilon): return True with probability 1-epsilon

Your goal is to implement the following solution variants:

• alpha$=0$: perform $ε$-greedy search, updating the estimated using averaging.
• alpha$≠0$: perform $ε$-greedy search, updating the estimated using a fixed learning rate alpha.

Note that the initial estimates should be set to a given value and epsilon can be zero, in which case purely greedy actions are used.

Please note that the results are stochastic, so your results may differ slightly.

• python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08

• python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22

• python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09

• python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04


### monte_carlo

Deadline: Oct 20, 23:59  6 points

Solve the CartPole-v1 environment environment from the OpenAI Gym using the Monte Carlo reinforcement learning algorithm. The gym environments have the followng methods and properties:

• observation_space: the description of environment observations
• action_space: the description of environment actions
• reset() → new_state: starts a new episode
• step(action) → new_state, reward, done, info: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating an end of episode, and additional environment-specific information
• render(): render current environment state

We additionaly extend the gym environment by:

• episode: number of the current episode (zero-based)
• reset(start_evaluation=False) → new_state: if start_evaluation is True, an evaluation is started

Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return each 10 episodes even during training.

Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

### policy_iteration

Deadline: Oct 27, 23:59  4 points

Consider the following gridworld:

Start with policy_iteration.py, which implements the gridworld mechanics, by providing the following methods:

• GridWorld.states: return number of states (11)
• GridWorld.actions: return lists with labels of the actions (["↑", "→", "↓", "←"])
• GridWorld.step(state, action): return possible outcomes of performing the action in a given state, as a list of triples containing
• probability: probability of the outcome
• reward: reward of the outcome
• new_state: new state of the outcome

Implement policy iteration algorithm, with --steps steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform --iterations applications of the Bellman equation. Perform the policy evaluation synchronously (i.e., do not overwrite the current value function when computing its improvement). Assume the initial policy is “go North” and initial value function is zero.

Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).

• python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
    0.00↑    0.00↑    0.00↑    0.00↑
0.00↑           -10.00←  -10.00↑
0.00↑    0.00→    0.10←  -79.90←

• python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
    0.00↑    0.00↑    0.00↑    0.00↑
0.00↑            -7.59←  -11.90←
0.00→    0.08←   -0.94←  -18.36←

• python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
    0.00↑    0.00↑    0.00↑    0.00↑
0.00↓            -5.86←   -7.41←
0.06↓    0.01←   -0.75←  -13.49↓

• python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
    0.04↓    0.04←    0.01↑    0.00↑
0.04↓            -0.95←   -1.00←
0.04↓    0.04←   -0.10→   -0.52↓

• python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
   11.79↓   11.03←   10.31←    6.54↑
12.69↓            10.14←    9.95←
13.56→   14.59→   15.58→   16.26↓

• python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
   66.54↓   65.53←   64.42←   56.34↑
67.68↓            63.58←   62.97←
68.69→   69.83→   70.84→   71.75↓


### policy_iteration_exact

Deadline: Oct 27, 23:59  2 points

Starting with policy_iteration_exact.py, extend the policy_iteration assignment to perform policy evaluation exactly by solving a system of linear equations.

Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).

• python3 policy_iteration_exact.py --gamma=0.95 --steps=1
   -0.00→    0.00→    0.00↑    0.00↑
-0.00↑           -12.35←  -12.35↑
-0.85←   -8.10←  -19.62← -100.71←

• python3 policy_iteration_exact.py --gamma=0.95 --steps=2
    0.00→    0.00→    0.00→    0.00↑
0.00↑             0.00←  -11.05←
-0.00↑   -0.00→    0.00←  -12.10↓

• python3 policy_iteration_exact.py --gamma=0.95 --steps=3
   -0.00↓   -0.00←   -0.00↓   -0.00↑
-0.00↑             0.00←    0.69←
-0.00←   -0.00←   -0.00→    6.21↓

• python3 policy_iteration_exact.py --gamma=0.95 --steps=8
   12.12↓   11.37←    9.19←    6.02↑
13.01↓             9.92←    9.79←
13.87→   14.89→   15.87→   16.60↓

• python3 policy_iteration_exact.py --gamma=0.95 --steps=9
   12.24↓   11.49←   10.76←    7.05↑
13.14↓            10.60←   10.42←
14.01→   15.04→   16.03→   16.71↓

• python3 policy_iteration_exact.py --gamma=0.95 --steps=10
   12.24↓   11.49←   10.76←    7.05↑
13.14↓            10.60←   10.42←
14.01→   15.04→   16.03→   16.71↓


### policy_iteration_exploring_mc

Deadline: Oct 27, 23:59  2 points

Starting with policy_iteration_exploring_mc.py, extend the policy_iteration assignment to perform policy evaluation by using Monte Carlo estimation with exploring starts.

The estimation can now be performed model-free (without the access to the full MDP dynamics), therefore, the GridWorld.step returns a randomly sampled result instead of a full distribution.

Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).

• python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=1
    0.00↑    0.00↑    0.00↑    0.00↑
0.00↑             0.00↑    0.00↑
0.00↑    0.00→    0.00↑    0.00↓

• python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=10
    0.00↑    0.00↑    0.00↑    0.00↑
0.00↑             0.00↑  -19.50↑
0.27↓    0.48←    2.21↓    8.52↓

• python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=50
    0.09↓    0.32↓    0.22←    0.15↑
0.18↑            -2.43←   -5.12↓
0.18↓    1.80↓    3.90↓    9.14↓

• python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=100
    3.09↓    2.42←    2.39←    1.17↑
3.74↓             1.66←    0.18←
3.92→    5.28→    7.16→   11.07↓

• python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=200
    7.71↓    6.76←    6.66←    3.92↑
8.27↓             6.17←    5.31←
8.88→   10.12→   11.36→   13.92↓


### policy_iteration_greedy_mc

Deadline: Oct 27, 23:59  2 points

Starting with policy_iteration_greedy_mc.py, extend the policy_iteration_exploring_mc assignment to perform policy evaluation by using $ε$-greedy Monte Carlo estimation.

For the sake of replicability, use the provided GridWorld.epsilon_greedy(epsilon, greedy_action) method, which returns a random action with probability of epsilon and otherwise returns the given greedy_action.

Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).

• python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=1
    0.00↑    0.00↑    0.00↑    0.00↑
0.00↑             0.00→    0.00→
0.00↑    0.00↑    0.00→    0.00→

• python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=10
   -1.20↓   -1.43←    0.00←   -6.00↑
0.78→           -20.26↓    0.00←
0.09←    0.00↓   -9.80↓   10.37↓

• python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=50
   -0.16↓   -0.19←    0.56←   -6.30↑
0.13→            -6.99↓   -3.51↓
0.01←    0.00←    3.18↓    7.57↓

• python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=100
   -0.07↓   -0.09←    0.28←   -4.66↑
0.06→            -5.04↓   -8.32↓
0.00←    0.00←    1.70↓    4.38↓

• python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=200
   -0.04↓   -0.04←   -0.76←   -4.15↑
0.03→            -8.02↓   -5.96↓
0.00←    0.00←    2.53↓    4.36↓

• python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=500
   -0.02↓   -0.02←   -0.65←   -3.52↑
0.01→           -11.34↓   -8.07↓
0.00←    0.00←    3.15↓    3.99↓


### importance_sampling

Deadline: Nov 03, 23:59  2 points

Using the FrozenLake-v0 environment environment, implement Monte Carlo weighted importance sampling to estimate state value function $V$ of target policy, which uniformly chooses either action 1 (down) or action 2 (right), utilizing behaviour policy, which uniformly chooses among all four actions.

Start with the importance_sampling.py template, which creates the environment and generates episodes according to behaviour policy.

Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).

• python3 importance_sampling.py --episodes=500
 0.00  0.00  0.00  0.00
0.03  0.00  0.00  0.00
0.22  0.14  0.29  0.00
0.00  0.50  1.00  0.00

• python3 importance_sampling.py --episodes=5000
 0.00  0.01  0.02  0.00
0.00  0.00  0.08  0.00
0.06  0.08  0.17  0.00
0.00  0.19  0.89  0.00

• python3 importance_sampling.py --episodes=50000
 0.02  0.01  0.04  0.01
0.03  0.00  0.06  0.00
0.08  0.17  0.24  0.00
0.00  0.34  0.78  0.00


### q_learning

Deadline: Nov 03, 23:59  6 points

Solve the MountainCar-v0 environment environment from the OpenAI Gym using the Q-learning reinforcement learning algorithm. Note that this task does not require TensorFlow.

The environment methods and properties are described in the monte_carlo assignment. Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of -150 during 100 evaluation episodes.

You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting hyperparameters of Q-learning is a bit tricky – I usualy start with a larger value of $ε$ (like 0.2 or even 0.5) an then gradually decrease it to almost zero.

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.

### lunar_lander

Deadline: Nov 03, 23:59  7 points + 7 bonus

Solve the LunarLander-v2 environment environment from the OpenAI Gym. Note that this task does not require TensorFlow.

The environment methods and properties are described in the monte_carlo assignment, but include one additional method:

• expert_trajectory() → initial_state, trajectory This method generates one expert trajectory and returns a pair of initial_state and trajectory, where trajectory is a list of the tripples (action, reward, next_state). You can use this method only during training, not during evaluation.

To pass the task, you need to reach an average return of 0 during 100 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

The task is additionally a competition and at most 7 points will be awarded according to relative ordering of your solution performances.

You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.

### td_algorithms

Deadline: Nov 10, 23:59 Nov 17, 23:59  5 points

Starting with the td_algorithms.py template, implement all of the following $n$-step TD methods variants:

• SARSA, expected SARSA and Tree backup;
• either on-policy (with $ε$-greedy behaviour policy) or off-policy (with the same behaviour policy, but greedy target policy).

This assignment is new, so if you think you have implemented everything correctly and it does not pass, do not hesitate to write me.

Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).

• python3 td_algorithms.py --mode=sarsa --n=1
Episode 100, mean 100-episode return -360.33 +-184.47
Episode 200, mean 100-episode return -224.77 +-98.00
Episode 300, mean 100-episode return -182.97 +-100.50
Episode 400, mean 100-episode return -138.60 +-96.61
Episode 500, mean 100-episode return -106.91 +-82.49
Episode 600, mean 100-episode return -88.05 +-79.26
Episode 700, mean 100-episode return -57.03 +-53.84
Episode 800, mean 100-episode return -33.09 +-54.11
Episode 900, mean 100-episode return -30.81 +-45.75
Episode 1000, mean 100-episode return -21.21 +-35.98

• python3 td_algorithms.py --mode=sarsa --n=1 --off_policy
Episode 100, mean 100-episode return -368.15 +-184.96
Episode 200, mean 100-episode return -207.59 +-101.46
Episode 300, mean 100-episode return -170.73 +-100.77
Episode 400, mean 100-episode return -143.05 +-97.48
Episode 500, mean 100-episode return -93.66 +-90.10
Episode 600, mean 100-episode return -66.25 +-68.43
Episode 700, mean 100-episode return -38.15 +-56.84
Episode 800, mean 100-episode return -25.82 +-44.67
Episode 900, mean 100-episode return -18.04 +-37.85
Episode 1000, mean 100-episode return -14.56 +-34.09

• python3 td_algorithms.py --mode=sarsa --n=4
Episode 100, mean 100-episode return -516.63 +-256.06
Episode 200, mean 100-episode return -205.93 +-160.51
Episode 300, mean 100-episode return -169.65 +-165.20
Episode 400, mean 100-episode return -68.71 +-131.53
Episode 500, mean 100-episode return -15.79 +-45.34
Episode 600, mean 100-episode return -8.01 +-38.65
Episode 700, mean 100-episode return -6.21 +-30.64
Episode 800, mean 100-episode return -5.69 +-16.12
Episode 900, mean 100-episode return 0.68 +-8.99
Episode 1000, mean 100-episode return -1.56 +-10.94

• python3 td_algorithms.py --mode=sarsa --n=4 --off_policy
Episode 100, mean 100-episode return -524.26 +-195.11
Episode 200, mean 100-episode return -345.41 +-181.73
Episode 300, mean 100-episode return -286.07 +-165.83
Episode 400, mean 100-episode return -249.51 +-187.19
Episode 500, mean 100-episode return -112.83 +-158.33
Episode 600, mean 100-episode return -80.56 +-145.49
Episode 700, mean 100-episode return -20.16 +-71.73
Episode 800, mean 100-episode return -17.42 +-62.07
Episode 900, mean 100-episode return -5.14 +-27.98
Episode 1000, mean 100-episode return -1.83 +-12.61

• python3 td_algorithms.py --mode=expected_sarsa --n=1
Episode 100, mean 100-episode return -361.33 +-186.01
Episode 200, mean 100-episode return -214.54 +-104.67
Episode 300, mean 100-episode return -179.69 +-103.63
Episode 400, mean 100-episode return -147.74 +-92.59
Episode 500, mean 100-episode return -109.10 +-89.53
Episode 600, mean 100-episode return -79.89 +-75.51
Episode 700, mean 100-episode return -59.05 +-57.01
Episode 800, mean 100-episode return -40.03 +-44.50
Episode 900, mean 100-episode return -25.21 +-38.41
Episode 1000, mean 100-episode return -19.67 +-34.80

• python3 td_algorithms.py --mode=expected_sarsa --n=1 --off_policy
Episode 100, mean 100-episode return -358.93 +-187.30
Episode 200, mean 100-episode return -221.93 +-91.20
Episode 300, mean 100-episode return -176.05 +-110.42
Episode 400, mean 100-episode return -124.69 +-92.91
Episode 500, mean 100-episode return -98.44 +-86.99
Episode 600, mean 100-episode return -64.75 +-69.56
Episode 700, mean 100-episode return -51.46 +-52.95
Episode 800, mean 100-episode return -28.69 +-44.46
Episode 900, mean 100-episode return -17.27 +-30.60
Episode 1000, mean 100-episode return -10.83 +-25.23

• python3 td_algorithms.py --mode=expected_sarsa --n=4
Episode 100, mean 100-episode return -555.15 +-204.64
Episode 200, mean 100-episode return -261.06 +-131.13
Episode 300, mean 100-episode return -144.66 +-157.24
Episode 400, mean 100-episode return -88.66 +-144.94
Episode 500, mean 100-episode return -25.55 +-69.55
Episode 600, mean 100-episode return -6.82 +-30.54
Episode 700, mean 100-episode return -2.32 +-18.24
Episode 800, mean 100-episode return -0.09 +-10.35
Episode 900, mean 100-episode return -0.06 +-14.05
Episode 1000, mean 100-episode return -0.28 +-11.60

• python3 td_algorithms.py --mode=expected_sarsa --n=4 --off_policy
Episode 100, mean 100-episode return -526.36 +-202.40
Episode 200, mean 100-episode return -306.17 +-167.38
Episode 300, mean 100-episode return -258.25 +-180.35
Episode 400, mean 100-episode return -146.21 +-174.19
Episode 500, mean 100-episode return -120.67 +-167.93
Episode 600, mean 100-episode return -85.25 +-153.25
Episode 700, mean 100-episode return -23.43 +-92.30
Episode 800, mean 100-episode return -21.92 +-70.71
Episode 900, mean 100-episode return -4.94 +-22.51
Episode 1000, mean 100-episode return -5.79 +-26.25

• python3 td_algorithms.py --mode=tree_backup --n=1
Episode 100, mean 100-episode return -361.33 +-186.01
Episode 200, mean 100-episode return -214.54 +-104.67
Episode 300, mean 100-episode return -179.69 +-103.63
Episode 400, mean 100-episode return -147.74 +-92.59
Episode 500, mean 100-episode return -109.10 +-89.53
Episode 600, mean 100-episode return -79.89 +-75.51
Episode 700, mean 100-episode return -59.05 +-57.01
Episode 800, mean 100-episode return -40.03 +-44.50
Episode 900, mean 100-episode return -25.21 +-38.41
Episode 1000, mean 100-episode return -19.67 +-34.80

• python3 td_algorithms.py --mode=tree_backup --n=1 --off_policy
Episode 100, mean 100-episode return -358.93 +-187.30
Episode 200, mean 100-episode return -221.93 +-91.20
Episode 300, mean 100-episode return -176.05 +-110.42
Episode 400, mean 100-episode return -124.69 +-92.91
Episode 500, mean 100-episode return -98.44 +-86.99
Episode 600, mean 100-episode return -64.75 +-69.56
Episode 700, mean 100-episode return -51.46 +-52.95
Episode 800, mean 100-episode return -28.69 +-44.46
Episode 900, mean 100-episode return -17.27 +-30.60
Episode 1000, mean 100-episode return -10.83 +-25.23

• python3 td_algorithms.py --mode=tree_backup --n=4
Episode 100, mean 100-episode return -522.36 +-226.08
Episode 200, mean 100-episode return -264.75 +-136.85
Episode 300, mean 100-episode return -163.50 +-168.74
Episode 400, mean 100-episode return -54.18 +-105.95
Episode 500, mean 100-episode return -27.66 +-70.12
Episode 600, mean 100-episode return -9.05 +-23.62
Episode 700, mean 100-episode return -4.76 +-31.53
Episode 800, mean 100-episode return -2.57 +-12.74
Episode 900, mean 100-episode return 0.58 +-12.08
Episode 1000, mean 100-episode return 1.17 +-9.07

• python3 td_algorithms.py --mode=tree_backup --n=4 --off_policy
Episode 100, mean 100-episode return -519.80 +-233.81
Episode 200, mean 100-episode return -302.58 +-123.70
Episode 300, mean 100-episode return -203.98 +-153.41
Episode 400, mean 100-episode return -95.12 +-136.49
Episode 500, mean 100-episode return -25.28 +-65.11
Episode 600, mean 100-episode return -4.79 +-19.20
Episode 700, mean 100-episode return -8.53 +-29.38
Episode 800, mean 100-episode return -5.13 +-19.44
Episode 900, mean 100-episode return -1.98 +-12.35
Episode 1000, mean 100-episode return -1.59 +-11.99


### q_learning_tiles

Deadline: Nov 10, 23:59 Nov 17, 23:59  4 points

Improve the q_learning task performance on the MountainCar-v0 environment environment using linear function approximation with tile coding. Your goal is to reach an average reward of -110 during 100 evaluation episodes.

The environment methods are described in the q_learning assignments, with the following changes:

• The state returned by the env.step method is a list containing weight indices of the current state (i.e., the feature vector of the state consists of zeros and ones, and only the indices of the ones are returned). The action-value function is therefore approximated as a sum of the weights whose indices are returned by env.step.
• The env.observation_space.nvec returns a list, where the $i$-th element is a number of weights used by first $i$ elements of state. Notably, env.observation_space.nvec[-1] is the total number of weights.

You can start with the q_learning_tiles.py template, which parses several useful parameters and creates the environment. Implementing Q-learning is enough to pass the assignment, even if both N-step Sarsa and Tree Backup converge a little faster. The default number of tiles in tile encoding (i.e., the size of the list with weight indices) is args.tiles=8, but you can use any number you want (but the assignment is solvable with 8).

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.

### q_network

Deadline: Nov 10, 23:59 Nov 17, 23:59  6 points

Solve the CartPole-v1 environment environment from the OpenAI Gym using Q-learning with neural network as a function approximation.

You can start with the q_network.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.

The continuous environment is very similar to a discrete one, except that the states are vectors of real-valued observations with shape env.observation_space.shape.

Use Q-learning with neural network as a function approximation, which for a given state returns state-action values for all actions. You can use any network architecture, but one hidden layer of several dozens ReLU units is a good start. Your goal is to reach an average return of 400 during 100 evaluation episodes.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).

### car_racing

Deadline: Nov 17, 23:59  8 points + 10 bonus

The goal of this competition is to use Deep Q Networks (and any of Rainbow improvements) on a more real-world CarRacing-v0 environment from the OpenAI Gym.

The supplied car_racing_environment.py provides the environment. It is continuous and states are RGB images of size $96×96×3$, but you can downsample them even more. The actions are also continuous and consist of an array with the following three elements:

• steer in range [-1, 1]
• gas in range [0, 1]
• brake in range [0, 1]; note that full brake is quite aggressive, so you might consider using less force when braking

Internally you should probably generate discrete actions and convert them to the required representation before the step call. The smallest set is probably left, right, gas, brake and no-op, but you can use a more fine-grained one if you like.

The environment also support frame skipping, which improves its performance (only some frames need to be rendered).

In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 300, you obtain 8 points. The task is also a competition and at most 10 points will be awarded according to relative ordering of your solution performances.

The car_racing.py template parses several useful parameters and creates the environment. Note that the car_racing_environment.py can be executed directly and in that case you can drive the car using arrows.

Also, you might want to use a vectorized version of the environment for training, which runs several individual environments in separate processes. The template contains instructions how to create it. The vectorized environment expects a vector of actions and returns a vector of observations, rewards, dones and infos. When one of the environments is done, it is immediately reset and state is the initial state of a new episode.

### reinforce

Deadline: Nov 24, 23:59  5 points

Solve the CartPole-v1 environment environment from the OpenAI Gym using the REINFORCE algorithm.

Your goal is to reach an average return of 490 during 100 evaluation episodes.

Start with the reinforce.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

### reinforce_baseline

Deadline: Nov 24, 23:59  4 points

This is a continuation of reinforce assignment.

Using the reinforce_baseline.py template, solve the CartPole-v1 environment environment using the REINFORCE with baseline algorithm.

Using a baseline lowers the variance of the value function gradient estimator, which allows faster training and decreases sensitivity to hyperparameter values. To reflect this effect in ReCodEx, note that the evaluation phase will automatically start after 200 episodes. Using only 200 episodes for training in this setting is probably too little for the REINFORCE algorithm, but suffices for the variant with a baseline.

Your goal is to reach an average return of 490 during 100 evaluation episodes.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

### cart_pole_pixels

Deadline: Nov 24, 23:59  5 points + 6 bonus

The supplied cart_pole_pixels_environment.py generates a pixel representation of the CartPole environment as an $80×80$ image with three channels, with each channel representing one time step (i.e., the current observation and the two previous ones).

During evaluation in ReCodEx, three different random seeds will be employed, each with time limit of 10 minutes, and if you reach an average return at least 300 on all of them, you obtain 5 points. The task is also a competition and at most 6 points will be awarded according to relative ordering of your solution performances.

The cart_pole_pixels.py template parses several parameters and creates the environment. You are again supposed to train the model beforehand and submit only the trained neural network.

### paac

Deadline: Dec 01, 23:59  5 points

Solve the CartPole-v1 environment environment using parallel actor-critic algorithm, employing the vectorized environment described in car_racing assignment.

Your goal is to reach an average return of 450 during 100 evaluation episodes.

Start with the paac.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.

### paac_continuous

Deadline: Dec 01, 23:59  6 points

Solve the MountainCarContinuous-v0 environment environment using parallel actor-critic algorithm with continuous actions. When actions are continuous, env.action_space is the same Box space as env.observation_space, offering:

• env.action_space.shape, which specifies the shape of actions (you can assume actions are always a 1D vector),
• env.action_space.low and env.action_space.high, which specify the ranges of the corresponding actions.

Your goal is to reach an average return of 90 during 100 evaluation episodes.

Start with the paac_continuous.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.

### ddpg

Deadline: Dec 01, 23:59  7 points

Solve the Pendulum-v0 environment environment using deep deterministic policy gradient algorithm. The environment is continuous, states and actions are described at OpenAI Gym Wiki.

Your goal is to reach an average return of -200 during 100 evaluation episodes.

Start with the ddpg.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.

### walker

Deadline: Dec 08, 23:59  8 points + 10 bonus

In this exercise exploring continuous robot control, try solving the BipedalWalker-v3 environment environment from the OpenAI Gym. The environment is continuous, states and actions are described at OpenAI Gym Wiki.

In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 100, you obtain 8 points. The task is also a competition and at most 10 points will be awarded according to relative ordering of your solution performances.

You can start with the ddpg.py template, only set args.env to BipedalWalker-v3.

### walker_hardcore

Deadline: Dec 08, 23:59  10 bonus

As an extesnion of the walker assignment, try solving the BipedalWalkerHardcore-v3 environment environment from the OpenAI Gym.

The task is a competition only and at most 10 points will be awarded according to relative ordering of your solution performances. In ReCodEx, your solution will be evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 0, ReCodEx shows the solution as correct.

You can start with the ddpg.py template, only set args.env to BipedalWalkerHardcore-v3.

### sac_bonus

Deadline: Dec 15, 23:59  8 bonus

In this bonus-only exercise, try using the SAC algorithm to solve the Hopper environment, but using HopperBulletEnv-v0 from open-source PyBullet framework. Basic information about Gym interface is in PyBullet Quickstart Guide; generally you just need to import pybullet_envs, which will register the Gym environments.

You can install PyBullet by pip3 install pybullet. However, precompiled binaried are available for Linux only, on Windows the library must be compiled by a suitable Microsoft Visual C++ compiler.

In ReCodEx, you are expected to submit an already trained model, which is evaluated on the HopperBulletEnv-v0 environment with two seeds, each for 100 episodes with a time limit of 15 minutes. If your average return is at least 1000, you obtain 6 bonus points.

A template for the SAC algorithm will be made available.

### trace_algorithms

Deadline: Dec 15, 23:59  4 points

Starting with the trace_algorithms.py template, implement the following state value estimations:

• use $n$-step estimates for a given $n$;
• if requested, use eligibility traces with a given $λ$;
• allow off-policy correction using importance sampling, optionally clipping the importance sampling ratios by a given threshold.

### Install

• Installing to central user packages repository

You can install all required packages to central user packages repository using pip3 install --user tensorflow==2.3.1 tensorflow_probability==0.11.1 numpy==1.18.5 gym==0.17.2 box2d==2.3.10.

• Installing to a virtual environment

Python supports virtual environments, which are directories containing independent sets of installed packages. You can create the virtual environment by running python3 -m venv VENV_DIR followed by VENV_DIR/bin/pip3 install tensorflow==2.3.1 tensorflow_probability==0.11.1 numpy==1.18.5 gym==0.17.2 box2d==2.3.10.

### ReCodEx

• What files can be submitted to ReCodEx?

You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.

• What file does ReCodEx execute and what arguments does it use?

Exactly one file with py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and the main method is executed (during the import, __name__ == "__recodex__").

The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but is overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.

• What are the time and memory limits?

The memory limit during evaluation is 1.5GB. The time limit varies, but should be at least 10 seconds and at least twice the running time of my solution.

### Competitions

• Which algorithms are allowed during competitions?

Unless stated otherwise, you can use any algorithm to solve the competition task at hand. However, the implementation should be either created by you or you should understand it fully (i.e., if you use for example PPO algorithm, you should know how it works and how it is implemented).

### Requirements

To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 (including the bonus points) will be transfered to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available.

To pass the exam, you need to obtain at least 60, 75 and 90 out of 100-point exam, to obtain grades 3, 2 and 1, respectively. (PhD students with binary grades require 75 points.) The exam consists of five 20-point questions, which are randomly generated, but always cover the whole course. In addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count.