
Admin: Replit

Week 1 Summary

To recap, in Week 1, we…

  1. Started by defining fundamental terms: Markov Decision Process (MDP), state $s$, action $a$, reward $r$, policy $\pi(\cdot)$ (pi), return $G_t$, discount factor $\gamma$ (gamma), and value $v(\cdot)$ (these are written out in full just after this list)

    See the cheatsheet if you forget any of these

    RL_cheatsheet.pdf

  2. Deduced the optimal policies by hand for a couple of basic, known MDPs (plane repair & flight paths)

  3. Calculated the value function of a policy for routing planes to Hong Kong by interacting with the environment using the TD update rule (also written out after this list). This didn’t rely on knowing the MDP reward function, but we did rely on knowing the transition function (to do the 1-step lookahead) and the optimal policy.

  4. Saw how a mixture of exploration and exploitation is necessary to maximise return in the presence of uncertain rewards.

  5. Implemented TD-Learning with epsilon-greedy exploration to park Barry’s car & play Wild Tic-Tac-Toe (a simplified code sketch of this setup follows the list).
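
For reference, the return and value function from point 1 and the TD update rule from point 3 are written out below. The step size $\alpha$ (alpha) is the usual learning-rate symbol and is introduced here for completeness (it isn't defined in the list above); $V$ denotes the learned estimate of the true value function $v_\pi$.

$$G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$$

$$v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s \,\right]$$

$$V(s_t) \leftarrow V(s_t) + \alpha\left[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\right]$$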
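
And here is a minimal sketch of how points 3–5 fit together: tabular TD-Learning driven by epsilon-greedy action selection. The environment interface (`env.reset()`, `env.step()`, `env.available_actions()`, `env.lookahead()`), the hyperparameter values, and the assumption that states are hashable are all illustrative, not the exact API used in the exercises.

```python
import random
from collections import defaultdict

# Hypothetical environment interface (illustrative only):
#   env.reset() -> state
#   env.available_actions(state) -> list of legal actions
#   env.lookahead(state, action) -> next_state   (the known transition function)
#   env.step(action) -> (next_state, reward, done)
# States are assumed to be hashable so they can index the value table.


def choose_action(env, V, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily
    by picking the action whose successor state has the highest estimated value
    (the 1-step lookahead from point 3)."""
    actions = env.available_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: V[env.lookahead(state, a)])  # exploit


def td_learn(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular TD(0): after every step, nudge V(s) towards the 1-step target
    r + gamma * V(s')."""
    V = defaultdict(float)  # value estimate for each state, default 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose_action(env, V, state, epsilon)
            next_state, reward, done = env.step(action)
            td_target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```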

How did you find Week 1? Please give feedback!! 🙏

Week 2 - Learning from Experience

We’re going to start this week by introducing a general framework that contextualises and explains what we achieved last week with Temporal Difference Learning.

This should help you get a better understanding of:

  1. Why TD-Learning works
  2. How policies and value functions are related
  3. The optimality guarantees associated with TD-Learning and similar learning algorithms
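
As a small preview of point 2, one standard way to write down how a policy and its value function are related is the Bellman expectation equation, where $p(s', r \mid s, a)$ denotes the MDP's transition dynamics (notation assumed here for illustration, not yet introduced in the course):

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\left[\, r + \gamma\, v_\pi(s') \,\right]$$

In words: the value of a state under $\pi$ is the expected immediate reward plus the discounted value of wherever the policy takes you next.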