
Admin: Replit

1. Learning from Experience

When we are faced with a task, we rarely have a full model of how our actions will affect the state of the environment or what will get us rewards.

For example, if we're designing an agent to control an autonomous car, we can't perfectly predict the future state. This is because we don't have a perfect model of how other cars and pedestrians will move.

How the state evolves in response to our actions is defined by the state transition function. This can be stochastic (based on probabilities*) or deterministic (not based on probability). In the autonomous driving example above, we don't know the state transition function.
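As a toy illustration (not from the tutorial), the sketch below shows one deterministic and one stochastic transition function for a one-dimensional world whose state is an integer position; the 0.8/0.2 'slip' probability is an arbitrary choice.

```python
import random

def deterministic_transition(state: int, action: int) -> int:
    # The next state is fully determined by the current state and action.
    return state + action

def stochastic_transition(state: int, action: int) -> int:
    # With probability 0.8 the action works as intended; with probability 0.2
    # the agent 'slips' and stays put - the outcome is drawn from a
    # probability distribution, like a dice roll.
    if random.random() < 0.8:
        return state + action
    return state

print(deterministic_transition(3, 1))  # always prints 4
print(stochastic_transition(3, 1))     # usually prints 4, sometimes 3
```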

In these cases, we can instead learn from experience - taking actions, observing the results, and using them to improve future action selection.

∗ e.g. the outcome of a dice roll
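To make this 'learning from experience' loop concrete, here is a bare-bones sketch. The `env.reset()`/`env.step()` interface and the two helper functions are assumptions for illustration, not part of this tutorial's code.

```python
def run_episode(env, select_action, update_from_experience):
    # One episode of interaction: act, observe the result, and use it
    # to improve future action selection.
    state = env.reset()
    done = False
    while not done:
        action = select_action(state)                # take an action
        next_state, reward, done = env.step(action)  # see the result
        update_from_experience(state, action, reward, next_state)
        state = next_state
```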

Training & Testing in RL

2. Temporal-Difference Learning

Temporal-Difference (TD) Learning is the first Reinforcement Learning algorithm we’re introducing to you. It learns a value function estimate directly from interacting with the environment - without needing to know the state transition function. For now, we do still use the state transition function when choosing actions, to predict the next state each action would lead to - but the learning itself only requires experience.

TD Learning is an ‘online’ learning algorithm. This means it learns after every step of experience. This is in contrast to algorithms that interact with the environment for some time to gather experience, then learn based on this experience.
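As a minimal sketch (not the tutorial's own code), the function below performs TD(0)-style policy evaluation with this kind of online, per-step update. The `env.reset()`/`env.step()` interface, the policy function `pi`, and the step size `alpha` and discount factor `gamma` are all assumptions for illustration.

```python
from collections import defaultdict

def td0_evaluate(env, pi, num_episodes=100, alpha=0.1, gamma=0.99):
    V = defaultdict(float)  # value estimates, initialised arbitrarily (here: 0)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = pi(state)  # follow the policy being evaluated
            next_state, reward, done = env.step(action)
            # 'Online' learning: update the estimate after every single step,
            # moving V(s) towards the TD target r + gamma * V(s').
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```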

Since the value function depends on the policy being followed (as seen in Tutorial 2), TD Learning is used to estimate the value function for a given policy, $\pi$.

Here's the basic logic of the algorithm:

  1. Take the policy to be evaluated, $\pi$, as input
  2. Initialise the value function with arbitrary values across states