Why DeepMind’s breakthrough ML algorithm was so tricky to train

In the past decade, Deep Reinforcement Learning has conquered games ranging from Go to StarCraft and Poker. This progress can be traced back to DeepMind’s DQN algorithm [1], which marked the first time Deep Learning and Reinforcement Learning were successfully combined.

First published in 2013, Deep Q-Networks (DQN) learn to take actions that outperform humans in a suite of Atari games, taking only the pixel values of the game as input. Learning such effective behaviour directly from experience is a remarkable achievement. It’s the first hint at a viable path towards Artificial General Intelligence - the ability of computers to behave intelligently across a wide range of tasks, at a level similar to or greater than that of humans!

Part of the reason DQN was such an impressive advance is that Q-learning - the subtype of Reinforcement Learning it uses - is very hard to train. To understand why, we’ll first look at how Q-learning works under the hood. We’ll then explore why it’s so tricky, and dig into how DeepMind trained DQN successfully.

How Q-Learning works

DeepMind’s breakthrough combined deep neural networks with reinforcement learning (RL). An RL agent interacts with an environment, receiving rewards when it takes favourable actions. The goal of an RL algorithm is to maximise the long-term sum of rewards it receives. Specifically, DQN uses a type of RL called Q-learning (hence the name DQN) - we’ll see what this means shortly.

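Because the only thing an RL agent needs is a reward signal, RL is incredibly powerful: in principle, it can be applied to any decision-making task in which favourable behaviour can be expressed as a reward.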

Figure 1: The interaction loop in RL between an agent and the environment. At each timestep, the agent selects an action, which changes the state of the environment. The environment also provides a reward signal to signify whether the agent’s behaviour is favourable or not. The agent learns how to select actions to maximise the sum of future rewards.
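
To make the loop in Figure 1 concrete, here is a minimal Python sketch. Everything in it is a made-up illustration rather than anything from the DQN paper: `CorridorEnv` is a hypothetical toy environment (a corridor of five cells with a goal at the far end), `always_right` is a hand-written policy standing in for a learned agent, and the reward values (-1 per step, +10 at the goal) are arbitrary assumptions. No learning happens here; the code only shows the agent-environment interaction and the sum of rewards that an RL algorithm would try to maximise.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, goal at the last cell."""
    length: int = 5
    position: int = 0

    def reset(self) -> int:
        # Start every episode at the left end and return the initial state.
        self.position = 0
        return self.position

    def step(self, action: int) -> Tuple[int, int, bool]:
        # action is -1 (left) or +1 (right); the agent cannot leave the corridor.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        # Assumed reward scheme: +10 for reaching the goal, -1 for any other step.
        reward = 10 if done else -1
        return self.position, reward, done


def always_right(state: int) -> int:
    """Hand-written policy: ignore the state and always move right."""
    return 1


def run_episode(env: CorridorEnv, policy: Callable[[int], int],
                max_steps: int = 100) -> int:
    """Run one episode and return the sum of rewards collected."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # agent selects an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(CorridorEnv(), always_right))  # prints 7 (three -1 steps, then +10)
```

A policy that always moves left never reaches the goal and simply accumulates -1 per step until `max_steps` runs out, so its sum of rewards is far lower. Telling good policies from bad ones using nothing but this number is the job of an RL algorithm.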
