Introduction

Background

In March of 2016, Google’s AlphaGo defeated the South Korean Go master Lee Sedol. This was unprecedented, and the underlying technology was deep reinforcement learning, the same family of techniques that deep Q-learning belongs to.

Plan

The last time we covered Q-learning was towards the end of January. Today, we’re going to revisit the topic. Instead of a simple Q-learning algorithm, we’re going to advance to deep Q-learning.

First, we will review what we know about reinforcement learning and Q-learning. Second, we will learn about deep Q-learning. Third, we’re going to look at a particular environment from OpenAI. Finally, we’re going to implement deep Q-learning with Keras.

Background

Review

Remember the idea behind reinforcement learning: an agent interacts with an environment by observing states, choosing actions, and collecting rewards, and it tries to maximize the total reward it earns over time.

Two key terms are the policy and Q.

A policy is a particular strategy: a rule that tells the agent which action to take in each state.

Q is an evaluation of that strategy: Q(s, a) estimates the total future reward the agent can expect if it takes action a in state s and follows the policy afterwards.

This is the idea behind Q-learning: keep a table of Q-values, act on it, and refine its entries from experience.
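Concretely, the refinement step is the standard tabular Q-learning update, where α is the learning rate and γ is the discount factor:

```latex
% Tabular Q-learning update: nudge Q(s, a) toward the observed reward
% plus the discounted value of the best action in the next state s'.
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```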

We’re still going to adhere to this structure in deep Q-learning; only the way Q is represented and updated will change.

Deep Q-Learning

Here’s the idea. Instead of a table of Q-values, the agent will use a neural network. The network will accept the state as its input and return an estimated Q-value for every possible action; the agent then takes the action with the highest estimate.
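As a rough sketch of the difference (the names below are placeholders, not from the post): a tabular agent looks its Q-values up, while a deep Q-learning agent predicts them with a network and takes the argmax.

```python
import numpy as np

def act_tabular(q_table, state):
    # Tabular Q-learning: Q-values live in a lookup table indexed by state.
    return np.argmax(q_table[state])

def act_deep(model, state):
    # Deep Q-learning: a Keras model estimates one Q-value per action
    # and the agent takes the action with the highest estimate.
    q_values = model.predict(state[np.newaxis, :], verbose=0)[0]
    return np.argmax(q_values)
```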

Application

Scenario

After Gautom successfully designed his taxi cab AI, he started goofing off at work. One of his stunts was riding on top of his taxi. Below, we can see a low-resolution image of him in action.

Our goal is really simple: as the taxi, we want to maximize the amount of time Gautom can stand on top of the taxi without him falling. Ideally, we’d never drop him at all.

Components

Who’s the agent?

The environment?

What is the observation space?

What about the action space?

How can we dispense rewards?

Why might simple Q-learning not work?
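The scenario reads a lot like a pole-balancing problem, so one reasonable guess (an assumption on my part; the post doesn’t name the environment at this point) is OpenAI Gym’s CartPole. A quick way to answer the space and reward questions above is to ask the environment directly:

```python
import gym

# Assuming the scenario maps to OpenAI Gym's CartPole environment,
# where the "taxi" is the cart and Gautom is the pole.
env = gym.make("CartPole-v1")

# Observation space: a Box of 4 continuous values
# (cart position, cart velocity, pole angle, pole angular velocity).
print(env.observation_space)

# Action space: Discrete(2) -- push the cart left or right.
print(env.action_space)

# The reward is +1 for every timestep the pole stays upright, so maximizing
# reward means keeping Gautom balanced for as long as possible.
state = env.reset()  # older Gym API; newer versions return (obs, info)
state, reward, done, info = env.step(env.action_space.sample())
print(reward, done)
```

A continuous observation space is also a big hint for the last question: there are infinitely many possible states, so a Q-table can’t enumerate them all.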

Design

Recall the structure of a neural network.

The input layer will be sized to the observation space, and the output layer to the action space, so the network produces one Q-value estimate per action. Our network will have three hidden layers.

The first two hidden layers will use ReLU activation functions.

The final hidden layer will use a linear activation function, so the Q-value estimates aren’t squashed into a fixed range.

We’re going to use the Mean Squared Error (MSE) loss function.

Finally, for gradient descent, we’ll be using the Adam optimizer. It’s a widely used default that tends to converge quickly without much tuning. If you’re interested, the procedure is below.
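Putting the design together, here’s a minimal Keras sketch of the network described above. I’m reading the linear “final hidden layer” as the output layer, which is the usual setup; the hidden-layer width of 24 and the learning rate are assumptions on my part, since the post doesn’t specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(state_size, action_size, learning_rate=0.001):
    # Input sized to the observation space, output sized to the action space.
    model = keras.Sequential([
        keras.Input(shape=(state_size,)),
        layers.Dense(24, activation="relu"),
        layers.Dense(24, activation="relu"),
        # Linear output so the Q-value estimates can take any real value.
        layers.Dense(action_size, activation="linear"),
    ])
    # Mean squared error between predicted and target Q-values, optimized with Adam.
    model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=learning_rate))
    return model

# For CartPole: 4 observations, 2 actions.
model = build_q_network(state_size=4, action_size=2)
model.summary()
```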

Implementation

The notebook can be found here.
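Since the notebook isn’t reproduced here, below is a condensed sketch of what a deep Q-learning training loop with Keras can look like, under the same CartPole assumption as before and reusing the build_q_network helper sketched above. It uses one-step updates with epsilon-greedy exploration; the actual notebook may differ (for example, by adding experience replay).

```python
import numpy as np
import gym

env = gym.make("CartPole-v1")
model = build_q_network(state_size=4, action_size=2)  # helper from the design sketch

gamma = 0.95          # discount factor
epsilon = 1.0         # exploration rate, decayed after every episode
epsilon_min = 0.01
epsilon_decay = 0.995

for episode in range(200):
    state = env.reset()  # older Gym API; newer versions return (obs, info)
    total_reward = 0
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            q_values = model.predict(state[np.newaxis, :], verbose=0)[0]
            action = int(np.argmax(q_values))

        next_state, reward, done, _ = env.step(action)
        total_reward += reward

        # Q-learning target: reward plus the discounted best Q-value of the next state.
        target = reward
        if not done:
            target += gamma * np.max(model.predict(next_state[np.newaxis, :], verbose=0)[0])

        # Nudge only the chosen action's Q-value toward the target.
        target_q = model.predict(state[np.newaxis, :], verbose=0)[0]
        target_q[action] = target
        model.fit(state[np.newaxis, :], target_q[np.newaxis, :], epochs=1, verbose=0)

        state = next_state

    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    print(f"episode {episode}: reward {total_reward}, epsilon {epsilon:.3f}")
```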