Introduction
Background
In March of 2016, Google’s AlphaGo defeated the South Korean Go master Lee Sedol. This was unprecedented, and the underlying technology was deep reinforcement learning, the family of techniques that includes deep Q-learning.
Plan
The last time we covered Q-learning was toward the end of January. Today, we’re going to revisit the topic. Instead of a simple Q-learning algorithm, we’re going to advance into deep Q-learning.
First, we will review what we know about reinforcement learning and Q-learning. Second, we will learn about deep Q-learning. Third, we’re going to look at a particular environment from OpenAI Gym. Finally, we’re going to implement deep Q-learning with Keras.
Background
Review
Remember the idea behind reinforcement learning: an agent interacts with an environment, takes actions, and collects rewards. Two key terms are policy and Q. A policy is a particular strategy for choosing an action in a given state, and Q is an evaluation of that choice: the expected reward from taking a particular action in a particular state. Learning those Q-values from experience is the idea behind Q-learning, and we’re still going to adhere to this structure.
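As a quick refresher, tabular Q-learning keeps a table of Q-values and, after each step, nudges the entry for the state-action pair it just tried toward the observed reward plus the discounted value of the best next action:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]
```

Here \(\alpha\) is the learning rate, \(\gamma\) is the discount factor, \(r\) is the reward, and \(s'\) is the next state.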
Deep Q-Learning
Here’s the idea. Instead of a table of Q-values, the agent will use a neural network. The network will accept the state as input and return an estimated Q-value for every possible action; the agent then takes the action with the highest predicted value.
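As a rough sketch of that greedy step (assuming a hypothetical Keras model named model that outputs one Q-value per action):

```python
import numpy as np

# Sketch only: assume `model` is a Keras network that maps a state vector
# to one estimated Q-value per action.
def choose_action(model, state):
    state = np.array([state])                      # Keras expects a batch dimension
    q_values = model.predict(state, verbose=0)[0]  # one Q-value per action
    return int(np.argmax(q_values))                # pick the action with the highest Q-value
```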
Application
Scenario
After Gautom successfully designed his taxi cab AI, he started goofing off during work. One of his stunts was riding on top of his taxi. Below, we can see a low-resolution image of him in action.
Our goal is really simple. As the taxi, we want to maximize the amount of time Gautom can stand on top of the taxi without falling off. Ideally, we’d never drop him at all.
Components
Who’s the agent?
The environment?
What is the observation space?
What about the action space?
How can we dispense rewards?
Why might simple Q-learning not work?
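To make the observation- and action-space questions concrete, it helps to inspect the environment in code. The post doesn’t name the exact Gym environment, so the snippet below assumes CartPole-v1 as a stand-in for the balancing-on-the-taxi task:

```python
import gym

# Assumption: the balancing scenario maps onto Gym's CartPole-v1.
env = gym.make("CartPole-v1")

print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push left or push right

# Classic Gym API shown here; newer Gymnasium releases return extra values from reset/step.
state = env.reset()                                              # initial observation
state, reward, done, info = env.step(env.action_space.sample())  # take one random action
```

The continuous, four-dimensional observation space is also a hint for the last question: a lookup table over every possible combination of those values quickly becomes impractical.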
Design
Recall the structure of a neural network.
The input layer will match the observation space, and the output layer will match the action space. Our network will have three hidden layers.
The first two hidden layers will use ReLU activation functions.
The final hidden layer will use a linear activation function.
We’re going to use the Mean Squared Error (MSE) loss function.
Finally, for gradient descent, we’ll be using the Adam optimizer. In practice it’s a reliable default that tends to converge quickly across a wide range of problems. If you’re interested, the procedure is sketched below.
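Sketching the procedure from the original Adam paper (Kingma & Ba, 2015): for parameters \(\theta\) with gradient \(g_t\) at step \(t\), Adam keeps running averages of the gradient and its square, corrects them for startup bias, and scales the update accordingly:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)
\theta_t = \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
```

Typical defaults are \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\), with \(\eta\) the learning rate.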
Implementation
The notebook can be found here.
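For reference, here is a minimal sketch of the design above in Keras, reading the layers as two ReLU layers followed by a linear layer that doubles as the output (one Q-value per action). The layer widths (24 units), learning rate, and discount factor are illustrative assumptions, not values taken from the notebook:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(state_size, action_size):
    # Input matches the observation space; the final linear layer emits one Q-value per action.
    model = Sequential([
        Dense(24, activation="relu", input_shape=(state_size,)),  # hidden layer 1: ReLU
        Dense(24, activation="relu"),                             # hidden layer 2: ReLU
        Dense(action_size, activation="linear"),                  # final layer: linear Q-value estimates
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=0.001))
    return model

def train_step(model, state, action, reward, next_state, done, gamma=0.95):
    # Q-learning target: observed reward plus the discounted best Q-value of the next state.
    target = reward
    if not done:
        target += gamma * np.max(model.predict(np.array([next_state]), verbose=0)[0])
    # Nudge only the chosen action's Q-value toward the target; other outputs stay as predicted.
    q_values = model.predict(np.array([state]), verbose=0)
    q_values[0][action] = target
    model.fit(np.array([state]), q_values, epochs=1, verbose=0)
```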