Reinforcement Learning: Building Recommender Systems
Have you ever made a decision that seemed like a good idea at the time, only for it to turn out years later to be a complete mistake? Has it gone the other way, where a mistake you make now turns into something good later on?
This thought-provoking idea is what led to the field of Reinforcement Learning. The move a chess player makes might make sense in the moment but still cost the player the game.
How do we make decisions now that set us up for success over time and dynamically react to changes as they play out? That is what RL is all about!
If you work in a dynamic industry (finance, aerospace, automotive, advertising, media, social media), RL can bring massive value to you. Learning how to operate within an environment can lead to better portfolio decisions, more effective advertising spend, autonomous vehicles, automated processes, and much more.
In this class we will delve into what you need to know about RL to get started, from the Bellman equations and value iteration up to deep Q-networks, along with resources for learning more after the class.
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
 Balancing exploitation and exploration within a dynamic environment. We will introduce the Gittins index as well as other ideas on how this balance works in practice.
 The trade-off between model-free and model-based reinforcement learning algorithms
 The connection between reinforcement learning and supervised learning
 Value iteration (Bellman equations), Q-Learning, and DQNs for model-free reinforcement learning. Q-Learning is an algorithm that learns the long-term estimated reward for taking an action in a given state. A DQN, or Deep Q-Network, uses a neural net to assign values to actions in a given state (sometimes the state is an image).
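The "long-term estimated reward" that Q-Learning targets is the discounted sum of future rewards. A minimal Python sketch of that quantity, with a made-up reward sequence and discount factor for illustration:

```python
# Discounted return: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
# This is the quantity Q-Learning estimates for each state-action pair.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backward
        g = r + gamma * g
    return g

# A small reward now vs. a big reward three steps later (made-up numbers)
print(discounted_return([1.0, 0.0, 0.0, 10.0]))  # 1 + 0.9**3 * 10
```

Note how discounting lets a large future reward still outweigh a small immediate one, which is exactly the chess intuition above.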

And you’ll be able to:

Build a simple model using value iteration to traverse a maze
 Build a simple stock trader using Q-Learning
 Play the game Breakout using a DQN.
 Apply value iteration, Q-Learning, and DQNs to the dynamically updating problems you face at work, whether that's trading stocks, choosing advertisements to serve, or automating processes with an optimal policy.
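As a preview of the maze exercise, here is a minimal value-iteration sketch on a hypothetical one-dimensional "maze". The layout, rewards, and discount factor are all made up for illustration; the class exercise uses a real maze:

```python
# Value iteration on a tiny 1-D corridor: states 0..4, goal at state 4.
GAMMA = 0.9
N = 5
ACTIONS = [-1, +1]  # move left or move right

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)     # walls clamp movement
    r = 1.0 if s2 == N - 1 else 0.0    # reward only for reaching the goal
    return s2, r

# Repeated Bellman backups: V(s) <- max_a [ r + gamma * V(s') ]
V = [0.0] * N
for _ in range(100):
    V = [max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS)
         for s in range(N)]

# Greedy policy: in every state the best move is right, toward the goal
policy = [max(ACTIONS, key=lambda a, s=s: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N)]
print(policy)  # -> [1, 1, 1, 1, 1]
```

The same loop scales to a 2-D maze by enlarging the state set and action set; the Bellman backup itself is unchanged.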
This training course is for you because...
 You are a data scientist with a background in supervised and unsupervised learning and want to learn reinforcement learning. This is for the data scientist who is tired of classifying things at a single point in time instead of over time.
 You are a software engineer who wants to optimize an automated system over time using machine learning.
Prerequisites
 An introduction to supervised learning.
 An understanding of what classification and regression are.
 A basic knowledge of optimization theory, information theory, and algebra would be helpful.
 A background in some of the deep learning techniques applied to images is also useful.
Recommended preparation:
 Supervised Learning (video)
 A Practical Introduction to Machine Learning (live online training)
About your instructor

Matt Kirk is a data architect, software engineer, and entrepreneur based out of Seattle, WA.
For years, he struggled to piece together his quantitative finance background with his passion for building software.
Then he discovered his affinity for solving problems with data.
Now, he helps multimillion-dollar companies with their data projects. From diamond recommendation engines to marketing automation tools, he loves educating engineering teams about methods to start their big data projects.
To learn more about how you can get started with your big data project (beyond taking this class), check out matthewkirk.com for tips.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction (10 minutes)
Reinforcement Learning (40 minutes)
 Presentation: Why Reinforcement Learning?
 Balance between exploration and exploitation
 Learn over time instead of all at once.
 Learning the policy to utilize over just a value.
 A way to learn AI heuristics and plans.
 Presentation: Why now? Dota 2, AlphaGo, and other advancements
 Presentation: What exactly is Reinforcement Learning?
 Autoregressive supervised learning
 Bellman equations
 Markov Decision Processes.
 SARSA
 Model-free vs. model-based RL.
 Presentation: Who is currently using RL effectively?
 Hedge funds
 Selfdriving cars
 Games
 OpenAI
 Q&A
 Quiz: (5 minutes)
 Break (5 minutes)
Discussions: (5 minutes)
 What would RL be suited for in your organization?
 When would someone want to use a Model vs be Model Free?
 When should someone optimize the policy vs the end reward?
Q-Learning (35 minutes)
 Lecture: Value iteration with the Bellman equations.
 Lecture: Rearranging value iteration into Q-Learning, or learning the optimal action based on an expected Q value, or the value at the terminal state.
 Lecture: Walk-through of multiple Q-Learning scenarios
 What to pick as a reward?
 What is a state?
 What is an action?
 Are the actions stochastic or deterministic?
 Q&A
 Quiz: Test recollection of Q-Learning (5 minutes)
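The update rule at the heart of this section fits in a few lines. A minimal tabular sketch on a made-up two-state, two-action toy problem; the environment, hyperparameters, and seed are all illustrative:

```python
import random

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate
N_STATES, N_ACTIONS = 2, 2
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def env_step(s, a):
    """Toy deterministic environment: action 1 in state 1 pays off."""
    if s == 1 and a == 1:
        return 0, 1.0            # back to state 0 with reward 1
    return (s + a) % N_STATES, 0.0

random.seed(0)
s = 0
for _ in range(5000):
    # Epsilon-greedy action selection balances exploration and exploitation
    if random.random() < EPS:
        a = random.randrange(N_ACTIONS)
    else:
        a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
    s2, r = env_step(s, a)
    # Q-Learning update: move Q(s,a) toward r + gamma * max_a' Q(s2, a')
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
    s = s2

print([[round(q, 2) for q in row] for row in Q])
```

After training, the greedy action in each state is the one that steers the agent back to the rewarding transition, even though only one transition pays anything immediately.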
Q-Trader using straight Q-Learning (5-10 minutes)
Demonstration:
 Walk through hand-coded states
 Walk through hand-coded actions
 Determine the reward as the Sharpe ratio.
 Show my results
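For reference, the Sharpe-ratio reward can be computed in a few lines with Python's statistics module. The return series below is made up, and the risk-free rate and annualization are omitted for simplicity:

```python
import statistics

def sharpe_ratio(returns):
    """Mean return divided by its sample standard deviation.

    Risk-free rate and annualization are omitted for simplicity.
    """
    return statistics.mean(returns) / statistics.stdev(returns)

# Illustrative daily returns for a hypothetical strategy
daily_returns = [0.01, -0.005, 0.007, 0.002, -0.001]
print(round(sharpe_ratio(daily_returns), 3))
```

Using the Sharpe ratio rather than raw profit rewards the agent for risk-adjusted performance, so a strategy with steadier gains beats a volatile one with the same average return.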
 Break (5 minutes)
Lab: Q-Trader (20 minutes)
 The goal is to fill in the blanks and to do better than my results.
 Some hints: try different learning rates, different numbers of training episodes, or other tweaks.
DQN (30-35 minutes)
 Lecture: Instead of representing Q as a table, can we learn it with something else?
 Lecture: Calculating Q with a neural net. Walk-through of the variations of DQNs, including the double DQN.
 Lecture: What are neural nets good at? How can we roll that into DQNs?
 Convolutions
 Max Pooling
 Dropouts
 Recurrent Layers
 Q&A
 Quiz: State check of knowledge (5 minutes)
 Break (5 minutes)
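To make the "Q as a function instead of a table" idea concrete, here is a minimal numpy sketch using a single linear layer in place of a deep network, with one semi-gradient TD update. The transition, dimensions, and hyperparameters are all made up; a real DQN adds deep (often convolutional) layers, experience replay, and a target network:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 2
ALPHA, GAMMA = 0.01, 0.9

# Linear "Q-network": Q(s) = W @ s + b, one output per action.
# A deep Q-network stacks convolutional/dense layers here instead.
W = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))
b = np.zeros(N_ACTIONS)

def q_values(s):
    return W @ s + b

def td_update(s, a, r, s2):
    """One semi-gradient step toward the TD target r + gamma * max_a' Q(s2, a')."""
    target = r + GAMMA * np.max(q_values(s2))
    error = target - q_values(s)[a]
    W[a] += ALPHA * error * s   # dQ(s)[a]/dW[a] = s
    b[a] += ALPHA * error
    return target

# One illustrative transition: reward 1.0 for taking action 0
s, s2 = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
before = q_values(s)[0]
target = td_update(s, a=0, r=1.0, s2=s2)
after = q_values(s)[0]
print(before, "->", after)  # Q(s, a) has moved toward the target
```

The update is identical in spirit to the tabular rule; the only change is that the error now flows into network weights instead of a single table cell.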
Does a DQN work better than Q-Learning? (10 minutes)
Demonstration:
 The state is now amorphous
 The action is still hand coded
 The reward is still the same.
 Show my results.
Lab (20 minutes)
 Dropouts
 Recurrent layers
 Max pooling
 Convolutions
 etc.
Wrapup and Conclusion (10 minutes)
 Recap all the algorithms covered
 There is always more to learn: A2C, AlphaZero, TD(λ). DQN only scratches the surface.
 Q&A