Live Online training

Reinforcement learning: Building recommender systems


Matt Kirk

Have you ever made a decision that seemed like a good idea at the time but then years later ended up being a complete mistake?

Reinforcement learning (RL) is all about making decisions that set you up for success now and then dynamically reacting to change as the decisions play out. If you work in an industry like finance, aerospace, automotive, advertising, media, or social media, RL can help you make better portfolio decisions and spend advertising dollars more effectively. RL can also benefit autonomous vehicles, process automation, and much more.

Expert Matt Kirk digs into what you need to know to get started with RL. You’ll work your way through everything from Bellman equations and value iteration all the way up to deep Q networks, and you’ll leave with resources to continue learning on your own.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to balance exploitation and exploration within a dynamic environment
  • The Gittins index and other ideas on how this works in practice
  • Trade-offs between model-free and model-based RL algorithms
  • How reinforcement learning relates to supervised learning
  • Value iteration (Bellman equations), which needs a model of the environment, and model-free RL with Q learning and deep Q networks (DQNs)
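To give a flavor of the exploration versus exploitation trade-off listed above, here is a minimal sketch of an epsilon-greedy strategy on a multi-armed bandit. It is an illustration only, not course material: the function name, the payout means, and the parameter values are all assumptions.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Pull arms for `steps` rounds, exploring with probability `epsilon`."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)       # how often each arm was pulled
    estimates = [0.0] * len(true_means)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(true_means))
        else:
            # Exploit: pick the arm with the highest estimate so far.
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)  # hypothetical noisy payout
        counts[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arms with mean payouts 0.2, 0.5, and 0.9.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.9])
```

A larger `epsilon` spends more pulls gathering information; a smaller one commits earlier to whichever arm currently looks best.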

And you’ll be able to:

  • Build a simple model using value iteration to traverse a maze
  • Build a simplistic stock trader using Q learning
  • Play Breakout using a DQN
  • Apply value iteration, Q learning, and DQNs to dynamic updating problems
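As a rough preview of the maze exercise, the sketch below runs value iteration on a tiny grid. The maze layout, the reward values, and the discount factor are assumptions chosen for illustration, and the course's own exercise may be set up differently.

```python
def value_iteration(grid, gamma=0.9, step_reward=-1.0, goal_reward=10.0, tol=1e-6):
    """Return state values for a grid maze.

    `grid` is a list of strings: '.' is open, '#' is a wall, 'G' is the goal.
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    states = [(r, c) for r, row in enumerate(grid)
              for c, ch in enumerate(row) if ch != '#']
    values = {s: 0.0 for s in states}

    def step(state, move):
        r, c = state[0] + move[0], state[1] + move[1]
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]) and grid[r][c] != '#':
            return (r, c)
        return state  # bumping into a wall or edge leaves the agent in place

    while True:
        delta = 0.0
        for s in states:
            if grid[s[0]][s[1]] == 'G':
                continue  # the goal is terminal and keeps a value of 0
            # Bellman backup: best immediate reward plus discounted next value.
            best = max(
                (goal_reward if grid[n[0]][n[1]] == 'G' else step_reward)
                + gamma * values[n]
                for n in (step(s, m) for m in moves)
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

# Hypothetical 3x4 maze with the goal in the top-right corner.
maze = ["..#G",
        ".#..",
        "...."]
values = value_iteration(maze)
```

Following the neighbor with the highest value from any open cell traces a shortest path to the goal. Note that this approach relies on knowing how each move changes the state.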

This training course is for you because...

  • You’re a data scientist with a background in supervised and unsupervised learning and want to learn reinforcement learning.
  • You’re a software engineer who wants to optimize an automated system over time using machine learning.


Prerequisites:

  • A basic understanding of supervised learning, classification, and regression
  • General knowledge of optimization theory, information theory, and algebra (useful but not required)
  • Experience with deep learning techniques applied to images (useful but not required)

Recommended preparation:

About your instructor

  • Matt Kirk is a data architect, software engineer, and entrepreneur based out of Seattle, WA.

    For years, he struggled to piece together his quantitative finance background with his passion for building software.

    Then he discovered his affinity for solving problems with data.

    Now, he helps multimillion-dollar companies with their data projects. From diamond recommendation engines to marketing automation tools, he loves educating engineering teams about methods to start their big data projects.

    To learn more about how you can get started with your big data project (beyond taking this class), check out matthewkirk.com for tips.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

  • Q&A
  • Quiz
  • Break (5 minutes)

Q learning (40 minutes)

  • Lecture: Value iteration with Bellman equations; rearranging value iteration to implement Q learning (learning the optimal action based on an expected Q value or terminal state value); Q learning scenarios (what to pick as a reward, what counts as a state, what counts as an action, and whether the actions are stochastic or deterministic)
  • Q&A
  • Quiz
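The Q learning update covered in this segment can be sketched in a few lines. The example below is an assumption-laden illustration, not the course's code: it uses a hypothetical corridor environment with deterministic actions, a reward of 1 for reaching the right end, and hand-picked learning parameters.

```python
import random
from collections import defaultdict

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q values on a corridor of states 0..n_states-1 with the goal at the right end."""
    rng = random.Random(seed)
    actions = (-1, 1)        # step left or step right
    q = defaultdict(float)   # maps (state, action) to an estimated return
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                # Exploit, breaking ties at random so early episodes still move around.
                best = max(q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # A terminal state contributes no future value.
            future = 0.0 if next_state == goal else max(q[(next_state, a)] for a in actions)
            # Move the estimate toward reward plus discounted best next value.
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
    return q

q_table = q_learning()
```

Unlike value iteration, the update uses only observed transitions (state, action, reward, next state), so no model of the environment is needed.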

Q-Trader using straight Q learning (25 minutes)

  • Lecture: Hand-coded states and actions; reward as Sharpe ratio; results
  • Hands-on exercise: Explore Q-Trader—try out different learning rates, different ways of increasing episode viewing, etc.
  • Break (5 minutes)
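For readers unfamiliar with the Sharpe ratio used as the Q-Trader reward, here is one common way to compute it. This is a sketch under assumptions: the function name, the annualization factor of 252 trading periods, and the choice to return 0 for flat returns are illustrative and may not match how Q-Trader defines its reward.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a sequence of per-period returns."""
    excess = [r - risk_free_rate for r in returns]
    if len(excess) < 2:
        return 0.0  # not enough data to estimate volatility
    spread = statistics.stdev(excess)
    if spread == 0:
        return 0.0  # flat returns: treat the reward as neutral (an assumption)
    return math.sqrt(periods_per_year) * statistics.mean(excess) / spread

# Hypothetical daily returns produced by a trading episode.
reward = sharpe_ratio([0.01, -0.005, 0.02, 0.003])
```

Rewarding risk-adjusted return rather than raw profit discourages the agent from chasing gains that come with large swings.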

DQN (35 minutes)

  • Lecture: Other ways of learning Q; Q calculation using a neural net (variations of DQNs, including Double DQN); what neural nets are good at (convolutions, max pooling, dropouts, recurrent layers) and how to roll that into DQNs
  • Q&A
  • Quiz
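The difference between a standard DQN and the Double DQN variation mentioned above comes down to how the training target is computed. The sketch below uses plain Python lists in place of neural network outputs; the function names and the example numbers are assumptions for illustration.

```python
def dqn_target(reward, done, target_q_next, gamma=0.99):
    """Standard DQN: the target network both picks and scores the next action."""
    if done:
        return reward
    return reward + gamma * max(target_q_next)

def double_dqn_target(reward, done, online_q_next, target_q_next, gamma=0.99):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return reward + gamma * target_q_next[best_action]

# Hypothetical Q values for the next state, one entry per action.
online_q_next = [1.0, 2.0]
target_q_next = [3.0, 0.5]
standard = dqn_target(1.0, False, target_q_next, gamma=0.5)
double = double_dqn_target(1.0, False, online_q_next, target_q_next, gamma=0.5)
```

Here the standard target is 2.5 and the Double DQN target is 1.25. Separating action selection from action scoring is intended to reduce the overestimation that taking a maximum over noisy estimates tends to produce.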

Does a DQN work better than Q learning? (25 minutes)

  • Lecture: Demonstrating that the state is now amorphous, the action is still hand coded, and the reward is still the same

Reinforcement learning (60 minutes)

  • Lecture: The reasons for RL (balancing exploration and exploitation, learning over time instead of all at once, learning a policy rather than just a value, and learning AI heuristics and plans); why now (Dota 2, AlphaGo, and other advancements); what RL is (autoregressive supervised learning, Bellman equations, Markov decision processes, state–action–reward–state–action (SARSA), and model-free versus model-based); current effective RL use (hedge funds, self-driving cars, games, and OpenAI)
  • Group discussion: What is RL suited for in your organization? When would you want to use a model versus be model free? When should you optimize the policy versus the end reward?
  • Show results
  • Hands-on exercises: Explore dropouts, recurrent layers, max pooling, and convolutions

Wrap-up and Q&A (10 minutes)