Reinforcement Learning Classroom


Lecture #1

Value-Based, Policy-Based, and Actor-Critic

Model-Based and Model-Free

Reinforcement Learning Problem vs Planning Problem

Lecture #2


Bellman Equation and Its Vectorized Form

Solving the Bellman Equation Directly with Linear Algebra
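A minimal sketch of the direct solution, assuming the vectorized Bellman equation for an MRP has the usual form $v = R + \gamma P v$, so that $v = (I - \gamma P)^{-1} R$. The 3-state MRP below (`P`, `R`) and the helper name `solve_mrp` are made up for illustration.

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Solve v = R + gamma * P v exactly by solving (I - gamma * P) v = R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical 3-state MRP; state 2 is absorbing with zero reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])

v = solve_mrp(P, R, gamma=0.9)
print(v)
```

Solving the linear system costs roughly $O(n^3)$ in the number of states, which is why this route is only practical for small MRPs.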

For large MRPs, use iterative methods:


  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal-Difference learning
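As a rough illustration of the first iterative method in the list, the sketch below applies the Bellman backup $v \leftarrow R + \gamma P v$ repeatedly until the values stop changing. The function name `iterative_evaluation`, the tolerance, and the small MRP are assumptions made for this example.

```python
import numpy as np

def iterative_evaluation(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Dynamic-programming evaluation of an MRP by repeated Bellman backups."""
    v = np.zeros(len(R))
    for _ in range(max_iters):
        v_new = R + gamma * P @ v          # one synchronous backup over all states
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical 3-state MRP; state 2 is absorbing with zero reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])

v_iter = iterative_evaluation(P, R, gamma=0.9)
v_exact = np.linalg.solve(np.eye(3) - 0.9 * P, R)
print(v_iter, v_exact)
```

Each sweep is only a matrix-vector product, so this scales to state spaces where a direct solve would be too expensive.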


Bellman Expectation Equation for $V^\pi$

Bellman Expectation Equation for $Q^\pi$

Combined form for $V^\pi$

Combined form for $Q^\pi$
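For reference, the four equations named above are usually written as follows. The symbols $R_s^a$ (expected immediate reward) and $P_{ss'}^a$ (transition probability) are assumed notation that does not appear elsewhere in these notes.

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$

$$Q^\pi(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a\, V^\pi(s')$$

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a\, V^\pi(s') \right)$$

$$Q^\pi(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')$$

The last two follow by substituting the first equation into the second and vice versa.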

Lecture #3

Policy Based


  1. Policy Evaluation
  2. Policy Improvement
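The two steps above can be sketched as a policy iteration loop, assuming that is what this part of the notes refers to. The array layout (`P` with shape actions × states × states, `R` with shape states × actions) and the tiny 2-state MDP are assumptions chosen for the example.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma):
    """Exactly evaluate a deterministic policy. P: (A, S, S), R: (S, A)."""
    n = R.shape[0]
    idx = np.arange(n)
    P_pi = P[policy, idx]              # row s is P[policy[s], s, :]
    R_pi = R[idx, policy]              # reward of the chosen action in each state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def policy_improvement(P, R, v, gamma):
    """Act greedily with respect to the one-step look-ahead values."""
    q = R + gamma * np.einsum("ast,t->sa", P, v)   # shape (S, A)
    return np.argmax(q, axis=1)

def policy_iteration(P, R, gamma):
    policy = np.zeros(R.shape[0], dtype=int)
    while True:
        v = policy_evaluation(P, R, policy, gamma)
        new_policy = policy_improvement(P, R, v, gamma)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Hypothetical MDP: action 0 stays, action 1 switches state;
# staying in state 1 pays reward 1, everything else pays 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

policy, v = policy_iteration(P, R, gamma=0.9)
print(policy, v)
```

In this toy problem the loop settles on "switch in state 0, stay in state 1" after a single improvement step.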

Value Based
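Assuming "Value Based" here refers to value iteration, a minimal sketch is to apply the Bellman optimality backup directly to the values and read off the greedy policy at the end. The array layout and the 2-state MDP are the same kind of made-up example as one would use for policy iteration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeated Bellman optimality backups. P: (A, S, S), R: (S, A)."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iters):
        q = R + gamma * np.einsum("ast,t->sa", P, v)   # shape (S, A)
        v_new = q.max(axis=1)
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < tol:
            break
    # Greedy policy with respect to the final values.
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return v, q.argmax(axis=1)

# Hypothetical MDP: action 0 stays, action 1 switches state;
# staying in state 1 pays reward 1, everything else pays 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

v_star, greedy = value_iteration(P, R, gamma=0.9)
print(v_star, greedy)
```

Unlike policy iteration, no explicit policy is maintained during the sweeps; it only appears at the end as the greedy choice.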

Lecture #4

MC Learning

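$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$

where $G_t$ is the actual return measured from time $t$ to the end of the episode, and $\alpha$ is the step size.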

Temporal-Difference Learning

$V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$

where $R_{t+1} + \gamma V(S_{t+1})$ is an estimated return.

$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target.

$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
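To make the difference between the two targets concrete, here is a small sketch of both updates on a tabular value function stored in a dict. The function names, the `done` flag, and the two-state episode are illustrative assumptions; the MC version is the every-visit, constant step-size variant.

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit MC. `episode` is a list of (S_t, R_{t+1}) pairs for a complete episode."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                  # actual return from this step onwards
        V[state] += alpha * (G - V[state])
    return V

def td0_update(V, state, reward, next_state, alpha, gamma, done=False):
    """TD(0): move V(S_t) towards the TD target; returns the TD error."""
    target = reward + (0.0 if done else gamma * V[next_state])
    delta = target - V[state]
    V[state] += alpha * delta
    return delta

# TD can update after a single step, before the episode ends.
V = {"A": 0.0, "B": 0.0}
delta = td0_update(V, "A", reward=1.0, next_state="B", alpha=0.5, gamma=0.9)

# MC needs the complete episode A -> B -> terminal.
V_mc = mc_update({"A": 0.0, "B": 0.0}, [("A", 0.0), ("B", 1.0)], alpha=1.0, gamma=1.0)
print(delta, V, V_mc)
```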


  • TD can learn before knowing the final outcome
    • TD can learn online after every step.
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments
  • MC has high variance, zero bias
    • Good convergence properties
    • (even with function approximation)
    • Not very sensitive to initial value
    • Very simple to understand and use
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to $v_\pi (s)$
    • (but not always with function approximation)
    • More sensitive to initial value
  • TD exploits Markov property
    • Usually more effective in Markov environments
  • MC does not exploit Markov property
    • Usually more effective in non-Markov environments

Lecture #5

On and Off policy learning

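The key to telling on-policy from off-policy learning is whether the policy or value function you are estimating belongs to the same policy you use to generate samples. If they are the same, the method is on-policy; otherwise it is off-policy. [1]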

  • On-policy
    • Learn on the job
    • Learn about policy $\pi$ from experience sampled from $\pi$
  • Off-policy
    • Look over someone's shoulder
    • Learn about policy $\pi$ from experience sampled from $\mu$

Sarsa (On-policy)

Updating the action-value function:

$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$
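A minimal tabular sketch of the Sarsa update, assuming Q is stored in a dict keyed by `(state, action)` and actions are chosen $\epsilon$-greedily. The helper names, the `done` flag, and the example states are hypothetical.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """On-policy update: the target uses the action actually taken in s_next."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "right")] = 2.0

a_next = epsilon_greedy(Q, "s1", actions, epsilon=0.0)   # greedy, so "right"
sarsa_update(Q, "s0", "right", 1.0, "s1", a_next, alpha=0.5, gamma=0.9)
print(Q[("s0", "right")])
```

The name Sarsa comes from the tuple used in each update: $(S, A, R, S', A')$.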

Off-Policy Learning

  • Learn from observing humans or other agents.
  • Re-use experience generated by old policies to learn a new policy.
  • Learn the optimal policy while following an exploratory policy.
  • Learn about multiple policies while following one policy.

Importance Sampling

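Off-Policy MC Importance Sampling
  • Sample from $\mu$ to evaluate $\pi$
  • Weight the return $G_t$ by the similarity between the two policies, i.e. the product of the ratios $\pi / \mu$ along the whole episode
  • Update the value estimate $V$ towards the corrected return
  • Cannot be used where $\mu = 0$ but $\pi \neq 0$
  • Can greatly increase variance
Off-Policy TD Importance Sampling
  • Weight the TD target $R + \gamma V(S')$ by the importance sampling ratio
  • Only a single importance sampling correction is needed
  • Lower variance than MC importance sampling
  • The policies only need to be similar over a single step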
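The two corrections can be sketched as follows, assuming the probabilities $\pi(a \mid s)$ and $\mu(a \mid s)$ of the sampled actions are available as plain floats. All function and argument names are hypothetical.

```python
def is_ratio(pi_prob, mu_prob):
    """Importance sampling ratio pi(a|s) / mu(a|s) for one sampled action."""
    if mu_prob == 0.0:
        raise ValueError("mu assigns zero probability to a sampled action")
    return pi_prob / mu_prob

def off_policy_mc_update(V, state, G, step_probs, alpha):
    """MC: multiply the ratios over every remaining step of the episode.

    `step_probs` is a list of (pi(a|s), mu(a|s)) pairs from `state` to the end.
    """
    rho = 1.0
    for pi_p, mu_p in step_probs:
        rho *= is_ratio(pi_p, mu_p)
    V[state] += alpha * (rho * G - V[state])

def off_policy_td_update(V, s, r, s_next, pi_p, mu_p, alpha, gamma):
    """TD: a single ratio re-weights the one-step TD target."""
    rho = is_ratio(pi_p, mu_p)
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])

V_td = {"A": 0.0, "B": 1.0}
off_policy_td_update(V_td, "A", 1.0, "B", pi_p=0.5, mu_p=0.25, alpha=0.5, gamma=1.0)

V_mc = {"A": 0.0}
off_policy_mc_update(V_mc, "A", G=1.0, step_probs=[(0.5, 0.25), (1.0, 0.5)], alpha=1.0)
print(V_td, V_mc)
```

The MC weight is a product over many steps, which is where the extra variance comes from; the TD weight involves one ratio only.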


Q-Learning (Off-policy)

  • No importance sampling is needed
  • The next action is sampled from the behaviour policy: $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
  • An alternative successor action is sampled from the target policy: $A' \sim \pi(\cdot \mid S_{t+1})$
  • The Q function is updated towards the value of this alternative action

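Both the behaviour policy and the target policy can be improved simultaneously.

Now assume the target policy $\pi$ is greedy with respect to $Q(s,a)$.

The behaviour policy $\mu$ is $\epsilon$-greedy with respect to $Q(s,a)$.

The Q-learning target then simplifies to:

$R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$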
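With a greedy target policy, the update only needs the maximum Q value in the next state. A minimal tabular sketch, assuming Q is a dict keyed by `(state, action)`; the names and the `done` flag are hypothetical.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done=False):
    """Off-policy update: the target uses the greedy action in s_next,
    regardless of which action the behaviour policy takes next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 3.0

q_learning_update(Q, "s0", "left", 1.0, "s1", actions, alpha=0.5, gamma=0.5)
print(Q[("s0", "left")])
```

Compared with Sarsa, the only change is that the next action's value is replaced by a max over actions.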

Lecture #6 Value Function Approximation

Use a parameterized function to approximate the lookup table. This not only compresses the representation but also generalizes from states that have been seen to states that have not.


  • Linear
  • Neural Network
  • Decision Tree
  • Nearest Neighbor
  • Fourier/ Wavelet
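Taking the first approximator in the list as an example, a linear value function $\hat{v}(s, w) = w^\top x(s)$ can be trained by stochastic gradient descent towards a target value. The feature vector `x`, the target, and the step size below are assumptions for illustration.

```python
import numpy as np

def linear_value(w, x):
    """Approximate value of a state with feature vector x."""
    return float(w @ x)

def sgd_step(w, x, target, alpha):
    """One SGD step on 0.5 * (target - w.x)^2.
    For a linear model the gradient of the prediction w.r.t. w is just x."""
    return w + alpha * (target - w @ x) * x

w = np.zeros(2)
x = np.array([1.0, 0.0])          # hypothetical features of some state
w = sgd_step(w, x, target=2.0, alpha=0.5)
print(w, linear_value(w, x))
```

The target can be an MC return or a TD target; only the weights of active features move, which is how experience in one state carries over to states with similar features.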

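DQN Experience Replay

DQN uses experience replay and fixed Q-targets:

  • Select an action $a_t$ according to an $\epsilon$-greedy policy
  • Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$
  • Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$
  • Compute the Q-learning targets using the old, fixed parameters $w^-$
  • Minimize the MSE between the Q-network and the Q-learning targets; the loss is $L(w) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w) \right)^2 \right]$
  • Use a variant of stochastic gradient descent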
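The steps above can be sketched without a deep-learning framework by standing in a linear model for the Q-network. The class `ReplayMemory`, the `done` flag, and the linear `q_values` function are simplifying assumptions for this example, not part of DQN itself.

```python
import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Fixed-size buffer of transitions; the oldest entries are dropped first."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def q_values(w, s):
    """Hypothetical linear stand-in for the Q-network; w is (n_actions, n_features)."""
    return w @ s

def dqn_loss(w, w_old, batch, gamma):
    """MSE between Q(s, a; w) and the targets r + gamma * max_a' Q(s', a'; w_old)."""
    errors = []
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * np.max(q_values(w_old, s_next))
        errors.append(target - q_values(w, s)[a])
    return float(np.mean(np.square(errors)))

memory = ReplayMemory(capacity=2)
for reward in (0.0, 0.5, 1.0):                     # third store evicts the first
    memory.store(np.array([1.0, 0.0]), 0, reward, np.array([0.0, 1.0]), False)

w = np.zeros((2, 2))                               # current parameters
w_old = np.array([[1.0, 0.0], [0.0, 2.0]])         # frozen parameters w^-
batch = [(np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]), False)]
loss = dqn_loss(w, w_old, batch, gamma=0.5)
print(len(memory), loss)
```

Sampling uniformly from the buffer breaks the correlation between consecutive transitions, and keeping `w_old` frozen stops the targets from moving with every gradient step.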


  • End-to-end learning of $Q(s,a)$ from pixels
  • The state $s$ is a stack of the latest 4 frames
  • The output is a value $Q(s,a)$ for each of the 18 key (button) actions
  • The reward is the change in score at that step

Lecture #7 Policy Gradient Methods

Unlike value-based methods, policy gradient methods model the policy directly.

RL method | Feature
Value-Based | Learns a value function; the policy is implicit
Policy-Based | Learns the policy directly
Actor-Critic | Learns a policy and a value function simultaneously
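As one possible illustration of learning the policy directly, the sketch below uses a softmax policy with linear action preferences and a REINFORCE-style update. REINFORCE is not named in these notes and is chosen here only as a common policy-gradient method; `theta`, `x`, and the step size are assumptions.

```python
import numpy as np

def softmax_policy(theta, x):
    """pi(. | s) for state features x; theta has shape (n_actions, n_features)."""
    prefs = theta @ x
    prefs = prefs - prefs.max()            # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_step(theta, x, action, G, alpha):
    """theta <- theta + alpha * G * grad log pi(action | x).
    For a softmax policy, the gradient for row b is (1[b == action] - pi(b)) * x."""
    probs = softmax_policy(theta, x)
    grad_log = -np.outer(probs, x)
    grad_log[action] += x
    return theta + alpha * G * grad_log

theta = np.zeros((2, 2))
x = np.array([1.0, 0.0])                   # hypothetical state features
before = softmax_policy(theta, x)
theta = reinforce_step(theta, x, action=0, G=1.0, alpha=1.0)
after = softmax_policy(theta, x)
print(before, after)
```

A positive return for action 0 raises its probability; no value function is involved, which is the policy-based entry of the comparison above.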


Skimming Lecture 1-5

Model-based RL


  • Easy to transfer to new tasks
  • When the reward is sparse, the agent can still learn from its experience of the environment
  • Learns in a way closer to how humans learn
  • Helps with exploration


  • The model is learned first and the value function is then built from it, which can be more error-prone because errors in the model carry over into the value function



  1. Transition function
  2. Reward function

Distribution Model:

  1. Transition probability $T(s' \mid s, a)$

Sample model:
A simulator that generates sample data starting from a given state.
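The difference between the two kinds of model can be sketched on a toy example: a distribution model exposes the full probabilities $T(s' \mid s, a)$, while a sample model only returns one sampled next state per call. The table `T` and both function names are made up for illustration.

```python
import random

# Hypothetical distribution model: T[(s, a)] maps each next state to its probability.
T = {
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "go"): {"s1": 1.0},
}

def transition_prob(T, s, a, s_next):
    """Distribution model: return T(s' | s, a) explicitly."""
    return T[(s, a)].get(s_next, 0.0)

def sample_next_state(T, s, a, rng=random):
    """Sample model: draw a single next state, as a simulator would."""
    next_states = list(T[(s, a)])
    weights = [T[(s, a)][n] for n in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

print(transition_prob(T, "s0", "go", "s1"))
print(sample_next_state(T, "s0", "go"))
```

Here the sample model is built on top of the distribution model for convenience; in practice a simulator can exist without the probabilities ever being written down.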