# COMPGI13 Memo

## Lecture #1

Value-Based, Policy-Based, and Actor-Critic methods

Model-Based and Model-Free

Reinforcement Learning Problem vs Planning Problem

## Lecture #2

### MRP

Bellman equation and its matrix (vectorized) form

Solving the Bellman equation directly with linear algebra: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$
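
A minimal numpy sketch of that direct solve; the 3-state MRP here (transition matrix, rewards, discount) is made up purely for illustration:

```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix P, reward vector R, discount gamma
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])   # last state is absorbing
R = np.array([1.0, 2.0, 0.0])     # expected immediate reward per state
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  v = (I - gamma P)^{-1} R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```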

The direct solution (a matrix inverse, $O(n^3)$ in the number of states) is only practical for small MRPs; for large MRPs iterative solvers are used instead:

• Dynamic programming
• Monte-Carlo evaluation
• Temporal-Difference learning

### MDP

Bellman Expectation Equation for $V^\pi$

Bellman Expectation Equation for $Q^\pi$

Nested (two-step) Bellman Expectation Equation for $V^\pi$

Nested (two-step) Bellman Expectation Equation for $Q^\pi$
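
For reference, the standard Bellman expectation equations these four bullets refer to, using the $\mathcal{P}_{ss'}^a$, $\mathcal{R}_s^a$ notation from the lectures:

```latex
% Bellman expectation equations
V^{\pi}(s)   = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a)
Q^{\pi}(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a}\, V^{\pi}(s')

% Substituting one into the other gives the nested (two-step) forms:
V^{\pi}(s)   = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a}\, V^{\pi}(s') \Big)
Q^{\pi}(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a')
```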

## Lecture #3

### Policy Based

Steps:

1. Policy Evaluation
2. Policy Improvement
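
A minimal tabular policy-iteration sketch of these two steps, assuming a hypothetical tabular-MDP interface `P[s][a] -> list of (prob, next_state, reward)` (not from the lecture code):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """P[s][a] -> list of (prob, next_state, reward) tuples; assumed interface."""
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # 1. Policy evaluation: apply the Bellman expectation backup until V stops changing
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:
            return policy, V
```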

## Lecture #4

### MC Learning

Update $V(S_t)$ towards the actual return: $V(S_t) \leftarrow V(S_t) + \alpha\,\big(G_t - V(S_t)\big)$, where $G_t$ is the actual measured (sampled) return.
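
A short every-visit MC prediction sketch of that update; the episode format (a list of `(state, reward)` pairs) is an assumption made for illustration:

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit MC prediction: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).
    `episode` is assumed to be a list of (state, reward) pairs, where `reward`
    is the reward received after leaving `state`."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G              # the sampled return G_t, built backwards
        V[state] += alpha * (G - V[state])
    return V
```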

### Temporal-Difference Learning

Update $V(S_t)$ towards the estimated return: $V(S_t) \leftarrow V(S_t) + \alpha\,\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$, where $R_{t+1} + \gamma V(S_{t+1})$ is an estimated value.

$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target.

$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
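
The corresponding TD(0) update as a single-step function (sketch; `V` is assumed to be a dict or array of state values):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): move V(S_t) toward the TD target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]             # delta_t
    V[s] += alpha * td_error
    return V
```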

### Comparison

• TD can learn before knowing the final outcome
  • TD can learn online after every step
  • MC must wait until the end of the episode before the return is known
• TD can learn without the final outcome
  • TD can learn from incomplete sequences
  • MC can only learn from complete sequences
  • TD works in continuing (non-terminating) environments
  • MC only works for episodic (terminating) environments
• MC has high variance, zero bias
  • Good convergence properties (even with function approximation)
  • Not very sensitive to initial value
  • Very simple to understand and use
• TD has low variance, some bias
  • Usually more efficient than MC
  • TD(0) converges to $v_\pi(s)$ (but not always with function approximation)
  • More sensitive to initial value
• TD exploits the Markov property
  • Usually more effective in Markov environments
• MC does not exploit the Markov property
  • Usually more effective in non-Markov environments

## Lecture #5

### On and Off policy learning

• On-policy
  • "Learn on the job"
  • Learn about policy $\pi$ from experience sampled from $\pi$
• Off-policy
  • "Look over someone's shoulder"
  • Learn about policy $\pi$ from experience sampled from $\mu$

### Sarsa (On-policy)

Updating the action-value function:

$Q(S,A) \leftarrow Q(S,A) + \alpha\,\big(R + \gamma Q(S',A') - Q(S,A)\big)$
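
A minimal SARSA control loop using that update; the `env.reset()`/`env.step()` interface and the `epsilon_greedy` helper are assumptions for the sketch:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Behaviour policy = target policy for SARSA (on-policy)
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)                  # assumed env interface
            a2 = epsilon_greedy(Q, s2, n_actions, eps)
            # On-policy update: uses the action A' actually taken next
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2
    return Q
```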

### Off-Policy Learning

• Learn by observing humans or other agents
• Re-use experience generated from old policies
• Learn about the optimal policy while following an exploratory policy
• Learn about multiple policies while following one policy

#### Importance Sampling

##### Off-Policy MC Importance Sampling

• Sample returns from $\mu$ to evaluate $\pi$
• Weight the return $G_t$ by the importance ratio between $\pi$ and $\mu$ over the whole episode
• Update $V$ towards the corrected return
• Cannot be used if $\mu$ is zero where $\pi$ is non-zero
• Can dramatically increase variance

##### Off-Policy TD Importance Sampling

• Weight the TD target $R + \gamma V(S')$ by the importance ratio
• Only a single importance-sampling correction is needed
• Much lower variance than MC importance sampling
• Policies only need to be similar over a single step
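
A sketch of the single-step importance-sampled TD update described above; `pi` and `mu` are assumed to be arrays of action probabilities per state:

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=1.0):
    """Weight the TD target by the one-step importance ratio pi(a|s) / mu(a|s)."""
    rho = pi[s][a] / mu[s][a]               # single importance-sampling correction
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V
```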

#### Q-Learning

• No importance sampling is required
• The next action is chosen using the behaviour policy: $A_{t+1} \sim \mu(\cdot|S_t)$
• But an alternative successor action is considered from the target policy: $A' \sim \pi(\cdot|S_t)$
• The Q function is updated using this alternative action

Both the behaviour policy and the target policy can then be improved simultaneously.

Now assume the target policy $\pi$ is greedy with respect to $Q(s,a)$,

and the behaviour policy $\mu$ is $\epsilon$-greedy with respect to $Q(s,a)$.

The Q-learning target then simplifies to $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$.
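
A one-step Q-learning update sketch: the behaviour action comes from the $\epsilon$-greedy policy $\mu$, while the max over next actions implements the greedy target policy $\pi$:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy TD control: the target uses max over next actions, not the action taken."""
    target = r + gamma * np.max(Q[s_next]) * (not done)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```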

## Lecture #6 Value Function Approximation

Use a parameterized function to approximate the lookup table: this not only gives a more compact representation, it also generalizes from visited states to unseen states.

• Linear
• Neural Network
• Decision Tree
• Nearest Neighbor
• Fourier/ Wavelet
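
A sketch of the simplest of these, linear value-function approximation with a semi-gradient TD(0) update; the feature function `phi` is a made-up stand-in:

```python
import numpy as np

def linear_td0_update(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
    """v_hat(s) = phi(s) . w ; semi-gradient TD(0) update of the weight vector."""
    v_s = phi(s) @ w
    v_next = phi(s_next) @ w
    td_error = r + gamma * v_next - v_s
    w += alpha * td_error * phi(s)          # gradient of v_hat w.r.t. w is phi(s)
    return w
```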

DQN uses experience replay and a fixed Q-target network:

• Take action $a_t$ according to an $\epsilon$-greedy policy
• Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$
• Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$
• Compute Q-learning targets with respect to the old, fixed parameters $w^-$
• Optimize the MSE between the Q-network and the Q-learning targets; the loss is

$\mathcal{L}(w) = \mathbb{E}_{(s,a,r,s') \sim D}\Big[\big(r + \gamma \max_{a'} Q(s',a'; w^-) - Q(s,a; w)\big)^2\Big]$
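
A sketch of computing that loss on one sampled mini-batch, with a stand-in `q_net(states, w)` returning a `(batch, n_actions)` array (the network itself is assumed, not implemented here):

```python
import numpy as np

def dqn_loss(q_net, w, w_old, batch, gamma=0.99):
    """batch = (states, actions, rewards, next_states, dones), all numpy arrays."""
    s, a, r, s2, done = batch
    q = q_net(s, w)[np.arange(len(a)), a]      # Q(s,a; w) for the actions taken
    q_next = q_net(s2, w_old).max(axis=1)      # max_a' Q(s',a'; w^-), frozen target net
    target = r + gamma * q_next * (1.0 - done)
    return np.mean((target - q) ** 2)          # MSE on the TD error
```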

### DQN Playing Atari

• End-to-end learning of the value function $Q(s,a)$ directly from pixels
• State $s$ is a stack of the last 4 frames
• Output is one $Q(s,a)$ value for each of the 18 joystick/button actions
• Reward is the change in score for that step

## Lecture #7 Policy Gradient Methods

Unlike value-based methods, policy gradient methods model the policy directly.

| RL | Feature |
| --- | --- |
| Value-Based | Learns a value function; the policy is implicit |
| Policy-Based | Learns the policy directly |
| Actor-Critic | Learns both a policy and a value function |
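
A minimal REINFORCE-style sketch for a softmax policy over discrete actions (one common policy-gradient instance; the feature function `phi(s, a)` and the episode format are assumptions for illustration):

```python
import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    prefs = np.array([phi(s, a) @ theta for a in range(n_actions)])
    prefs -= prefs.max()                     # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(theta, phi, episode, n_actions, alpha=0.01, gamma=1.0):
    """episode = [(s, a, r), ...]; Monte-Carlo policy-gradient update."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G in zip(episode, returns):
        p = softmax_policy(theta, phi, s, n_actions)
        # grad log pi(a|s) for a linear-softmax policy: phi(s,a) - sum_b pi(b|s) phi(s,b)
        grad_log = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
        theta += alpha * G * grad_log
    return theta
```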

# CMU_10703

Skimming Lectures 1-5

### Model-based RL

• Easier to transfer across tasks
• Even when rewards are sparse, the experience can still be used to learn the model
• Closer to how humans appear to learn

• Learning a model first and then building a value function from it can be a more error-prone route

### Models

Model:

1. Transition function
2. Reward function

Distribution Model:

1. Transition probability $T(s'|s,a)$

Sample model:
Simulates the environment, generating sample transitions starting from a given state.
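
A tiny illustration of the distinction: a distribution model exposes $T(s'|s,a)$ explicitly, while a sample model only draws one next state from the same dynamics (all numbers below are a made-up toy):

```python
import numpy as np

# Distribution model: full next-state probabilities for each (s, a); toy values
T = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.1, 1: 0.9},
}

def sample_model(s, a, rng=np.random.default_rng()):
    """Sample model: generate a single next state from the same dynamics."""
    next_states, probs = zip(*T[(s, a)].items())
    return rng.choice(next_states, p=probs)
```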