# COMPGI13 Memo

## Lecture #1

Value-Based, Policy-Based, and Actor-Critic methods

Model-Based and Model-Free

Reinforcement Learning Problem vs Planning Problem

## Lecture #2

### MRP

Bellman equation and its matrix (vectorised) form

The Bellman equation can be solved directly with linear algebra: $v = (I - \gamma P)^{-1} R$.
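A minimal numpy sketch of this closed-form solve; the 3-state transition matrix and rewards below are made-up illustrative values:

```python
import numpy as np

# Hypothetical 3-state MRP: P[i, j] = probability of moving from state i to j,
# R[i] = expected immediate reward in state i, gamma = discount factor.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])   # last state is absorbing
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v
# Closed-form solution: v = (I - gamma * P)^{-1} R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```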

The direct solve is $O(n^3)$, so it is only practical for small MRPs; large, dense MRPs need iterative methods.

Solvers

- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning

### MDP

Bellman expectation equation for $V^\pi$

Bellman expectation equation for $Q^\pi$

Composed (two-step) Bellman expectation equation for $V^\pi$, written via $Q^\pi$

Composed (two-step) Bellman expectation equation for $Q^\pi$, written via $V^\pi$
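For reference, the standard forms of these four equations (writing $\mathcal{P}_{ss'}^a$ for transition probabilities and $\mathcal{R}_s^a$ for expected rewards; the last two are obtained by substituting one equation into the other):

$$
\begin{aligned}
V^\pi(s)   &= \sum_{a} \pi(a \mid s)\, Q^\pi(s,a) \\
Q^\pi(s,a) &= \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, V^\pi(s') \\
V^\pi(s)   &= \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, V^\pi(s') \Big) \\
Q^\pi(s,a) &= \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \sum_{a'} \pi(a' \mid s')\, Q^\pi(s',a')
\end{aligned}
$$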

## Lecture #3

### Policy Based

Steps (sketched in code after the list):

- Policy Evaluation
- Policy Improvement
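A minimal tabular sketch of these two steps looping to convergence (policy iteration), assuming known dynamics; the tiny 2-state, 2-action MDP and all names are illustrative only:

```python
import numpy as np

# Hypothetical tabular MDP: P[a, s, s'] transition probs, R[a, s] expected rewards.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
for _ in range(100):
    # Policy evaluation: solve v = R_pi + gamma * P_pi v for the current policy.
    P_pi = P[policy, np.arange(n_states)]
    R_pi = R[policy, np.arange(n_states)]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: act greedily with respect to the evaluated value function.
    q = R + gamma * P @ v                     # q[a, s]
    new_policy = q.argmax(axis=0)
    if np.array_equal(new_policy, policy):    # stable policy => done
        break
    policy = new_policy

print(policy, v)
```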

### Value Based

## Lecture #4

### MC Learning
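The standard incremental MC update the note below refers to:

$$
V(S_t) \leftarrow V(S_t) + \alpha \big( G_t - V(S_t) \big)
$$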

where $G_t$ is the actual, measured return (sampled from a complete episode).

### Temporal-Difference Learning
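The standard TD(0) update the notes below refer to:

$$
V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big)
$$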

where $R_{t+1} + \gamma V(S_{t+1})$ is the estimated return.

$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target.

$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.

### Comparison

- TD can learn before knowing the final outcome
- TD can learn online after every step.
- MC must wait until end of episode before return is known

- TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing ( non-terminating ) environments
- MC only works for episodic (terminating) environments

- MC has high variance, zero bias
  - Good convergence properties
  - (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use

- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $v_\pi (s)$
  - (but not always with function approximation)
  - More sensitive to initial value

- TD exploits Markov property
- Usually more effective in Markov environments

- MC does not exploit Markov property
- Usually more effective in non-Markov environments

## Lecture #5

### On and Off policy learning

The key to distinguishing on-policy from off-policy is whether the policy (or value function) being estimated is the same as the policy used to generate the samples. If they are the same, it is on-policy; otherwise it is off-policy.^{1}

- On-policy
  - "Learn on the job"
  - Learn about policy $\pi$ from experience sampled from $\pi$

- Off-policy
  - "Look over someone's shoulder"
  - Learn about policy $\pi$ from experience sampled from $\mu$

### Sarsa (On-policy)

Updating the action-value function:
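In the standard form, updating toward the reward plus the discounted value of the next state-action pair actually taken:

$$
Q(S,A) \leftarrow Q(S,A) + \alpha \big( R + \gamma Q(S',A') - Q(S,A) \big)
$$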

### Off-Policy Learning

- Learn by observing humans or other agents
- Re-use experience generated from old policies to learn a new one
- Learn about the optimal policy while following an exploratory policy
- Learn about multiple policies while following one policy

#### Importance Sampling

##### Off-Policy MC Importance Sampling

- Sample from $\mu$ to evaluate $\pi$
- Weight the return $G_t$ by the similarity (importance ratio) between $\pi$ and $\mu$ along the episode

- Update $V$ toward the corrected return (formula below)

- Cannot be used if $\mu$ is zero where $\pi$ is non-zero
- Can dramatically increase variance
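In the standard notation, the importance-sampling-corrected return and the corresponding update are:

$$
G_t^{\pi/\mu} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}\, G_t,
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \big( G_t^{\pi/\mu} - V(S_t) \big)
$$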

##### Off-Policy TD Importance Sampling

- Weight the TD target $R + \gamma V(S')$ by a single importance ratio (update below)
- Only one importance-sampling correction is needed

- Much lower variance than MC importance sampling
- The policies only need to be similar over a single step
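The off-policy TD(0) update with the single one-step correction:

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \big( R_{t+1} + \gamma V(S_{t+1}) \big) - V(S_t) \right)
$$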

#### Q-Learning

- No importance sampling is needed
- The next action is sampled from the behaviour policy: $A_{t+1} \sim \mu(\cdot \mid S_t)$
- An alternative successor action is considered, sampled from the target policy: $A' \sim \pi(\cdot \mid S_t)$
- The Q function is updated toward the value of this alternative action

Both the behaviour policy and the target policy can then be improved simultaneously.

Now assume the target policy $\pi$ is greedy with respect to $Q(s,a)$.

The behaviour policy $\mu$ is $\epsilon$-greedy with respect to $Q(s,a)$.

The Q-learning target then simplifies to the standard form below, giving the familiar update:
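$$
R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'),
\qquad
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \big( R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t) \big)
$$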

## Lecture #6 Value Function Approximation

Use a parameterised function to approximate the lookup table: this not only gives a more compact representation, but also generalises to states that have not been visited.

Possible function approximators (a linear sketch follows the list):

- Linear
- Neural Network
- Decision Tree
- Nearest Neighbor
- Fourier/ Wavelet
- …
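A minimal sketch of the linear case with a semi-gradient TD(0) update; the feature map, step size, and integer-indexed states are illustrative assumptions:

```python
import numpy as np

def features(state, n_features=4):
    """Hypothetical feature map x(s); assumes integer-indexed states."""
    rng = np.random.default_rng(state)        # fixed, deterministic features per state
    return rng.standard_normal(n_features)

def td0_linear_update(w, s, r, s_next, alpha=0.1, gamma=0.9):
    """Semi-gradient TD(0) step for a linear approximator v_hat(s, w) = x(s) . w."""
    x, x_next = features(s), features(s_next)
    td_target = r + gamma * np.dot(x_next, w)
    td_error = td_target - np.dot(x, w)
    return w + alpha * td_error * x           # delta_w = alpha * td_error * x(s)

w = np.zeros(4)
w = td0_linear_update(w, s=0, r=1.0, s_next=1)
print(w)
```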

### DQN experience replay

DQN combines experience replay with a fixed Q-target network:

- Use an $\epsilon$-greedy policy to select an action $a_t$
- Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$
- Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$
- Compute Q-learning targets using the old, fixed parameters $w^-$
- Optimise the MSE between the Q-network and the Q-learning targets (the loss is given below)
- using stochastic gradient descent
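The loss referred to above, in the standard DQN form with fixed target parameters $w^-$:

$$
\mathcal{L}_i(w_i) = \mathbb{E}_{(s,a,r,s') \sim D} \Big[ \big( r + \gamma \max_{a'} Q(s',a';w^-) - Q(s,a;w_i) \big)^2 \Big]
$$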

### DQN plays Atari

- End-to-end learning of the value function $Q(s,a)$ from raw pixels
- The state $s$ is a stack of the latest 4 frames
- The output is a $Q(s,a)$ value for each of the 18 joystick/button actions
- The reward is the change in score for that step

## Lecture #7 Policy Gradient Methods

Unlike value-based methods, policy-gradient methods model the policy directly.

| RL approach | Feature |
|---|---|
| Value-Based | Learn a value function; the policy is implicit |
| Policy-Based | Learn the policy directly |
| Actor-Critic | Learn a policy and a value function simultaneously |

# CMU_10703

Skimmed Lectures 1-5

## Lecture 6 Planning and Learning: Dyna, Monte Carlo Tree Search

### Model-based RL

Advantages:

- The learned model transfers easily to related tasks
- Even when rewards are sparse, the experience can still be used (to learn the model)
- Closer to how humans appear to learn
- Helps with exploration

Disadvantage:

- The model must be learned first and the value function built on top of it, which can be more error-prone (two sources of approximation error)

### Models

Model:

- Transition function
- Reward function

Distribution Model:

- Transition probability $T(s' \mid s, a)$

Sampling model:

Simulates the environment: generates sample transitions starting from a given state.
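A minimal sketch of a tabular sampling model learned from experience (a table-lookup model in the Dyna spirit; the class and all names are illustrative):

```python
import random
from collections import defaultdict

class SamplingModel:
    """Table-lookup model: remembers observed transitions and replays them on demand."""

    def __init__(self):
        # For each (state, action), keep the list of observed (reward, next_state).
        self.transitions = defaultdict(list)

    def update(self, state, action, reward, next_state):
        """Record one piece of real experience."""
        self.transitions[(state, action)].append((reward, next_state))

    def sample(self, state, action):
        """Generate simulated experience from a given state-action pair."""
        return random.choice(self.transitions[(state, action)])

# Usage: record real experience, then generate simulated transitions for planning.
model = SamplingModel()
model.update("s0", "a0", 1.0, "s1")
print(model.sample("s0", "a0"))    # -> (1.0, 's1')
```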