
Introduction

In previous lectures we dealt with MDPs where the environment was fully observable and the model was known.

That is, we knew the transition probabilities and the rewards.

Now we do not know the model, i.e. we do not know the transition probabilities or the rewards.

In this lecture we do prediction: given a policy, we need to find the value function for that policy.

Monte Carlo Learning (MC)

The idea is to learn from complete episodes: for any state, the mean of the returns $ G_t $ observed from that state will be equal to its value function.

MC requires complete episodes, so the MDP must be terminating (episodic).

Remember

$ V_{\pi}[s] = E[ G_t | S_t=s] $

First Visit MC

For evaluating state s

Let t be the first time we visit state s in any episode.

We compute $ G_t $ from that point and average it over all episodes.

By the law of large numbers, this sample mean converges to the expected value once we have seen a very large number of episodes.

Every-Visit MC

For evaluating state s

In any episode, every time we visit state s we compute the corresponding $ G_t $, and we take the mean over all of these $ G_t $ values.
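A minimal sketch of both variants, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state; the names (`episodes`, `gamma`, `mc_prediction`) are illustrative, not from the lecture.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9, first_visit=True):
    """Monte Carlo prediction: V[s] is the mean of the returns G_t observed
    at visits to s (first visit only, or every visit)."""
    returns_sum = defaultdict(float)   # running sum of returns per state
    returns_count = defaultdict(int)   # number of counted visits per state
    V = defaultdict(float)

    for episode in episodes:           # episode = [(s_0, R_0), (s_1, R_1), ...]
        # Sweep backwards so that returns[t] = R_t + gamma * returns[t + 1].
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G

        seen = set()
        for t, (s, _) in enumerate(episode):
            if first_visit and s in seen:   # first-visit MC ignores repeat visits
                continue
            seen.add(s)
            returns_count[s] += 1
            returns_sum[s] += returns[t]
            V[s] = returns_sum[s] / returns_count[s]
    return V
```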

Incremental MC

The mean can be computed incrementally, so we can use this fact in the MC methods above:

$ \mu_{k} = \mu_{k-1} + \frac{1}{k} * (x_k - \mu_{k-1}) $

But instead of using the growing count $ k $ (in MC, the number of visits $ N(s) $) in the above equation, we can use a constant step size $ \alpha $.

The intuition is that we do not need to remember what happened far back in the past, so it is better to gradually forget it. Using a constant step size does exactly that, and it also suits non-stationary problems.
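A sketch of the two update rules, assuming plain dicts `V` (value estimates) and `N` (visit counts); both function names are illustrative.

```python
def incremental_mc_update(V, N, state, G):
    """Running-mean update: V(s) <- V(s) + (1 / N(s)) * (G - V(s))."""
    N[state] = N.get(state, 0) + 1
    v = V.get(state, 0.0)
    V[state] = v + (1.0 / N[state]) * (G - v)

def constant_alpha_update(V, state, G, alpha=0.1):
    """Same update with a constant step size: old episodes are gradually
    forgotten, which also suits non-stationary problems."""
    v = V.get(state, 0.0)
    V[state] = v + alpha * (G - v)
```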

In MC, we update our values only after completing the whole episode, and this will not work for non-terminating MDPs.

Temporal Difference (TD)

In TD, we do not wait for the episode to end: we bootstrap and keep improving our guess.

First we make a guess for the current state $ s $. Then we move one timestep to the successor state $ s' $, receiving a reward along the way. Now we update our guess for $ s $ using our current guess for $ s' $ together with the piece of actual reality that we experienced, the reward.

$ V[s] \leftarrow V[s] + \alpha * ( R_t + \gamma * V[s'] - V[s] ) $

The quantity in the brackets is the TD error:

$ \delta_t = R_t + \gamma * V[S_{t+1}] - V[S_t] $

$ \alpha $ acts as a kind of learning rate, if you will.
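A minimal TD(0) sketch. The environment interface is an assumption: `env.reset()` returns a start state, `env.step(action)` returns `(next_state, reward, done)`, and `policy(state)` returns an action.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0): after every step, move V[s] toward the bootstrapped target
    R + gamma * V[s'] instead of waiting for the complete return."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + gamma * V[s_next] * (not done)  # no bootstrapping past a terminal state
            V[s] += alpha * (target - V[s])              # V[s] <- V[s] + alpha * delta_t
            s = s_next
    return V
```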

Bias Vs Variance

In RL, variance refers to a noisy but on average accurate value estimate, whereas bias refers to a stable but inaccurate value estimate. To make this more concrete, imagine a game of darts. A high-bias player's darts land tightly clustered but consistently off from the bullseye in some direction. A high-variance player's darts are scattered: sometimes on target, sometimes off, but centred on the bullseye on average.

MC vs TD

  1. MC is offline: we must wait until the end of the episode to update. This is not ideal, because we may encounter something along the way that could have led to a negative reward but happened not to this time; MC does not account for that danger in between.

  2. TD is online:

        a) TD can learn from incomplete episodes; MC cannot.

        b) TD can learn in non-terminating environments; MC cannot.

  3. MC returns are noisy, so MC has high variance but low bias. TD has low variance but higher bias.

  4. MC has good convergence properties and converges even with function approximation.

  5. TD also converges to the true value function, but may not when combined with function approximation.

  6. MC is less sensitive to the initial values than TD (TD bootstraps from them).

  7. MC converges to the solution with minimum mean-squared error, i.e. the best fit to the observed returns. This also means it does not exploit the Markov property.

  8. TD converges to the solution of the maximum-likelihood Markov model that explains the observed episodes. This also means it does exploit the Markov property.

TD( $ \lambda $ )

Let's first look at n-step TD. Instead of improving our guess after taking one step, we improve it after taking n steps.

$$ g_t^n = R_{t} + \gamma * R_{t+1} + \gamma^2 * R_{t+2} + \dots + \gamma^{n-1} * R_{t+n-1} + \gamma^{n} * V[S_{t+n}] $$

$$ V[s] = V[s] + \alpha * (g_t^n - V[s]) $$
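A sketch of the n-step return and update, assuming an episode recorded as lists `states` (length T+1, ending in the terminal state) and `rewards` (length T, with `rewards[t]` the reward received on leaving `states[t]`), and a dict `V` of value estimates; all names are illustrative.

```python
def n_step_return(rewards, states, V, t, n, gamma=0.9):
    """g_t^n = R_t + gamma*R_{t+1} + ... + gamma^{n-1}*R_{t+n-1} + gamma^n * V[S_{t+n}],
    truncated at the end of the episode (a terminal state is worth 0)."""
    T = len(rewards)                               # episode length; states has length T + 1
    steps = min(n, T - t)                          # real rewards available from time t
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n <= T:                                 # bootstrap from V[S_{t+n}] if that state exists
        g += gamma ** n * V.get(states[t + n], 0.0)
    return g

def n_step_td_update(V, states, rewards, t, n, alpha=0.1, gamma=0.9):
    """V[S_t] <- V[S_t] + alpha * (g_t^n - V[S_t])."""
    s = states[t]
    g = n_step_return(rewards, states, V, t, n, gamma)
    V[s] = V.get(s, 0.0) + alpha * (g - V.get(s, 0.0))
```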

Different values of n give different errors and call for different settings of $ \alpha $.

We would like a way to take the best of all n, i.e. to consider all n at once.

Hence, we can average $ g_t^n $ over all n.

Forward view of TD( $ \lambda $ )

Here, we take a geometrically weighted average of all the $ g_t^n $:

$$ g_t^{\lambda} = (1-\lambda) * \sum_{n=1}^{\infty} \lambda^{n-1} * g_t^n $$

$$ V[s] = V[s] + \alpha * (g_t^{\lambda} - V[s]) $$

Note that this is forward looking: just like MC, we need to wait until the end of the episode to compute $ g_t^{\lambda} $ and update V[s].
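A sketch of the λ-return for an episodic task, reusing the assumed `n_step_return` helper from the n-step sketch above. Since only finitely many n-step returns exist before the episode ends, this sketch follows the common convention of giving the leftover geometric weight to the full return, so the weights still sum to 1.

```python
def lambda_return(rewards, states, V, t, lam=0.9, gamma=0.9):
    """Forward-view lambda-return: a geometrically weighted average of the
    n-step returns, with the remaining weight on the full (MC) return."""
    T = len(rewards)
    g_lambda = 0.0
    for n in range(1, T - t):                      # n-step returns that still bootstrap
        g_lambda += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    g_full = n_step_return(rewards, states, V, t, T - t, gamma)   # reaches the episode end
    g_lambda += lam ** (T - t - 1) * g_full
    return g_lambda
```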

Backward view of TD( $ \lambda $ )

The eligibility trace, $ E_t[s] $, combines:

  1. Frequency Heuristic

  2. Recency Heuristic

$$ E_t[s] = \gamma * \lambda * E_{t-1}[s] + \mathbf{1}(S_t = s) $$

$$ \delta_t = R_t + \gamma * V[S_{t+1}] - V[S_t] $$

$$ V[s] = V[s] + \alpha * \delta_t * E_t[s] $$

**V[s] is updated for EVERY state, not just for $ S_t $.**

$ \lambda $ is the decay rate, i.e. it controls how far back we want the information (the TD error) to flow.
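A sketch of the backward view with eligibility traces, under the same assumed `env` / `policy` interface as the TD(0) sketch above.

```python
from collections import defaultdict

def td_lambda_backward(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9, lam=0.9):
    """Backward-view TD(lambda): at every step, broadcast the TD error delta_t
    to all states in proportion to their eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                     # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error delta_t
            E[s] += 1.0                            # E_t[s] = gamma*lambda*E_{t-1}[s] + 1(S_t = s)
            for state in list(E):                  # update EVERY state, not just S_t
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam            # decay all traces for the next step
            s = s_next
    return V
```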

It can be proven that, summed over a complete episode (offline updates), the backward view and the forward view are exactly equivalent.