Introduction

In the last lecture we did model-free prediction; now we want to do model-free control, which means we want to find the optimal policy.

If we go back to Lecture 3 :- Introduction to Dynamic Programming, the path we followed there was: first learn how to do policy evaluation to find the state-value function, and then take a greedy step to form a new policy according to that state-value function.

We can follow the same steps here too.

Policy Iteration using Monte-Carlo

Problem 1:-

In the Policy Improvement step :- we use greedy policy improvement, i.e.

$$ \pi[s] = \arg\max_a \left( R^a_s + \sum_{s'} P^a_{ss'} * v[s'] \right) $$

But the problem is that this requires the transition probabilities $P^a_{ss'}$, i.e. a model of the environment.

So we will use the action-value function instead:

$$ \pi[s] = \arg\max_a Q[s,a] $$

Problem 2:-

If we keep following a purely greedy policy, it is highly unlikely that we will cover the whole state-action space. That is bad, because by exploring we might discover a policy with a better value function.

Exploring starts :- one way to solve this is to start episodes from every state-action pair infinitely many times. This guarantees that all states will be visited, but it is usually not a good idea, especially when learning from actual experience, because it is unrealistic to assume that we can start from any state. Often we can only start from a small subset of states.

So we will use $\epsilon$-greedy policy improvement: with probability $\epsilon$ we choose an action uniformly at random, and otherwise we act greedily.

It can be proven that this $\epsilon$-greedy step is still a policy improvement, and (with the GLIE conditions below) the overall scheme converges to the optimal policy.
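Here is a minimal sketch of $\epsilon$-greedy action selection over a tabular Q. The function name and the (state, action)-keyed dictionary are my own choices, not from the lecture:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick an action for `state`: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: argmax_a Q[s,a]
```

Note that every action keeps a non-zero probability of being selected, which is what gives us the continued exploration.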

Policy Evaluation :- Monte-Carlo policy evaluation on the action-value function Q

Policy Improvement :- $\epsilon$-greedy policy improvement

We can improve the above algorithm by improving the policy after every episode instead of waiting for many episodes.

GLIE

The GLIE (Greedy in the Limit with Infinite Exploration) property says that:

  1. All state-action pairs must be explored infinitely many times.

  2. The policy must eventually converge to a greedy policy (with respect to Q).

For example, $\epsilon$-greedy is GLIE if we use $\epsilon_k = 1/k$, where $k$ is the episode number.
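Putting the pieces together, here is a rough sketch of GLIE Monte-Carlo control. It assumes a hypothetical `generate_episode(policy)` helper that runs one episode and returns a list of (state, action, reward) triples, and it reuses the `epsilon_greedy` helper from above:

```python
from collections import defaultdict

def glie_mc_control(generate_episode, actions, gamma, num_episodes):
    Q = defaultdict(float)       # Q[(s, a)]
    N = defaultdict(int)         # visit counts N[(s, a)]
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k        # GLIE schedule: epsilon_k = 1/k
        policy = lambda s: epsilon_greedy(Q, s, actions, epsilon)
        episode = generate_episode(policy)            # [(S_t, A_t, R_{t+1}), ...]
        G = 0.0
        for s, a, r in reversed(episode):             # accumulate returns backwards
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # incremental mean of returns
    return Q
```

The policy is improved after every episode, and $\epsilon$ decays as $1/k$, matching the GLIE conditions above.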

Policy Iteration using TD with Action Values (SARSA)

$$ Q[s,a] = Q[s,a] + \alpha * ( R + \gamma * Q[S',A'] - Q[s,a]) $$

At every timestep within an episode, update Q[S,A] and then do $\epsilon$-greedy policy improvement.

There are conditions for the convergence of SARSA (a GLIE sequence of policies and a suitable step-size sequence), but let's not worry about them for now.
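A minimal sketch of tabular SARSA. It assumes a gym-style environment where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)` (that interface is my assumption, not from the lecture), and reuses `epsilon_greedy` from above:

```python
from collections import defaultdict

def sarsa(env, actions, alpha, gamma, epsilon, num_episodes):
    Q = defaultdict(float)                              # Q[(s, a)]
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # update towards the SARSA target
            s, a = s_next, a_next
    return Q
```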

n-step SARSA and SARSA($\lambda$)

Just as we have n-step TD, we have n-step SARSA, with the n-step Q-return $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q[S_{t+n}, A_{t+n}]$. SARSA($\lambda$), the forward view, averages over all n-step returns:

$$ q_t^{\lambda} = (1-\lambda) * \sum_{n=1}^{\infty} \lambda^{n-1} * q_t^{(n)} $$

$$ Q[S_t,A_t] = Q[S_t,A_t] + \alpha * ( q_t^{\lambda} - Q[S_t,A_t]) $$
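A minimal sketch of computing the forward-view $\lambda$-return, assuming a finished episode stored as (S_t, A_t, R_{t+1}) triples and a tabular Q keyed by (state, action). Since the episode is finite, the tail of the infinite sum collapses onto the full Monte-Carlo return; the helper name is mine, not from the lecture:

```python
def sarsa_lambda_returns(episode, Q, gamma, lam):
    """Compute q_t^lambda for every t of a finished episode of (S, A, R') triples."""
    T = len(episode)
    q_lambda = []
    for t in range(T):
        total, weight = 0.0, 1.0           # weight tracks lambda^(n-1)
        G = 0.0                            # discounted reward sum so far
        for n in range(1, T - t):          # truncated n-step returns that bootstrap on Q
            G += gamma ** (n - 1) * episode[t + n - 1][2]
            s_n, a_n, _ = episode[t + n]
            q_n = G + gamma ** n * Q[(s_n, a_n)]
            total += (1 - lam) * weight * q_n
            weight *= lam
        # the last term is the full Monte-Carlo return G_t with all the leftover weight
        G += gamma ** (T - t - 1) * episode[T - 1][2]
        total += weight * G
        q_lambda.append(total)
    return q_lambda
```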

Backward View of SARSA($\lambda$)

The idea/intuition is the same as for TD($\lambda$), but in SARSA($\lambda$) we work with the action-value function instead of the state-value function. The TD error is

$$ \delta_t = R_{t+1} + \gamma * Q[S_{t+1},A_{t+1}] - Q[S_{t},A_{t}] $$

and for every state-action pair (s,a) we keep an eligibility trace:

$$ E_t[s,a] = \gamma * \lambda * E_{t-1}[s,a] + \mathbf{1}(S_t = s, A_t = a) $$

$$ Q[s,a] = Q[s,a] + \alpha * \delta_t * E_t[s,a] $$

$ \lambda $ decides how far back we want the information (the TD error) to flow.
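A minimal sketch of backward-view SARSA($\lambda$) with accumulating eligibility traces, using the same assumed gym-style environment and `epsilon_greedy` helper as the SARSA sketch above:

```python
from collections import defaultdict

def sarsa_lambda(env, actions, alpha, gamma, lam, epsilon, num_episodes):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                      # eligibility traces, reset each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            E[(s, a)] += 1.0                        # accumulating trace for the visited pair
            for key in list(E):                     # update every traced state-action pair
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam               # decay the traces
            s, a = s_next, a_next
    return Q
```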

Everything we have seen so far in this lecture is on-policy learning: we learn about and improve the same policy that generates our behaviour.

Off-policy Learning

Here we have a target policy, $\pi$, that we want to evaluate and improve, and a behaviour policy, $\mu$, from which we choose our actions.

Method 1 :- Importance Sampling

But this does not work well here: in MC the importance-sampling ratios multiply over the whole episode, so the variance blows up, and even the single-step correction used in TD still increases the variance.
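For reference, a single importance-sampled TD(0) update on the state-value function looks roughly like this; here `pi_prob` and `mu_prob` are assumed callables returning $\pi(a|s)$ and $\mu(a|s)$, and `V` is a tabular value function:

```python
def off_policy_td0_update(V, s, a, r, s_next, alpha, gamma, pi_prob, mu_prob):
    """One importance-sampled TD(0) update of V towards the target policy pi."""
    rho = pi_prob(a, s) / mu_prob(a, s)     # single-step importance ratio
    target = rho * (r + gamma * V[s_next])  # reweight the TD target
    V[s] += alpha * (target - V[s])
    return V
```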

Method 2 :- Q-Learning

  1. The next action is chosen using the behaviour policy: $A_{t+1} \sim \mu(\cdot|S_{t+1})$.

  2. But we consider an alternative successor action $A' \sim \pi(\cdot|S_{t+1})$ according to the target policy $\pi$ that we want to learn.

Basically, the idea is to behave according to the behaviour policy but update Q towards the value of the target policy.

$$ Q[S_t,A_t] = Q[S_t,A_t] + \alpha * ( R_{t+1} + \gamma * Q[S_{t+1},A'] - Q[S_t,A_t]) $$

Off-Policy Control with Q-Learning

  1. Our target policy, $\pi$, is greedy wrt Q.

  2. Our behaviour policy, $\mu$, is $\epsilon$-greedy wrt Q.

  3. This will improve both target and behaviour policy.

$$ Q[S_t,A_t] = Q[S_t,A_t] + \alpha * ( R_{t+1} + \gamma * \max_{A'} Q[S_{t+1},A'] - Q[S_t,A_t]) $$

Again, it can be proven that this converges to the optimal action-value function, and hence to the optimal policy.
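A minimal sketch of tabular Q-Learning control, again with the assumed gym-style environment and the `epsilon_greedy` helper from above acting as the behaviour policy:

```python
from collections import defaultdict

def q_learning(env, actions, alpha, gamma, epsilon, num_episodes):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)   # behave with the epsilon-greedy policy mu
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # update towards the greedy target
            s = s_next
    return Q
```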

Q-Learning vs SARSA

  1. Q-Learning directly learns the optimal policy, while SARSA learns a near-optimal policy. To get the optimal policy with SARSA, we need to decrease $\epsilon$ continuously, meaning $\lim_{t \to \infty} \epsilon_t = 0$.

  2. SARSA is more conservative than Q-Learning: because it learns on-policy, it accounts for its own exploratory actions, so it tends towards a near-optimal but safer policy, whereas Q-Learning learns the optimal but potentially more dangerous policy (the classic cliff-walking example).

  3. Q-Learning has higher variance as compared to SARSA.