Slides

Lecture

Introduction

In this lecture, we are working with planning, which means we are told how the environment works, i.e. we are given the reward function and the state transition probabilities.

And we have MDPs, which means the environment is fully observable.

Policy Evaluation

Now, we are given a policy $ \pi $ and an MDP $ \langle S, A, P, R, \gamma \rangle $,

and we need to find the value function for that policy

Now, according to the Bellman expectation equation:

$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \, q_{\pi}(s, a) $

$ q_{\pi}(s, a) = R_s^a + \gamma \sum_{s' \in S} P^a_{ss'} \, v_{\pi}(s') $

$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P^a_{ss'} \, v_{\pi}(s') \right) $

Method - 1 :- Direct Solution

We can average the reward function and the state transition probabilities over our current policy $ \pi $; then

$ v_{\pi}(s) = R^{\pi}_s + \gamma \sum_{s' \in S} P^{\pi}_{ss'} \, v_{\pi}(s') $

Now, we can write the above equation in matrix form, $ v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi} $, and solve for the value function directly by matrix inversion: $ v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi} $.
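A minimal sketch of this direct method, assuming the policy-averaged quantities have already been computed into hypothetical numpy arrays `P_pi` and `R_pi`:

```python
import numpy as np

def policy_evaluation_direct(P_pi, R_pi, gamma):
    """Solve v = R_pi + gamma * P_pi @ v exactly.

    P_pi : (n, n) array, P_pi[s, s2] = sum_a pi(a|s) * P(s2 | s, a)
    R_pi : (n,)   array, R_pi[s]     = sum_a pi(a|s) * R(s, a)
    """
    n = len(R_pi)
    # v = (I - gamma * P_pi)^(-1) R_pi; np.linalg.solve avoids forming the inverse explicitly
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```

Since solving this linear system costs roughly $ O(|S|^3) $, the direct method is only practical for small state spaces.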

Method - 2 :- Iterative Policy Evaluation

Here, we start with an arbitrary value function and improve it over many iterations by repeatedly applying the Bellman expectation equation:

$ v_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P^a_{ss'} \, v_k(s') \right) $
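A minimal sketch of this iterative backup, assuming hypothetical arrays `P` of shape `(|S|, |A|, |S|)`, `R` of shape `(|S|, |A|)`, and a policy `pi` of shape `(|S|, |A|)`:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Repeatedly apply the Bellman expectation backup until the values stop changing.

    P  : (|S|, |A|, |S|) transition probabilities P[s, a, s']
    R  : (|S|, |A|)      expected rewards R[s, a]
    pi : (|S|, |A|)      policy probabilities pi(a|s)
    """
    v = np.zeros(R.shape[0])                    # arbitrary initial value function
    while True:
        q = R + gamma * P @ v                   # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        v_new = np.sum(pi * q, axis=1)          # v[s] = sum_a pi(a|s) * q[s, a]
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```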

Principle of optimality

It means that an optimal policy is one that

  1. first takes the optimal action,

  2. and then follows the optimal policy from the successor state.

A policy $ \pi(a|s) $ achieves the optimal value from state $ s $, $ v_{\pi}(s) = v_{*}(s) $, if and only if, for any state $ s' $ reachable from $ s $, $ \pi $ achieves the optimal value from $ s' $, i.e. $ v_{\pi}(s') = v_{*}(s') $.

Policy Improvement Theorem

Let $ \pi $ and $ \pi' $ be two deterministic policies such that, for all states $ s $,

$$ q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s) $$

Then the policy $ \pi' $ must be as good as, or better than, $ \pi $, i.e.

$ v_{\pi'}(s) \geq v_{\pi}(s) $ for all states $ s $.
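A sketch of the standard argument: repeatedly applying the assumption inside the expectation unrolls $ v_{\pi} $ into the return obtained by following $ \pi' $:

$$ v_{\pi}(s) \leq q_{\pi}(s, \pi'(s)) = \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s \right] \leq \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \leq \dots \leq \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right] = v_{\pi'}(s) $$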

Policy Improvement

Here, we are given a policy and we need to improve it, i.e. starting from the given policy we need to find the optimal policy and its corresponding value function.

We evaluate the current policy $ \pi $, finding its state-value function using iterative policy evaluation, and then greedily update our policy.

  1. $ \pi^{'} = greedy(v_{\pi}) $

  2. This process always converges to $ \pi_{*} $

By greedy we mean that in each state we choose the action that leads to the maximum value.

Let $ \pi $ be a deterministic policy

$ \pi'(s) = \operatorname{argmax}_{a} q_{\pi}(s, a) $

We can prove that $ \pi' $ will always be at least as good as $ \pi $,

and

after several iterations we will converge to $ \pi_{*} $.

** We may also stop iterative policy evaluation after some $ k $ iterations within each improvement step, instead of waiting for the value function to fully converge. **
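Putting evaluation and greedy improvement together, a minimal policy iteration sketch, reusing the hypothetical `P`, `R` arrays and the `iterative_policy_evaluation` sketch above:

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)       # start from a uniform random policy
    while True:
        v = iterative_policy_evaluation(P, R, pi, gamma, tol)  # evaluate the current policy
        q = R + gamma * P @ v                                   # q_pi[s, a]
        pi_new = np.eye(n_actions)[np.argmax(q, axis=1)]        # pi'(s) = argmax_a q_pi(s, a)
        if np.array_equal(pi_new, pi):                          # greedy policy unchanged -> optimal
            return pi, v
        pi = pi_new
```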

Value Iteration

The idea is that if we know the optimal value function for a state, then we can use the Bellman optimality equation to obtain the optimal value function for its preceding states.

The intuition is to start from the final/terminating states and work backwards.

$ v_{k+1}(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P^a_{ss'} \, v_k(s') \right) $

It can also be proven that this will converge to the optimal value function, and hence to an optimal policy.

Notice that here the intermediate value functions do not have any direct interpretation, i.e. there may not exist a policy whose value function equals a particular intermediate value function.
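A minimal sketch of synchronous value iteration under the same assumed `P` and `R` layout:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality backup to every state until the values stop changing."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * P @ v          # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        v_new = np.max(q, axis=1)      # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, np.argmax(q, axis=1)   # optimal values and a greedy (optimal) policy
        v = v_new
```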

Asynchronous DP

In synchronous DP we update the value function of all states at once, but in asynchronous DP we update the value functions of individual states.

This helps reduce computation.

It will still converge, provided all states continue to be selected.

In-place DP

Keep only one copy of the value function and update it in place as we sweep through the states, so later updates immediately use the newest values.
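For contrast with the synchronous sketch above, a minimal in-place variant (same hypothetical `P`, `R` arrays): only one copy of `v` is kept, so later backups in the same sweep already see the freshly updated values.

```python
import numpy as np

def value_iteration_in_place(P, R, gamma, n_sweeps=100):
    """In-place value iteration: a single value array, updated state by state."""
    v = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        for s in range(R.shape[0]):
            # this backup immediately uses any values already updated in this sweep
            v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```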

Priority based DP

Select which state to update based on its priority.

Priority can be defined as the Bellman error.

$ \text{priority}(s) = \left| \max_a \left( R_s^a + \gamma \sum_{s'} P^a_{ss'} \, v(s') \right) - v(s) \right| $
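A minimal sketch of prioritised sweeping using that Bellman error as the priority (same hypothetical `P`, `R` arrays; a max-heap is simulated by negating priorities). A full implementation would also re-prioritise the predecessors of each backed-up state.

```python
import heapq
import numpy as np

def bellman_error(P, R, v, gamma, s):
    """| max_a ( R[s, a] + gamma * sum_s' P[s, a, s'] * v[s'] ) - v[s] |"""
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

def prioritized_sweeping(P, R, gamma, n_backups=1000):
    """Always back up the state whose current Bellman error is largest."""
    n_states = R.shape[0]
    v = np.zeros(n_states)
    heap = [(-bellman_error(P, R, v, gamma, s), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        _, s = heapq.heappop(heap)
        v[s] = np.max(R[s] + gamma * P[s] @ v)                          # Bellman optimality backup
        heapq.heappush(heap, (-bellman_error(P, R, v, gamma, s), s))    # re-insert with new priority
    return v
```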

Real Time DP

Run a real agent in the environment and update each state as it is visited.