4 minutes
Introdution to Dynamic Programming
Introduction
In this lecture, We are working with Planing means someone gives us how the environment works i.e. someone tells us rewards function and State probability transition
And we have MDPs means environment is fully observable.
Policy Evaluation
Now, we are given a policy $ \pi $ and our MDP = < S,A,P,R, $ \gamma $ >
and we need to find the value function for that policy
Now Acc. to Bellman expectation equation
$ v_{\pi}[s] = \sum \pi[a|s] * q_{\pi}[s, a] $
$ q_{\pi}[s, a] = R_s^a + \sum P^a_{s s^{'}} * v_{\pi}[s^{'}] $
$ v_{\pi}[s] = \sum \pi[a|s] * ( R_s^a + \sum P^a_{s s^{'}} * v_{\pi}[s^{'}]) $
Method - 1
we can find average reward function over our current policy, $ \pi $ , and average state transition probability over $ \pi $ then,
$ v_{\pi}[s] = R^{\pi}_s + \sum P^{\pi}_{ss^{'}} * v_{\pi}[s^{'}] $
Now, we can write the above equation in matrix form and then find value function, V, by using inversion of matrix.
Method - 2 :- Iterative Policy Evaluation
Here, we start with a random value function and improve it over many iteration by iteratively applying Bellman expectation equation.
$ v_{\pi}[s] = \sum \pi[a|s] * ( R_s^a + \sum P^a_{s s^{'}} * v_{\pi}[s^{'}]) $
Principle of optimality
It’s means that an optimal policy is one that
-
first take optimal action
-
and then follow optimal policy from successive state
A policy π(a|s) achieves the optimal value from state s, $ v_π(s) = v^∗(s) $, if and only if For any state s' reachable from s π achieves the optimal value from state s' i.e. $ v_π(s') = v∗(s') $
Policy Improvement Theorem
Let $\pi$ and $ \pi^{'} $ be two deterministic policy such that
$$ q_{\pi}(s,\pi^{'}[s]) >= v_{\pi}[s] $$
Then the policy π' must be as good as, or better than, π. i.e.
$ v_{\pi^{'}}[s] >= v_{\pi} $ for all state s
Policy Improvement
Here, we are given a policy and we need to improve it or from given policy we need to find optimal policy and corresponding value function.
We evaluate the current policy, $ \pi $, and find state value function using Iterativer policy evaluation and then greedily update our policy.
-
$ \pi^{'} = greedy(v_{\pi}) $
-
This process always converges to $ \pi_{*} $
by greedy we mean that we will choose action/state which will lead to maximum value function (action or state)
Let $ \pi $ be a deterministic policy
$ \pi^{'}[s] = max_{a} q_{\pi}[a|s] $
We can prove this will always be $ \pi^{'}[s] $ will always be better than $ \pi $
and
After several interation we will converge on $ \pi^{*} $
** we may also stop after some K iterative of policy improvement steps knows as k - convergence of value function **
Value Interation
Idea is we know the value funciton forany state then we can use Bellman optimality equation to get optimal value funtion to its preceding state.
Intition is to start from final/terminating state and work backwords.
$ V_*[s] = max_a(R_s^a + \sum P^a_{ss'} * V[s'] ) $
It can also be proven that it will also convege to optimal policy
Notice that here intermediate value function do not have any meaning means there may not exist a policy that will give a particular intermediate value function
Asynchronous DP
In Synchronous Dp we update all state value function at once, But in asynchronous DP we update individual states value function.
Helps to reduce commutation speed.
Will still lead to convergence if all states are choosen
In-place DP
Just keep updating on the go.
Priority based DP
Select state to update based on it’s priority.
Priority can be defined as the Bellman error.
$ error/priority = abs( max_a(R_s^a + \sum P^a_{ss'} * V[s'] ) -v[s]) $
Real Time DP
Get a real agent in the environment and update each state once it is visited.