Markov Decision Process
What is a Markov Decision Process (MDP)
An MDP is a process where the environment is fully observable, i.e. the current state fully characterizes the future process/outcomes.
A POMDP can be converted to an MDP.
$$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_t, S_{t-1}, S_{t-2}, \ldots] $$
What is a Markov Process (MP)
A Markov process (Markov chain) is a sequence of states with the Markov property.
MP = < S,P >
S = set of states
$ P_{ss'} $ = probability of transition from state s to $ s' $.
It can be drawn like a finite automaton, except the transitions are probabilistic rather than deterministic.
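As a minimal sketch (the two-state weather chain and all numbers below are made up for illustration), a Markov chain can be stored as a row-stochastic transition matrix and sampled step by step:

```python
import numpy as np

# Toy Markov chain: states and a row-stochastic transition matrix P,
# where P[i, j] = probability of moving from state i to state j.
states = ["Sunny", "Rainy"]
P = np.array([
    [0.8, 0.2],   # from Sunny
    [0.4, 0.6],   # from Rainy
])

def sample_chain(P, start=0, steps=10, rng=np.random.default_rng(0)):
    """Sample a trajectory; the next state depends only on the current one."""
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # Markov property: only the row P[s] matters
        path.append(s)
    return path

print([states[s] for s in sample_chain(P)])
```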
What is a Markov Reward Process(MRP)
An MRP is an MP where there is a reward associated with every state. This is the immediate reward we get from that state.
MRP = < S,P,R,$ \gamma $ >
S = set of states
$ P_{ss'} $ = probability of transition from state s to $ s' $.
R = Reward function
$ \gamma $ is discount factor
What’s our goal in an MRP
$$ G_t = R_t + \gamma R_{t+1} + {\gamma}^2 R_{t+2} + \ldots $$
G_t is the (discounted) return. Since we want to maximize the cumulative reward, we maximize the expected value of G_t.
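A tiny illustration (the reward values are chosen arbitrarily) of computing the discounted return G_t from a finite sequence of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards collected along one hypothetical episode
print(discounted_return([1, 0, 0, 5], gamma=0.9))  # 1 + 0.9^3 * 5 = 4.645
```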
$ \gamma $ is used because:
- the math becomes easier (the infinite sum of rewards converges for $ \gamma < 1 $)
- it lets us handle non-terminating MRPs and MRPs with loops
- we are not 100% certain that our model is correct, so immediate rewards are trusted more than distant ones
value function
$$ V[s] = E[ G_t | S_t = s ] $$
It tells us how good our current state is: if I drop you in this state, what's the reward you can get from here onwards?
BELLMAN EQUATION for MRP
$$ V[s] = E[\, R_s + \gamma V[S_{t+1}] \mid S_t = s \,] $$
$$ V[s] = R_s + \gamma \sum_{s'} P_{ss'} V[s'] $$
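Because this equation is linear, V can be solved for directly as $ V = (I - \gamma P)^{-1} R $ when the state space is small. A sketch with a made-up 3-state MRP (all numbers invented for illustration):

```python
import numpy as np

# Toy MRP: row-stochastic transition matrix P and per-state reward R.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing state
])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P V
# => (I - gamma * P) V = R, solved directly for small state spaces.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```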
What’s an MDP
An MDP is an MRP with decisions/actions.
MDP = < S, A, P, R, $ \gamma $ >
S = set of states
A = set of actions
$ P_{ss^{'}}^a $ = probability of transition from state s to $ s^{'} $ if action a is taken
$ R_s^a $ = Reward we will get if we take action a from state s
$ \gamma $ is discount factor
Policy
We can have a deterministic policy or a stochastic policy.
A policy tells us the probability of taking action a from state s:
$ \pi (a|s) = P[A_t = a | S_t = s] $
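A small sketch of a stochastic policy as a lookup table (the state and action names here are hypothetical):

```python
import numpy as np

# A stochastic policy as a table: pi[s][a] = P[A_t = a | S_t = s].
pi = {
    "low_battery":  {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def sample_action(pi, s, rng=np.random.default_rng(0)):
    """Draw an action for state s according to pi(.|s)."""
    actions, probs = zip(*pi[s].items())
    return actions[rng.choice(len(actions), p=probs)]

print(sample_action(pi, "low_battery"))
```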
state value function
It tells us how good a particular state is under policy $ \pi $:
$ V_{\pi}[s] = E_{\pi}[ G_t | S_t = s] $
action value function
It tells us how good a particular action is in a particular state under policy $ \pi $:
$ q_{\pi}[s,a] = E_{\pi}[ G_t | S_t = s, A_t = a] $
BELLMAN EXPECTATION EQUATION
$ v_{\pi}[s] = \sum_a \pi[a|s] \, q_{\pi}[s, a] $
$ q_{\pi}[s, a] = R_s^a + \gamma \sum_{s'} P^a_{ss'} v_{\pi}[s'] $
$ v_{\pi}[s] = \sum_a \pi[a|s] ( R_s^a + \gamma \sum_{s'} P^a_{ss'} v_{\pi}[s'] ) $
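One standard way to use these equations is iterative policy evaluation: start from v = 0 and repeatedly apply the Bellman expectation backup until it converges. A sketch on a made-up 2-state, 2-action MDP (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical MDP: P[a][s][s'] = transition probability, R[a][s] = expected reward.
n_s, n_a = 2, 2
P = np.array([
    [[0.7, 0.3], [0.2, 0.8]],   # action 0
    [[0.9, 0.1], [0.4, 0.6]],   # action 1
])
R = np.array([
    [1.0, 0.0],                 # action 0
    [0.5, 2.0],                 # action 1
])
pi = np.full((n_s, n_a), 0.5)   # uniform random policy pi(a|s)
gamma = 0.9

# Iterative policy evaluation: repeatedly apply the Bellman expectation equation.
v = np.zeros(n_s)
for _ in range(1000):
    q = R + gamma * P @ v                 # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] v[s']
    v_new = np.einsum("sa,as->s", pi, q)  # v[s] = sum_a pi(a|s) q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)
```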
Optimal Policy
$ \pi_* $ is optimal if $ v_{\pi_*}[s] \ge v_{\pi}[s] $ for all states s and all policies $ \pi $.
There always exists an optimal policy.
An optimal policy achieves the optimal state value and action value functions.
BELLMAN OPTIMALITY EQUATIONS
$ v_*[s] = \max_a q_*[s, a] $
$ q_*[s, a] = R_s^a + \gamma \sum_{s'} P^a_{ss'} v_*[s'] $
$ v_*[s] = \max_a ( R_s^a + \gamma \sum_{s'} P^a_{ss'} v_*[s'] ) $
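Applying this optimality backup repeatedly gives value iteration, and the policy that is greedy with respect to $ q_* $ is an optimal policy. A sketch reusing the same made-up MDP layout as in the policy-evaluation example above:

```python
import numpy as np

# Hypothetical MDP (same layout as before): P[a][s][s'] transitions, R[a][s] rewards.
P = np.array([
    [[0.7, 0.3], [0.2, 0.8]],   # action 0
    [[0.9, 0.1], [0.4, 0.6]],   # action 1
])
R = np.array([
    [1.0, 0.0],
    [0.5, 2.0],
])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup.
v = np.zeros(P.shape[1])
for _ in range(1000):
    q = R + gamma * P @ v          # q[a, s]
    v_new = q.max(axis=0)          # v*[s] = max_a q*(s, a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

q = R + gamma * P @ v              # recompute q* at the converged v*
print("v* =", v, "  pi* =", q.argmax(axis=0))
```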