
What is a Markov Decision Process (MDP)

An MDP is a process where the environment is fully observable, i.e. the current state fully characterizes the future process/outcomes.

A POMDP can be converted to an MDP (for example, by treating the full history or a belief over states as the state).

$$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_t, S_{t-1}, S_{t-2}, \ldots] $$

What is a Markov Process (MP)

A Markov process (or Markov chain) is a sequence of random states with the Markov property.

MP = < S,P >

S = set of states

$ P_{ss'} $ = probability of transition from state $ s $ to $ s' $.

It can be drawn as a state-transition diagram, much like a finite automaton, except that the transitions are probabilistic rather than deterministic.
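A minimal sketch in Python of sampling from a Markov process (the state names and transition probabilities below are made up for illustration):

```python
import numpy as np

# A made-up 3-state Markov chain: a set of states and a row-stochastic
# transition matrix P, where P[s, s'] is the probability of moving from s to s'.
states = ["Sleep", "Study", "Play"]
P = np.array([
    [0.1, 0.7, 0.2],   # from Sleep
    [0.3, 0.4, 0.3],   # from Study
    [0.5, 0.4, 0.1],   # from Play
])

rng = np.random.default_rng(0)

def sample_trajectory(start=0, steps=10):
    """Sample a state sequence; the next state depends only on the current one."""
    s, traj = start, [start]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])   # Markov property: only P[s] matters
        traj.append(s)
    return [states[i] for i in traj]

print(sample_trajectory())
```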

What is a Markov Reward Process (MRP)

An MRP is an MP where a reward is associated with every state; this is the immediate reward we get from that state.

MRP = < S,P,R,$ \gamma $ >

S = set of states

$ P_{ss'} $ = probability of transition from state $ s $ to $ s' $.

R = reward function (the expected immediate reward in each state)

$ \gamma $ is the discount factor, $ \gamma \in [0, 1] $

What's our goal in an MRP

$$ G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots $$

We want to maximize the cumulative discounted reward, so (the expectation of) $ G_t $ is what we want to maximize.

$ \gamma $ is used because:

  1. The math becomes easier (the discounted sum is guaranteed to be finite).

  2. We can handle non-terminating MRPs and MRPs with loops.

  3. We are not 100% certain that our model is correct, so rewards further in the future are trusted less.
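A quick numeric sketch (made-up rewards) showing how the discounted return $ G_t $ is computed and how a smaller $ \gamma $ down-weights later rewards:

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 0, 0, 5]                          # made-up immediate rewards
print(discounted_return(rewards, gamma=0.9))    # 1 + 0.9**3 * 5 = 4.645
print(discounted_return(rewards, gamma=0.5))    # 1 + 0.5**3 * 5 = 1.625
```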

Value function

$$ V[s] = E[ G_t | S_t = s ] $$

It tells us how good our current state is.

That is, if I drop you in this state, how much reward can you expect to collect from here onwards?

BELLMAN EQUATION for MRP

$$ V[s] = E[ R_t + \gamma V[S_{t+1}] \mid S_t = s ] $$

$$ V[s] = R_s + \gamma \sum_{s'} P_{ss'} V[s'] $$
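Because this Bellman equation is linear, $ V $ can be computed directly as $ V = (I - \gamma P)^{-1} R $. A minimal sketch with made-up numbers, reusing the toy chain from the Markov process example above:

```python
import numpy as np

# Made-up 3-state MRP: transition matrix P, per-state reward R, discount gamma.
P = np.array([
    [0.1, 0.7, 0.2],
    [0.3, 0.4, 0.3],
    [0.5, 0.4, 0.1],
])
R = np.array([-1.0, 2.0, 0.5])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P @ V  =>  (I - gamma*P) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```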

What's an MDP

An MDP is an MRP with decisions/actions.

MDP = < S, A, P, R, $ \gamma $ >

S = set of states

A = set of actions

$ P_{ss'}^a $ = probability of transition from state $ s $ to $ s' $ if action $ a $ is taken

$ R_s^a $ = expected reward we get if we take action $ a $ in state $ s $

$ \gamma $ is the discount factor

Policy

We can have a deterministic policy or a stochastic policy.

It tells us the probability of taking action $ a $ in state $ s $:

$ \pi (a|s) = P[A_t = a | S_t = s] $
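A minimal sketch of a stochastic policy as a table of probabilities $ \pi(a|s) $, with an action sampled for the current state (the state and action names are made up):

```python
import numpy as np

actions = ["study", "sleep"]

# pi[s] = list of probabilities over actions for state s (made-up numbers).
pi = {
    "class1": [0.8, 0.2],
    "class2": [0.6, 0.4],
}

rng = np.random.default_rng(0)

def sample_action(state):
    """Draw A_t ~ pi(. | S_t = state)."""
    return actions[rng.choice(len(actions), p=pi[state])]

print(sample_action("class1"))
```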

State value function

It tells us how good a particular state is when following policy $ \pi $.

$ V_{\pi}[s] = E_{\pi}[ G_t | S_t = s] $

Action value function

It tells us how good taking a particular action in a particular state is when following policy $ \pi $.

$ q_{\pi}[s,a] = E_{\pi}[ G_t | S_t = s, A_t = a] $

BELLMAN EXPECTATION EQUATION

$ v_{\pi}[s] = \sum_{a} \pi[a|s] \, q_{\pi}[s, a] $

$ q_{\pi}[s, a] = R_s^a + \gamma \sum_{s'} P^a_{ss'} v_{\pi}[s'] $

$ v_{\pi}[s] = \sum_{a} \pi[a|s] \left( R_s^a + \gamma \sum_{s'} P^a_{ss'} v_{\pi}[s'] \right) $
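Turning the Bellman expectation equation into an iterative update gives policy evaluation: repeatedly apply $ v(s) \leftarrow \sum_a \pi(a|s) ( R_s^a + \gamma \sum_{s'} P^a_{ss'} v(s') ) $ until it converges to $ v_{\pi} $. A minimal sketch on a tiny made-up MDP (all numbers are arbitrary):

```python
import numpy as np

gamma = 0.9

# Made-up MDP: P[a, s, s'] = transition probabilities, R[s, a] = expected rewards.
P = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],   # action 0
    [[0.1, 0.0, 0.9], [0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],   # action 1
])
R = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])

# A made-up stochastic policy pi[s, a].
pi = np.array([[0.5, 0.5], [0.7, 0.3], [0.1, 0.9]])

v = np.zeros(P.shape[1])
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)   # q[s, a] = R_s^a + gamma * sum_s' P * v[s']
    v_new = (pi * q).sum(axis=1)                   # v[s] = sum_a pi[a|s] * q[s, a]
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print("v_pi:", v)
```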

Optimal Policy

$ \pi_* $ is optimal if $ v_{\pi_*}[s] \geq v_{\pi}[s] $ for all states $ s $ and for every policy $ \pi $

There is always an optimal policy.

An optimal policy achieves the optimal state value function and the optimal action value function.

BELLMAN OPTIMALITY EQUATIONS

$ v_{*}[s] = \max_{a} q_{*}[s, a] $

$ q_{*}[s, a] = R_s^a + \gamma \sum_{s'} P^a_{ss'} v_{*}[s'] $

$ v_{*}[s] = \max_{a} \left( R_s^a + \gamma \sum_{s'} P^a_{ss'} v_{*}[s'] \right) $
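Replacing the expectation over $ \pi $ with a max over actions turns the iterative update into value iteration, which converges to $ v_* $. A minimal sketch reusing the made-up MDP from the policy evaluation example above:

```python
import numpy as np

gamma = 0.9

# Same made-up MDP as before: P[a, s, s'] and R[s, a].
P = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],
    [[0.1, 0.0, 0.9], [0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
])
R = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])

v = np.zeros(P.shape[1])
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)   # q[s, a]
    v_new = q.max(axis=1)                          # v*[s] = max_a q*[s, a]
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print("v*:", v)
print("greedy policy (action index per state):", q.argmax(axis=1))
```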