Introduction to Reinforcement Learning
LINKS
Slides
Video Lecture Link
Short Explanations
Rewards
It is the central quantity that tells us how well we are doing at time t. Our job is to maximize the cumulative reward collected by the end of the episode.
It is believed that all goals can be achieved by maximizing the expected cumulative reward.
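A minimal sketch of that objective, with made-up numbers: the quantity the agent is judged on is the total reward collected over an episode.

```python
# Made-up per-step rewards received during one episode: R_1, R_2, ...
rewards = [0.0, -1.0, 2.0, 5.0]

# The agent is judged on the cumulative reward at the end of the episode.
episode_return = sum(rewards)
print(episode_return)  # 6.0
```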
History
$ H_t = O_1 R_1 A_1 O_2 R_2 A_2 \dots O_t R_t A_t $
State
A state is a summary of the history:
$ S_t = F(H_t) $
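A small sketch of how the history accumulates and how a state can be computed from it; the toy environment and the particular choice of $F$ (keep only the latest observation) are assumptions for illustration.

```python
import random

def env_step(action):
    """Hypothetical environment: returns the next observation and reward."""
    observation = random.random()               # O_t
    reward = 1.0 if action == "stay" else 0.0   # R_t
    return observation, reward

history = []        # H_t = O_1 R_1 A_1 ... O_t R_t A_t
action = "stay"
for t in range(5):
    observation, reward = env_step(action)
    # S_t = F(H_t); here F simply keeps the most recent observation.
    state = observation
    action = "stay" if state > 0.5 else "move"
    history.extend([observation, reward, action])
```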
Agent State
The agent state is the information the agent uses to choose its next action.
Environment State
The environment state is the information the environment uses to produce the next observation and reward.
Markov State aka Information State
A Markov state means the future is independent of the past given the present:
$$ P[S_{t+1}|S_t] = P[S_{t+1}|S_1,S_2,\dots,S_t] $$
By definition:
$H_t$ is always a Markov state.
The environment state is always a Markov state.
So for a maneuvering helicopter, position alone is not a Markov state, since it is not sufficient to predict the future position; position + velocity + wind speed + direction is Markov.
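A sketch of that helicopter point with made-up dynamics: when the state contains position, velocity, and wind, the next position follows from the current state alone; position by itself would not be enough.

```python
def next_state(state, dt=0.1):
    """Made-up dynamics: the next position needs velocity and wind,
    so (position, velocity, wind) is Markov while position alone is not."""
    position, velocity, wind = state
    new_position = position + (velocity + wind) * dt
    return (new_position, velocity, wind)   # velocity/wind kept constant here

s = (0.0, 2.0, -0.5)   # position, velocity, wind
s = next_state(s)      # the future is predictable from s alone
```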
Fully Observable Environment aka MDP
The environment is fully observable when the agent directly observes the environment state.
It is an MDP (Markov Decision Process) because the environment state is Markov by definition.
Policy
A policy is a mapping from state to action.
Deterministic policy:
$$ \pi[S_t] = A_t $$
Stochastic policy:
$$ \pi[a_t|s_t] = P[A_t=a_t|S_t=s_t] $$
Notice that the policy depends only on the current state, not on the whole history; this follows from the definition of state. In the deterministic case every state maps to exactly one action.
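A small sketch of both forms over a toy state and action set (all names and probabilities below are illustrative): the deterministic policy maps each state to a single action, the stochastic one samples from $P[A_t=a_t|S_t=s_t]$.

```python
import random

# Deterministic policy: every state maps to exactly one action.
deterministic_policy = {"low_battery": "recharge", "ok_battery": "search"}

# Stochastic policy: every state maps to a distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "ok_battery":  {"recharge": 0.1, "search": 0.9},
}

def sample_action(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["low_battery"])  # always 'recharge'
print(sample_action("ok_battery"))          # usually 'search'
```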
Value Function
It tells us how good it is to be in a particular state.
How good a state is depends on the actions the agent takes from it, so the value function depends on the policy:
$$ v_{\pi}[s] = E_{\pi}[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s ] $$
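A sketch of what the definition means in code: the discounted return of a single reward sequence, and an (assumed) Monte-Carlo style estimate of $v_{\pi}[s]$ obtained by averaging returns over episodes that start in $s$; the rewards and discount factor are made up.

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def estimate_value(reward_sequences, gamma=0.9):
    """v_pi(s): expected return from s under pi, estimated by averaging
    the returns of sampled episodes that started in s."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

episodes_from_s = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # made-up samples
print(estimate_value(episodes_from_s))
```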
Model
It is the agent's prediction of how the environment will behave: the next state and the next reward.
$$ P^a_{ss^{'}} = P( S_{t+1} = s^{'} \mid A_t = a , S_t = s ) $$
$$ R^a_{s} = E( R_{t+1} \mid A_t = a , S_t = s ) $$
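A tabular sketch of a model matching the two definitions above, with made-up numbers: one table for the transition probabilities and one for the expected rewards.

```python
# transitions[(s, a)] maps s' -> P(S_{t+1} = s' | S_t = s, A_t = a)
transitions = {
    ("s0", "go"): {"s1": 0.8, "s0": 0.2},
    ("s1", "go"): {"s1": 1.0},
}

# expected_reward[(s, a)] = E(R_{t+1} | S_t = s, A_t = a)
expected_reward = {
    ("s0", "go"): 0.8,
    ("s1", "go"): 1.0,
}

p = transitions[("s0", "go")]["s1"]    # P^go_{s0 s1} = 0.8
r = expected_reward[("s0", "go")]      # R^go_{s0} = 0.8
```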
Learning vs Planning
Learning: we interact with the environment and improve our policy from that experience.
Planning: the model of the environment is known, so we can perform computations with the model without interacting with the actual environment.
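A compact sketch of the contrast, with a made-up environment and model: in learning the agent can only sample the environment and improve its estimate from those samples, while in planning the model is given and the same quantity can be computed directly, without any interaction.

```python
import random

def env_step(state, action):
    """Hypothetical environment: the learning agent can only sample this."""
    next_state = "s1" if random.random() < 0.8 else "s0"
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward

# Learning: interact with the environment, then update a running estimate.
estimate, n = 0.0, 0
for _ in range(1000):
    _, reward = env_step("s0", "go")
    n += 1
    estimate += (reward - estimate) / n     # average of sampled rewards

# Planning: the model is known, so the expected reward is simply computed.
model = {("s0", "go"): {"s1": 0.8, "s0": 0.2}}
reward_of = {"s1": 1.0, "s0": 0.0}
computed = sum(p * reward_of[s2] for s2, p in model[("s0", "go")].items())

print(round(estimate, 2), computed)         # both are about 0.8
```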