Short Explanations

Rewards

It is the central quantity that tells us how well we are doing at time t. Our job is to maximize the cumulative reward by the end of the episode.

It is believed that all goals can be achieved by maximizing the expected cumulative reward (the reward hypothesis).

History

$$ H_t = O_1, R_1, A_1, O_2, R_2, A_2, \ldots, O_t, R_t, A_t $$

State

The state is a summary of the history.

$$ S_t = f(H_t) $$
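As a toy illustration (my sketch, not from the lecture), the history can be a list of (observation, reward, action) tuples and $f$ can be any function that compresses it, e.g. keeping only the latest observation:

```python
# Toy sketch: the history as a list of (observation, reward, action)
# tuples, plus one possible state function f that compresses it.

def f(history):
    """Keep only the latest observation as the state."""
    last_obs, _, _ = history[-1]
    return last_obs

# H_t = O_1, R_1, A_1, O_2, R_2, A_2
history = [("o1", 0.0, "a1"), ("o2", 1.0, "a2")]
state = f(history)  # S_t = f(H_t)
print(state)        # -> o2
```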

Agent State

The agent state is the information the agent uses to pick its next action.

Environment State

The environment state is the information the environment uses to produce the next observation and reward.

Markov State aka Information State

If we have a Markov state, that means the future is independent of the past given the present.

$$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \ldots, S_t] $$

By definition:

$H_t$ is always a Markov state (trivially, since it contains the whole history).

The environment state is always a Markov state.

So for a maneuvering helicopter, position alone is not Markov, since it is not sufficient to predict the future position; but position + velocity + wind speed + wind direction together form a Markov state.

Fully Observable Environment aka MDP

The environment is fully observable when the agent directly observes the environment state, i.e. observation = agent state = environment state.

It's an MDP (Markov Decision Process) because the environment state is Markov by definition.

Policy

A policy is a mapping from states to actions.

A deterministic policy picks one action per state:

$$ \pi(S_t) = A_t $$

A stochastic policy gives a distribution over actions per state:

$$ \pi(a_t \mid s_t) = P[A_t = a_t \mid S_t = s_t] $$

Notice that in both cases the action depends only on the current state, not on the whole history, because by definition the state already summarizes the history.
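A minimal sketch of both kinds of policy in Python (the states, actions, and probabilities here are made up for illustration):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {"s1": "left", "s2": "right"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_pi = {
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.2, "right": 0.8},
}

def act_deterministic(pi, state):
    return pi[state]  # A_t = pi(S_t)

def act_stochastic(pi, state):
    actions = list(pi[state])
    probs = list(pi[state].values())
    return random.choices(actions, weights=probs)[0]  # A_t ~ pi(. | S_t)

print(act_deterministic(deterministic_pi, "s1"))  # -> left
print(act_stochastic(stochastic_pi, "s1"))        # -> left (with prob 0.9)
```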

Value Function

It tells us how good it is to be in a particular state.

Since how good a state is depends on the actions taken from that state onward, the value function also depends on the policy.

$$ v_{\pi}(s) = E_{\pi}[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s ] $$
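As a sketch, the return inside the expectation is just a discounted sum, and $v_{\pi}(s)$ can be approximated by averaging the returns of sampled episodes starting from $s$ (Monte Carlo estimation; the reward sequences below are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Monte Carlo estimate of v_pi(s): average the return over episodes
# that start in s while following pi.
episodes_from_s = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
v_s = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)
print(v_s)
```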

Model

It's a prediction of how the environment will behave.

Transitions (what the next state will be):

$$ \mathcal{P}^a_{ss'} = P( S_{t+1} = s' \mid A_t = a , S_t = s ) $$

Rewards (what the next reward will be):

$$ \mathcal{R}^a_{s} = E( R_{t+1} \mid A_t = a , S_t = s ) $$
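For a small finite problem, a model can be nothing more than two lookup tables, as in this sketch (the states and actions are hypothetical):

```python
# Tabular model:
# P[(s, a)] is a distribution over next states s',
# R[(s, a)] is the expected immediate reward.
P = {
    ("s1", "go"): {"s1": 0.3, "s2": 0.7},
    ("s2", "go"): {"s2": 1.0},
}
R = {
    ("s1", "go"): 0.0,
    ("s2", "go"): 1.0,
}

print(P[("s1", "go")]["s2"])  # P(S_{t+1}=s2 | A_t=go, S_t=s1) -> 0.7
print(R[("s1", "go")])        # E(R_{t+1} | A_t=go, S_t=s1)    -> 0.0
```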

Learning vs Planning

Learning: we interact with the environment and improve our policy from experience.

Planning: we are given a model of the environment and improve our policy by computing with that model, without interacting with the actual environment.

Exploration vs Exploitation

Exploration: try new actions to gather more information about the environment.

Exploitation: use the information we already have to maximize reward.
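One common way to trade off the two (not specific to this lecture) is an epsilon-greedy rule: explore with a small probability, exploit otherwise. A minimal sketch with made-up action values:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action),
    otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore
    return max(q_values, key=q_values.get)    # exploit

q = {"left": 1.2, "right": 0.7}  # made-up action values
print(epsilon_greedy(q))
```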

Prediction vs Control

Prediction: evaluate the future for a given policy (e.g. compute its value function).

Control: find the best policy (optimize the future).