Value Function
A (deterministic and stationary) policy $\pi : S \rightarrow A$ specifies a decision-making strategy in which the agent chooses its action based on the current state, i.e., $a_t = \pi(s_t)$. For a starting state $s$, define

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right],$$

which is the value obtained by following policy $\pi$ from state $s$. The expectation is with respect to the randomness of the trajectory $\tau$, that is, the randomness in state transitions and any stochasticity of $\pi$. Assuming each per-step reward lies in $[0, 1]$, the discounted sum of rewards (or the discounted return) along any actual trajectory is always bounded in the range $[0, 1/(1-\gamma)]$, and hence so is $V^{\pi}$.
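To see where this range comes from (a one-line check, under the same assumption that rewards lie in $[0, 1]$ and $0 \le \gamma < 1$): each term of the sum is at least $0$ and at most $\gamma^{t}$, so the geometric series gives

$$0 \;\le\; \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} \;=\; \frac{1}{1-\gamma}.$$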
It’s often useful to know the value of a state or of a state-action pair: the expected return if you start there and then act according to a particular policy forever after.
There are four main functions of note here (a small numerical sketch computing all four follows the list):
- The value function, $V^{\pi}(s)$, which gives the expected return if you start in state $s$ and always act according to policy $\pi$:
  $$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right]$$
- The action-value function, $Q^{\pi}(s, a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:
  $$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a \right]$$
- The optimal value function, $V^{*}(s)$, which gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment:
  $$V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right]$$
- The optimal action-value function, $Q^{*}(s, a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment:
  $$Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a \right]$$
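To make these definitions concrete, here is a minimal sketch; the two-state MDP, its transition and reward numbers, and the policy below are made up purely for illustration. It computes $V^{\pi}$ and $Q^{\pi}$ exactly for a fixed stochastic policy by solving the linear Bellman system, and $V^{*}$ and $Q^{*}$ by value iteration:

```python
import numpy as np

n_states, gamma = 2, 0.9

# P[s, a, s'] = probability of landing in state s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0 and 1
])

# r[s, a] = immediate reward for taking action a in state s (kept in [0, 1]).
r = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
])

# pi[s, a] = probability that the (stochastic) policy takes action a in state s.
pi = np.array([
    [0.5, 0.5],
    [1.0, 0.0],
])

# V^pi: solve the linear system V = r_pi + gamma * P_pi V for the fixed policy.
r_pi = (pi * r).sum(axis=1)                    # expected one-step reward under pi
P_pi = np.einsum("sa,sat->st", pi, P)          # state-to-state transition matrix under pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q^pi(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V^pi(s')
Q_pi = r + gamma * P @ V_pi

# V* and Q*: value iteration on V(s) <- max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
V_star = np.zeros(n_states)
for _ in range(1000):
    V_star = (r + gamma * P @ V_star).max(axis=1)
Q_star = r + gamma * P @ V_star

print("V^pi =", V_pi, " Q^pi =", Q_pi)
print("V*   =", V_star, " Q*   =", Q_star)

# Numerical check of the two connections stated below.
print(np.allclose(V_pi, (pi * Q_pi).sum(axis=1)))   # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
print(np.allclose(V_star, Q_star.max(axis=1)))      # V*(s)   = max_a Q*(s, a)
```

Solving a linear system is exact for $V^{\pi}$ because the fixed-policy equation is linear in $V^{\pi}$; the max over actions makes the optimality equation nonlinear, which is why $V^{*}$ is computed by iteration instead.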
When we talk about value functions, if we do not make reference to time-dependence, we only mean the expected infinite-horizon discounted return.
There are two key connections between the value function and the action-value function that come up pretty often:
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi}(s, a) \right],$$
and
$$V^{*}(s) = \max_{a} Q^{*}(s, a).$$
These relations follow pretty directly from the definitions just given: can you prove them?
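One way to see the first relation is to condition on the first action and apply the law of total expectation:

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ \, \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a \right] \right] = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi}(s, a) \right].$$

The second relation follows from a similar idea: an optimal policy is free to pick its first action in $s$, so the best achievable expected return is obtained by choosing the $a$ that maximizes $Q^{*}(s, a)$.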