Value Function

March 10, 2024

The value function $V^\pi$ for a policy $\pi$ measures the expected return if you start in state $s$ and follow policy $\pi$.

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \,\middle|\, S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s\right]$$
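As a quick numerical illustration (the reward sequence and discount below are assumptions, not from the text), a single sampled return $G_t$ is just a discounted sum of the rewards observed along one trajectory; $V^\pi(s)$ is the expectation of such returns over trajectories:

```python
import numpy as np

gamma = 0.9
# Assumed sample reward sequence R_t, R_{t+1}, ... from one rollout (illustrative only).
rewards = np.array([1.0, 0.0, 0.5, 1.0, 0.2])

# One sampled return: G_t = sum_k gamma^k * R_{t+k}, truncated to the observed rewards.
G_t = np.sum(gamma ** np.arange(len(rewards)) * rewards)
print(G_t)  # ~2.265 for this sequence; V^pi(s) is the average of such returns over trajectories
```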

A (deterministic and stationary) policy $\pi : S \to A$ specifies a decision-making strategy in which the agent chooses actions adaptively based on the current state, i.e., $a_t = \pi(s_t)$. More generally, the agent may also choose actions according to a stochastic policy $\pi : S \to \Delta(A)$, and with a slight abuse of notation we write $a_t \sim \pi(s_t)$. A deterministic policy is the special case in which $\pi(s)$ is a point mass for all $s \in S$. The goal of the agent is to choose a policy $\pi$ to maximize the expected discounted sum of rewards, or value:

$$\mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, \pi, s_1\right]$$

The expectation is with respect to the randomness of the trajectory, that is, the randomness in the state transitions and the stochasticity of $\pi$. Notice that, since $r_t$ is nonnegative and upper bounded by $R_{\max}$, we have

$$0 \le \sum_{t=1}^{\infty} \gamma^{t-1} r_t \le \sum_{t=1}^{\infty} \gamma^{t-1} R_{\max} = \frac{R_{\max}}{1-\gamma}$$

Hence, the discounted sum of rewards (or the discounted return) along any actual trajectory is always bounded in the range $\left[0, \frac{R_{\max}}{1-\gamma}\right]$, and so is its expectation of any form; the last equality above is just the geometric series $\sum_{t=1}^{\infty}\gamma^{t-1} = \frac{1}{1-\gamma}$ for $\gamma \in [0,1)$. Note that, for a fixed policy, its value may differ for different choices of $s_1$, and we define the value function $V_M^\pi : S \to \mathbb{R}$ as

$$V_M^\pi(s) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, \pi, s_1 = s\right]$$

which is the value obtained by following policy $\pi$ starting at state $s$. Similarly we define the action-value (or Q-value) function $Q_M^\pi : S \times A \to \mathbb{R}$ as

$$Q_M^\pi(s,a) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, \pi, s_1 = s, a_1 = a\right]$$
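To make these two definitions concrete, here is a minimal Monte Carlo sketch that estimates $V_M^\pi(s)$ and $Q_M^\pi(s,a)$ by averaging truncated discounted returns over rollouts. The 3-state, 2-action MDP ($P$, $R$), the policy, and the horizon below are all assumptions chosen for illustration; truncation is harmless because the tail is weighted by $\gamma^{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A made-up 3-state, 2-action MDP (assumed for illustration, not from the text).
# P[s, a] is the next-state distribution; R[s, a] is a bounded reward in [0, 1].
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.2, 1.0]])
pi = np.array([[0.7, 0.3],          # stochastic policy pi: S -> Delta(A)
               [0.5, 0.5],
               [0.1, 0.9]])

def sampled_return(s, first_action=None, horizon=200):
    """One truncated sample of sum_{t=1}^H gamma^{t-1} r_t, starting at s_1 = s
    (and taking a_1 = first_action if given, then following pi thereafter)."""
    G, discount, a = 0.0, 1.0, first_action
    for _ in range(horizon):
        if a is None:
            a = rng.choice(2, p=pi[s])       # a_t ~ pi(s_t)
        G += discount * R[s, a]
        s = rng.choice(3, p=P[s, a])         # s_{t+1} ~ P(. | s_t, a_t)
        discount *= gamma
        a = None                             # after the first step, always follow pi
    return G

V_hat = np.mean([sampled_return(s=0) for _ in range(2000)])
Q_hat = np.mean([sampled_return(s=0, first_action=1) for _ in range(2000)])
print(f"V^pi(0) ~= {V_hat:.3f}   Q^pi(0, 1) ~= {Q_hat:.3f}")
```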

It’s often useful to know the value of a state, or of a state-action pair. By value, we mean the expected return if you start in that state or state-action pair and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.

There are four main functions of note here:

  1. The On-Policy Value Function, $V^\pi(s)$, which gives the expected return if you start in state $s$ and always act according to policy $\pi$:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big]$$

  2. The On-Policy Action-Value Function, $Q^\pi(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]$$

  3. The Optimal Value Function, $V^*(s)$, which gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment:

$$V^*(s) = \max_\pi \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big]$$

  4. The Optimal Action-Value Function, $Q^*(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment:

$$Q^*(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]$$
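For a small tabular MDP, $Q^*$ and $V^*$ can be computed exactly rather than left as a maximum over policies. The sketch below runs value iteration on the same made-up 3-state, 2-action MDP used in the earlier sketch (again an assumption for illustration, not from the text), and reads off $V^*(s) = \max_a Q^*(s,a)$ together with a greedy optimal policy.

```python
import numpy as np

gamma = 0.9

# The same made-up 3-state, 2-action MDP as in the sketch above (assumed, not from the text).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.2, 1.0]])

# Value iteration: repeatedly apply the Bellman optimality backup
#   Q(s, a) <- R(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} Q(s', a'),
# which converges to Q* because the backup is a gamma-contraction.
Q = np.zeros_like(R)
for _ in range(1000):
    Q = R + gamma * (P @ Q.max(axis=1))

V_star = Q.max(axis=1)        # V*(s) = max_a Q*(s, a)
pi_star = Q.argmax(axis=1)    # a deterministic policy that is greedy w.r.t. Q*
print("Q*:\n", np.round(Q, 3))
print("V*:", np.round(V_star, 3), " greedy policy:", pi_star)
```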

When we talk about value functions, if we do not make reference to time-dependence, we only mean the expected infinite-horizon discounted return. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time’s up?

There are two key connections between the value function and the action-value function that come up pretty often:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s,a)\big],$$

and

$$V^*(s) = \max_a Q^*(s,a)$$

These relations follow pretty directly from the definitions just given: can you prove them?
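One way to see the first relation, as a sketch: condition on the first action and apply the law of total expectation; the second relation follows from the same decomposition, with the expectation over the first action replaced by a maximum.

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]\Big] = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s,a)\big]$$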