Value Function

March 10, 2024

The value function $V^\pi$ for a policy $\pi$ measures the expected return if you start in state $s$ and follow policy $\pi$.

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \,\middle|\, S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s\right]$$
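As a quick numerical illustration (the reward sequence and discount below are assumptions, not from the text), a single sampled return $G_t$ is just a discounted sum of the rewards observed along one trajectory; $V^\pi(s)$ is the expectation of such returns over trajectories:

```python
import numpy as np

gamma = 0.9
# Assumed sample reward sequence R_t, R_{t+1}, ... from one rollout (illustrative only).
rewards = np.array([1.0, 0.0, 0.5, 1.0, 0.2])

# One sampled return: G_t = sum_k gamma^k * R_{t+k}, truncated to the observed rewards.
G_t = np.sum(gamma ** np.arange(len(rewards)) * rewards)
print(G_t)  # ~2.265 for this sequence; V^pi(s) is the average of such returns over trajectories
```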

A (deterministic and stationary) policy $\pi : S \to A$ specifies a decision-making strategy in which the agent chooses actions adaptively based on the current state, i.e., $a_t = \pi(s_t)$. More generally, the agent may also choose actions according to a stochastic policy $\pi : S \to \Delta(A)$, and with a slight abuse of notation we write $a_t \sim \pi(s_t)$. A deterministic policy is the special case in which $\pi(s)$ is a point mass for all $s \in S$. The goal of the agent is to choose a policy $\pi$ to maximize the expected discounted sum of rewards, or value:

$$\mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, \pi, s_1\right]$$

The expectation is with respect to the randomness of the trajectory, that is, the randomness in the state transitions and the stochasticity of $\pi$. Notice that, since $r_t$ is nonnegative and upper bounded by $R_{\max}$, we have

$$0 \le \sum_{t=1}^{\infty} \gamma^{t-1} r_t \le \sum_{t=1}^{\infty} \gamma^{t-1} R_{\max} = \frac{R_{\max}}{1-\gamma}$$

Hence, the discounted sum of rewards (or the discounted return) along any actual trajectory is always bounded in the range $\left[0, \frac{R_{\max}}{1-\gamma}\right]$, and so is its expectation of any form; the last equality above is just the geometric series $\sum_{t=1}^{\infty}\gamma^{t-1} = \frac{1}{1-\gamma}$ for $\gamma \in [0,1)$. Note that, for a fixed policy, its value may differ for different choices of $s_1$, and we define the value function $V_M^\pi : S \to \mathbb{R}$ as

$$V_M^\pi(s) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, \pi, s_1 = s\right]$$

which is the value obtained by following policy $\pi$ starting at state $s$. Similarly we define the action-value (or Q-value) function $Q_M^\pi : S \times A \to \mathbb{R}$ as

$$Q_M^\pi(s,a) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, \pi, s_1 = s, a_1 = a\right]$$
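To make these two definitions concrete, here is a minimal Monte Carlo sketch that estimates $V_M^\pi(s)$ and $Q_M^\pi(s,a)$ by averaging truncated discounted returns over rollouts. The 3-state, 2-action MDP ($P$, $R$), the policy, and the horizon below are all assumptions chosen for illustration; truncation is harmless because the tail is weighted by $\gamma^{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A made-up 3-state, 2-action MDP (assumed for illustration, not from the text).
# P[s, a] is the next-state distribution; R[s, a] is a bounded reward in [0, 1].
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.2, 1.0]])
pi = np.array([[0.7, 0.3],          # stochastic policy pi: S -> Delta(A)
               [0.5, 0.5],
               [0.1, 0.9]])

def sampled_return(s, first_action=None, horizon=200):
    """One truncated sample of sum_{t=1}^H gamma^{t-1} r_t, starting at s_1 = s
    (and taking a_1 = first_action if given, then following pi thereafter)."""
    G, discount, a = 0.0, 1.0, first_action
    for _ in range(horizon):
        if a is None:
            a = rng.choice(2, p=pi[s])       # a_t ~ pi(s_t)
        G += discount * R[s, a]
        s = rng.choice(3, p=P[s, a])         # s_{t+1} ~ P(. | s_t, a_t)
        discount *= gamma
        a = None                             # after the first step, always follow pi
    return G

V_hat = np.mean([sampled_return(s=0) for _ in range(2000)])
Q_hat = np.mean([sampled_return(s=0, first_action=1) for _ in range(2000)])
print(f"V^pi(0) ~= {V_hat:.3f}   Q^pi(0, 1) ~= {Q_hat:.3f}")
```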

It’s often useful to know the value of a state, or of a state-action pair. By value, we mean the expected return if you start in that state or state-action pair and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.

There are four main functions of note here:

  1. The On-Policy Value Function, $V^\pi(s)$, which gives the expected return if you start in state $s$ and always act according to policy $\pi$:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big]$$

  2. The On-Policy Action-Value Function, $Q^\pi(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]$$

  3. The Optimal Value Function, $V^*(s)$, which gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment:

$$V^*(s) = \max_\pi \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big]$$

  4. The Optimal Action-Value Function, $Q^*(s,a)$, which gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment:

$$Q^*(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]$$
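For a small tabular MDP, $Q^*$ and $V^*$ can be computed exactly rather than left as a maximum over policies. The sketch below runs value iteration on the same made-up 3-state, 2-action MDP used in the earlier sketch (again an assumption for illustration, not from the text), and reads off $V^*(s) = \max_a Q^*(s,a)$ together with a greedy optimal policy.

```python
import numpy as np

gamma = 0.9

# The same made-up 3-state, 2-action MDP as in the sketch above (assumed, not from the text).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.2, 1.0]])

# Value iteration: repeatedly apply the Bellman optimality backup
#   Q(s, a) <- R(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} Q(s', a'),
# which converges to Q* because the backup is a gamma-contraction.
Q = np.zeros_like(R)
for _ in range(1000):
    Q = R + gamma * (P @ Q.max(axis=1))

V_star = Q.max(axis=1)        # V*(s) = max_a Q*(s, a)
pi_star = Q.argmax(axis=1)    # a deterministic policy that is greedy w.r.t. Q*
print("Q*:\n", np.round(Q, 3))
print("V*:", np.round(V_star, 3), " greedy policy:", pi_star)
```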

When we talk about value functions, if we do not make reference to time-dependence, we only mean the expected infinite-horizon discounted return. Value functions for finite-horizon undiscounted return would need to accept time as an argument. Can you think about why? Hint: what happens when time’s up?

There are two key connections between the value function and the action-value function that come up pretty often:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s,a)\big],$$

and

$$V^*(s) = \max_a Q^*(s,a)$$

These relations follow pretty directly from the definitions just given: can you prove them?
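One way to see the first relation, as a sketch: condition on the first action and apply the law of total expectation; the second relation follows from the same decomposition, with the expectation over the first action replaced by a maximum.

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]\Big] = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s,a)\big]$$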