Tabular RL for Value Prediction
The Value Prediction Problem
Given a policy \(\pi\), we want to learn \(V^\pi\) or \(Q^\pi\)
Why is it useful? Recall that if we know how to compute \(Q^\pi\), we can run policy iteration
On-policy learning: data is generated by \(\pi\)
Off-policy learning: data is generated by some other policy
When actions are always chosen by a fixed policy, the MDP reduces to a Markov chain plus a reward function over states, also known as a Markov Reward Process (MRP)
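As a concrete illustration, here is a minimal sketch of this reduction for a tabular MDP stored as NumPy arrays; the array layout `P[s, a, s']`, `R[s, a]`, and `pi[s, a]` is an assumption made for the example, not notation from these notes.

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse a tabular MDP under a fixed policy into an MRP.

    Assumed (hypothetical) conventions:
      P[s, a, s'] : transition probabilities, shape (S, A, S)
      R[s, a]     : expected immediate reward, shape (S, A)
      pi[s, a]    : action probabilities of the fixed policy, shape (S, A)
    Returns the MRP transition matrix P_pi[s, s'] and reward vector R_pi[s].
    """
    P_pi = np.einsum("sa,sap->sp", pi, P)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    R_pi = np.einsum("sa,sa->s", pi, R)     # R^pi(s)    = sum_a pi(a|s) R(s,a)
    return P_pi, R_pi
```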
Monte-Carlo Value Prediction
If we can roll out trajectories from any starting state that we want, here is a simple procedure
For each \(s\), roll out \(n\) trajectories using policy \(\pi\)
- For episodic tasks, roll out until termination
- For continuing tasks, roll out to a length (typically \(H= O(1 / (1 - \gamma))\)) such that omitting future rewards has minimal impact (small truncation error)
- Let \(\hat{V}^\pi (s)\) (we will just write \(V(s)\)) be the average discounted return
Also works if we can draw starting state from an exploratory initial distribution (i.e., one that assigns non-zero probability to every state)
Keep generating trajectories until we have enough data points for each starting state
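A minimal sketch of this batch procedure, assuming a hypothetical simulator interface (`env.reset(s)`, `env.step(a)`) and a sampled policy `pi(s)`; none of these names come from the notes.

```python
import numpy as np

def mc_value_prediction(env, pi, states, n_traj, gamma, horizon):
    """Batch Monte-Carlo prediction: for each state, roll out n_traj
    trajectories under pi and average the (truncated) discounted returns.

    Assumed (hypothetical) interface:
      env.reset(s) -> sets the state to s and returns it
      env.step(a)  -> returns (next_state, reward, done)
      pi(s)        -> samples an action
    horizon ~ O(1 / (1 - gamma)) keeps the truncation error small.
    """
    V = {}
    for s in states:
        returns = []
        for _ in range(n_traj):
            state, G, discount = env.reset(s), 0.0, 1.0
            for _ in range(horizon):
                state, r, done = env.step(pi(state))
                G += discount * r
                discount *= gamma
                if done:            # episodic task: stop at termination
                    break
            returns.append(G)
        V[s] = np.mean(returns)
    return V
```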
Implementing MC in an online manner
- The previous procedure assumes that we collect all the data, store them, and then process them (batch-mode learning)
- Can we process each data point as they come, without ever needing to store them? (online, one-pass algorithm)
\[\begin{align*} \text{For } & i = 1, 2, \ldots \\ & \text{Draw a starting state } s_i \text{ from the exploratory initial distribution,} \\ & \text{roll out a trajectory using } \pi \text{ from } s_i \text{, and let } G_i \text{ be the (random)} \\ & \text{discounted return} \\ & \text{Let } n(s_i) \text{ be the number of times } s_i \text{ has appeared as an initial} \\ & \text{state. If $n(s_i) = 1$ (first time seeing this state), let $V(s_i) \gets G_i$} \\ & \text{Otherwise, $V(s_i) \gets \frac{n(s_i) - 1}{n(s_i)} V(s_i) + \frac{1}{n(s_i)}G_i$} \\ & \text{Verify: at any point, $V(s)$ is always the MC estimate using the} \\ & \text{trajectories starting from $s$ available so far} \end{align*}\]
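A one-pass sketch of the boxed procedure, assuming hypothetical helpers `draw_start()` and `rollout_return(s)` for sampling the initial state and the discounted return.

```python
from collections import defaultdict

def online_mc(draw_start, rollout_return, n_iters):
    """One-pass Monte-Carlo: process each return as it arrives.

    Assumed (hypothetical) helpers:
      draw_start()      -> samples s_i from the exploratory initial distribution
      rollout_return(s) -> rolls out pi from s and returns the discounted return G_i
    Maintains V(s) as the exact running average of returns seen from s.
    """
    V = defaultdict(float)
    n = defaultdict(int)
    for _ in range(n_iters):
        s = draw_start()
        G = rollout_return(s)
        n[s] += 1
        V[s] += (G - V[s]) / n[s]   # equivalent to V <- (n-1)/n * V + G/n
    return V
```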
More generally, \(V(s_i) \gets (1 - \alpha) V(s_i) + \alpha G_i\)
\(\alpha\) is known as the step size or the learning rate
In theory, convergence requires that the sum of \(\alpha\) diverges to infinity while the sum of \(\alpha^2\) stays finite; in practice, a constant small \(\alpha\) is often used
\(G_i\) is often called the \(\text{target}\)
The expected value of the target is what we want to update our estimate to, but since it’s noisy, we only move slightly toward it
Alternative expression: \(V(s_i) \gets V(s_i) + \alpha(G_i - V(s_i))\)
Moving the estimate in the direction of error (= target - current)
Can be interpreted as stochastic gradient descent
If we have i.i.d. real random variables \(v_1, v_2, \ldots, v_n\), the average is the solution of the least-squares optimization problem:
\[ \min_v \frac{1}{2n} \sum_{i=1}^n (v - v_i)^2\]
- Stochastic gradient: \(v - v_i\) (for uniformly random \(i\))
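A quick numerical check of this SGD view (the data and step-size schedule here are illustrative assumptions): running \(v \gets v - \alpha_i (v - v_i)\) with \(\alpha_i = 1/i\) reproduces the sample average.

```python
import numpy as np

# SGD on (1/2n) * sum_i (v - v_i)^2 with step size alpha_i = 1/i
# recovers the running average of the samples.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=10_000)

v = 0.0
for i, v_i in enumerate(samples, start=1):
    v -= (v - v_i) / i          # stochastic gradient step, alpha_i = 1/i

print(v, samples.mean())        # the two agree (up to floating-point error)
```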
Every-visit Monte-Carlo
Suppose we have a continuing task. What if we cannot set the starting state arbitrarily?
Let’s say we only have one single long trajectory
\[ s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ... \]
(By long trajectory, we mean trajectory length >> effective horizon \(H = O(1 / (1 - \gamma))\))
On-policy: \(a_t \sim \pi(s_t)\), where \(\pi\) is the policy we want to evaluate
Algorithm: for each \(s\), find all \(t\) such that \(s_t = s\), calculate the discounted sum of rewards between time step \(t\) and \(t + H\), and take the average over them as \(V(s)\)
Convergence requires additional assumptions: the Markov chain induced by \(\pi\) is ergodic, implying that all states will be visited infinitely often as the trajectory length grows to infinity
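A sketch of this single-trajectory procedure, assuming the trajectory is stored as parallel lists `states` and `rewards` (0-indexed); the storage format is an assumption for illustration.

```python
from collections import defaultdict

def mc_from_long_trajectory(states, rewards, gamma, H):
    """Every-visit MC on one long trajectory in a continuing task.

    For every time step t, compute the discounted sum of rewards over the
    window [t, t + H) and average these windows per state s_t.
    states[t], rewards[t] hold the state and reward at step t.
    """
    T = len(states)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in range(T - H):              # only windows that fit fully
        G, discount = 0.0, 1.0
        for k in range(H):
            G += discount * rewards[t + k]
            discount *= gamma
        sums[states[t]] += G
        counts[states[t]] += 1
    return {s: sums[s] / counts[s] for s in sums}
```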
Every-visit Monte-Carlo
We can use this idea to improve the algorithm when we can choose the starting state and the MDP is episodic
i.e., obtain a random return for each state visited on the trajectory
What if a state occurs multiple times on a trajectory?
Approach 1: only the 1st occurrence is used (first-visit MC)
Approach 2: all of them are used (every-visit MC)
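A sketch covering both approaches for episodic data, assuming each episode is stored as a `(states, rewards)` pair; the `first_visit` flag switches between the two.

```python
from collections import defaultdict

def mc_episodic(episodes, gamma, first_visit=True):
    """First-visit vs every-visit MC for episodic trajectories.

    episodes: list of (states, rewards) pairs, one per episode.
    Returns V(s) as the average of the returns recorded for s.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for states, rewards in episodes:
        # Returns-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}
        G = 0.0
        returns = [0.0] * len(states)
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            returns[t] = G
        seen = set()
        for t, s in enumerate(states):
            if first_visit and s in seen:
                continue                # first-visit: use only the 1st occurrence
            seen.add(s)
            sums[s] += returns[t]
            counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}
```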
Alternative Approach: TD(0)
- Again suppose we have a single long trajectory
\[ s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ...\]
in a continuing task
\(\text{TD(0): for } t = 1, 2, 3, ..., V(s_t) \gets V(s_t) + \alpha (r_t + \gamma V(s_{t+1}) - V(s_t))\)
TD = temporal difference
\(r_t + \gamma V(s_{t + 1}) - V(s_t):\) TD-error
The same structure as the MC update rule, except that we are using a different target here: \(r_t + \gamma V(s_{t + 1})\)
Often called the bootstrapped target: the target value depends on our current estimated value function \(V\)
Conditioned on \(s_t\), what is the expected value of the target (taking expectation over the randomness of \(r_t, s_{t + 1}\))?
It’s \((\mathcal{T}^\pi V)(s_t)\)
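A minimal TD(0) sketch, assuming the data arrive as a stream of `(s_t, r_t, s_{t+1})` transitions generated by \(\pi\); the stream representation is an assumption for illustration.

```python
from collections import defaultdict

def td0(transitions, gamma, alpha):
    """TD(0) on a stream of (s_t, r_t, s_{t+1}) transitions generated by pi.

    Each update moves V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1}).
    """
    V = defaultdict(float)
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]   # TD-error
        V[s] += alpha * td_error
    return V
```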
Understanding TD(0)
\(V(s_t) \gets V(s_t) + \alpha (r_t + \gamma V(s_{t + 1}) - V(s_t))\)
Imagine a slightly different procedure
Initialize \(\textcolor{red}{V}\) and \(\textcolor{blue}{V^\prime}\) arbitrarily
Keep running \(\textcolor{blue}{V^\prime}\)\((s_t) \gets \textcolor{blue}{V^\prime}(s_t) + \alpha (r_t + \gamma \textcolor{red}{V}(s_{t + 1}) - \textcolor{blue}{V^\prime}(s_t))\)
Note that only \(\textcolor{blue}{V^\prime}\) is being updated; \(\textcolor{red}{V}\) doesn’t change
What’s the relationship between \(\textcolor{red}{V}\) and \(\textcolor{blue}{V^\prime}\) \(\text{after long enough?}\)
\(\textcolor{blue}{V^\prime} = \mathcal{T}^\pi \textcolor{red}{V}\): we’ve completed one iteration of value iteration for solving \(V^\pi\)
Copy \(\textcolor{blue}{V^\prime}\) to \(\textcolor{red}{V}\), and repeat this procedure again and again
TD(0): almost the same, except that we don’t wait. Copy \(\textcolor{blue}{V^\prime}\) to \(\textcolor{red}{V}\) after every update!
Algorithms that wait have actually made a comeback in deep RL
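A sketch of the two-copy procedure above (the data layout and sweep structure are assumptions for illustration): \(\textcolor{blue}{V^\prime}\) is updated against a frozen \(\textcolor{red}{V}\), which is only overwritten at the end of each sweep; copying after every update instead recovers plain TD(0).

```python
from collections import defaultdict

def td0_with_frozen_copy(transitions, gamma, alpha, n_sweeps):
    """The 'two-copy' variant: update V' against a frozen V, then copy.

    Each sweep over the transitions drives V' toward (T^pi V); copying V'
    into V and repeating approximates iterating the Bellman operator T^pi.
    """
    V = defaultdict(float)       # frozen copy used inside the target
    Vp = defaultdict(float)      # the copy being updated
    for _ in range(n_sweeps):
        for s, r, s_next in transitions:
            target = r + gamma * V[s_next]          # bootstraps on the frozen V
            Vp[s] += alpha * (target - Vp[s])
        V = Vp.copy()            # copy V' into V, then repeat
    return V
```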
TD(0) vs MC
TD(0) target: \(r_t + \gamma V(s_{t+1})\)
MC target: \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
MC target is unbiased: the expectation of the target is \(V^\pi(s_t)\)
TD(0) target is biased (w.r.t. \(V^\pi(s_t)\)): the expected target is \((\mathcal{T}^\pi V)(s_t)\)
Although the expected target is not \(V^\pi\), it’s closer to \(V^\pi\) than where we are now (recall that \(\mathcal{T}^\pi\) is a contraction)
On the other hand, TD(0) has lower variance than MC
Bias vs variance trade-off
Also a practical concern: when the time-step interval is too small (e.g., in robotics), \(V(s_t)\) and \(V(s_{t+1})\) can be very close, and their difference can be buried by errors
\(\text{TD}(\lambda)\): Unifying TD(0) and MC
1-step bootstrap (= TD(0)): \(r_t + \gamma V(s_{t+1})\)
2-step bootstrap: \(r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})\)
3-step bootstrap: \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3})\)
\(\qquad\qquad\qquad\qquad\qquad \vdots\)
\(\infty\)-step bootstrap (=MC=TD(1)): \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
Exercise: What’s the expected target in the n-step bootstrap? \((\mathcal{T}^\pi)^n V\)
- \(G^n_{t} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\)
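A small helper computing this n-step target under the indexing convention of these notes; it assumes the stored trajectory extends at least to step \(t + n\).

```python
def n_step_target(rewards, states, V, t, n, gamma):
    """n-step bootstrapped target starting at time t:
       r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n})
    rewards[t], states[t] hold r_t and s_t; V maps states to current estimates.
    """
    G, discount = 0.0, 1.0
    for k in range(n):
        G += discount * rewards[t + k]
        discount *= gamma
    return G + discount * V[states[t + n]]
```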
\(\text{TD}(\lambda)\): a weighted combination of the n-step bootstrapped targets, with weighting scheme \((1-\lambda) \lambda^{n - 1}\)
\(\lambda = 0\): only \(n = 1\) gets full weight, recovering TD(0)
limit \(\lambda \rightarrow 1\): (almost) MC
Why the choice of \((1 - \lambda) \lambda^{n-1}\)?
Enables efficient online implementation
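A forward-view sketch of the \(\text{TD}(\lambda)\) target as this weighted combination, truncated at `n_max` steps (the leftover weight \(\lambda^{n_\text{max}-1}\) goes to the longest target); it reuses the `n_step_target` helper sketched above, and the truncation scheme is an assumption for illustration.

```python
def lambda_return(rewards, states, V, t, gamma, lam, n_max):
    """Forward-view TD(lambda) target: weighted combination of n-step targets
    with weights (1 - lambda) * lambda^(n-1), truncated at n_max steps.
    The remaining weight mass, lambda^(n_max - 1), is put on the n_max-step target.
    """
    total, weight_left = 0.0, 1.0
    for n in range(1, n_max):
        w = (1.0 - lam) * lam ** (n - 1)
        total += w * n_step_target(rewards, states, V, t, n, gamma)
        weight_left -= w
    total += weight_left * n_step_target(rewards, states, V, t, n_max, gamma)
    return total
```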
Backward view of \(\text{TD}(\lambda)\)