Tabular RL for Value Prediction
The Value Prediction Problem
Given a policy \(\pi\), we want to learn \(V^\pi\) or \(Q^\pi\)
Why is it useful? Recall that if we know how to compute \(Q^\pi\), we can run policy iteration
On-policy learning: data is generated by \(\pi\)
Off-policy learning: data is generated by some other policy
When actions are always chosen by a fixed policy, the MDP reduces to a Markov chain plus a reward function over states, also known as a Markov Reward Process (MRP)
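As a concrete illustration, here is a minimal sketch of this reduction for a tabular MDP stored as NumPy arrays; the array layout `P[s, a, s']`, `R[s, a]`, and `pi[s, a]` is an assumption made for the example, not notation from these notes.

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse a tabular MDP under a fixed policy into an MRP.

    Assumed (hypothetical) conventions:
      P[s, a, s'] : transition probabilities, shape (S, A, S)
      R[s, a]     : expected immediate reward, shape (S, A)
      pi[s, a]    : action probabilities of the fixed policy, shape (S, A)
    Returns the MRP transition matrix P_pi[s, s'] and reward vector R_pi[s].
    """
    P_pi = np.einsum("sa,sap->sp", pi, P)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    R_pi = np.einsum("sa,sa->s", pi, R)     # R^pi(s)    = sum_a pi(a|s) R(s,a)
    return P_pi, R_pi
```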
Monte-Carlo Value Prediction
If we can roll out trajectories from any starting state that we want, here is a simple procedure
For each \(s\), roll out \(n\) trajectories using policy \(\pi\)
- For episodic tasks, roll out until termination
- For continuing tasks, roll out to a length (typically \(H= O(1 / (1 - \gamma))\)) such that omitting future rewards has minimal impact (small truncation error)
- Let \(\hat{V}^\pi (s)\) (we will just write \(V(s)\)) be the average discounted return
Also works if we can draw starting state from an exploratory initial distribution (i.e., one that assigns non-zero probability to every state)
Keep generating trajectories until we have enough data points for each starting state
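A minimal sketch of this batch procedure, assuming a hypothetical simulator interface (`env.reset(s)`, `env.step(a)`) and a sampled policy `pi(s)`; none of these names come from the notes.

```python
import numpy as np

def mc_value_prediction(env, pi, states, n_traj, gamma, horizon):
    """Batch Monte-Carlo prediction: for each state, roll out n_traj
    trajectories under pi and average the (truncated) discounted returns.

    Assumed (hypothetical) interface:
      env.reset(s) -> sets the state to s and returns it
      env.step(a)  -> returns (next_state, reward, done)
      pi(s)        -> samples an action
    horizon ~ O(1 / (1 - gamma)) keeps the truncation error small.
    """
    V = {}
    for s in states:
        returns = []
        for _ in range(n_traj):
            state, G, discount = env.reset(s), 0.0, 1.0
            for _ in range(horizon):
                state, r, done = env.step(pi(state))
                G += discount * r
                discount *= gamma
                if done:            # episodic task: stop at termination
                    break
            returns.append(G)
        V[s] = np.mean(returns)
    return V
```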
Implementing MC in an online manner
- The previous procedure assumes that we collect all the data, store them, and then process them (batch-mode learning)
- Can we process each data point as they come, without ever needing to store them? (online, one-pass algorithm)
\[\begin{align*} \text{For } & i = 1, 2, \ldots \\ & \text{Draw a starting state } s_i \text{ from the exploratory initial distribution,} \\ & \text{roll out a trajectory using } \pi \text{ from } s_i \text{, and let } G_i \text{ be the (random)} \\ & \text{discounted return} \\ & \text{Let } n(s_i) \text{ be the number of times } s_i \text{ has appeared as an initial} \\ & \text{state. If $n(s_i) = 1$ (first time seeing this state), let $V(s_i) \gets G_i$} \\ & \text{Otherwise, $V(s_i) \gets \frac{n(s_i) - 1}{n(s_i)} V(s_i) + \frac{1}{n(s_i)}G_i$} \\ & \text{Verify: at any point, $V(s)$ is always the MC estimate using the} \\ & \text{trajectories starting from $s$ available so far} \end{align*}\]
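A one-pass sketch of the boxed procedure, assuming hypothetical helpers `draw_start()` and `rollout_return(s)` for sampling the initial state and the discounted return.

```python
from collections import defaultdict

def online_mc(draw_start, rollout_return, n_iters):
    """One-pass Monte-Carlo: process each return as it arrives.

    Assumed (hypothetical) helpers:
      draw_start()      -> samples s_i from the exploratory initial distribution
      rollout_return(s) -> rolls out pi from s and returns the discounted return G_i
    Maintains V(s) as the exact running average of returns seen from s.
    """
    V = defaultdict(float)
    n = defaultdict(int)
    for _ in range(n_iters):
        s = draw_start()
        G = rollout_return(s)
        n[s] += 1
        V[s] += (G - V[s]) / n[s]   # equivalent to V <- (n-1)/n * V + G/n
    return V
```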
More generally, \(V(s_i) \gets (1 - \alpha) V(s_i) + \alpha G_i\)
\(\alpha\) is known as the step size or the learning rate
In theory, convergence requires that the sum of \(\alpha\) diverges to infinity while the sum of \(\alpha^2\) stays finite; in practice, a constant small \(\alpha\) is often used
\(G_i\) is often called the \(\text{target}\)
The expected value of the target is what we want to update our estimate to, but since it’s noisy, we only move slightly toward it
Alternative expression: \(V(s_i) \gets V(s_i) + \alpha(G_i - V(s_i))\)
Moving the estimate in the direction of error (= target - current)
Can be interpreted as stochastic gradient descent
If we have i.i.d. real random variables \(v_1, v_2, \ldots, v_n\), the average is the solution of the least-squares optimization problem:
\[ \min_v \frac{1}{2n} \sum_{i=1}^n (v - v_i)^2\]
- Stochastic gradient: \(v - v_i\) (for uniformly random \(i\))
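A quick numerical check of this SGD view (the data and step-size schedule here are illustrative assumptions): running \(v \gets v - \alpha_i (v - v_i)\) with \(\alpha_i = 1/i\) reproduces the sample average.

```python
import numpy as np

# SGD on (1/2n) * sum_i (v - v_i)^2 with step size alpha_i = 1/i
# recovers the running average of the samples.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=10_000)

v = 0.0
for i, v_i in enumerate(samples, start=1):
    v -= (v - v_i) / i          # stochastic gradient step, alpha_i = 1/i

print(v, samples.mean())        # the two agree (up to floating-point error)
```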
Every-visit Monte-Carlo
Suppose we have a continuing task. What if we cannot set the starting state arbitrarily?
Let’s say we only have one single long trajectory
\[ s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ... \]
(By long trajectory, we mean trajectory length >> effective horizon \(H = O(1 / (1 - \gamma))\))
On-policy: \(a_t \sim \pi(s_t)\), where \(\pi\) is the policy we want to evaluate
Algorithm: for each \(s\), find all \(t\) such that \(s_t = s\), calculate the discounted sum of rewards between time step \(t\) and \(t + H\), and take the average over them as \(V(s)\)
Convergence requires additional assumptions: the Markov chain induced by \(\pi\) is ergodic, implying that all states will be visited infinitely often as the trajectory length grows to infinity
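A sketch of this single-trajectory procedure, assuming the trajectory is stored as parallel lists `states` and `rewards` (0-indexed); the storage format is an assumption for illustration.

```python
from collections import defaultdict

def mc_from_long_trajectory(states, rewards, gamma, H):
    """Every-visit MC on one long trajectory in a continuing task.

    For every time step t, compute the discounted sum of rewards over the
    window [t, t + H) and average these windows per state s_t.
    states[t], rewards[t] hold the state and reward at step t.
    """
    T = len(states)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in range(T - H):              # only windows that fit fully
        G, discount = 0.0, 1.0
        for k in range(H):
            G += discount * rewards[t + k]
            discount *= gamma
        sums[states[t]] += G
        counts[states[t]] += 1
    return {s: sums[s] / counts[s] for s in sums}
```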
Every-visit Monte-Carlo
We can use this idea to improve the algorithm when we can choose the starting state and the MDP is episodic
i.e., obtain a random return for each state visited on the trajectory
What if a state occurs multiple times on a trajectory?
Approach 1: only the 1st occurrence is used (first-visit MC)
Approach 2: all of them are used (every-visit MC)
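A sketch covering both approaches for episodic data, assuming each episode is stored as a `(states, rewards)` pair; the `first_visit` flag switches between the two.

```python
from collections import defaultdict

def mc_episodic(episodes, gamma, first_visit=True):
    """First-visit vs every-visit MC for episodic trajectories.

    episodes: list of (states, rewards) pairs, one per episode.
    Returns V(s) as the average of the returns recorded for s.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for states, rewards in episodes:
        # Returns-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}
        G = 0.0
        returns = [0.0] * len(states)
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            returns[t] = G
        seen = set()
        for t, s in enumerate(states):
            if first_visit and s in seen:
                continue                # first-visit: use only the 1st occurrence
            seen.add(s)
            sums[s] += returns[t]
            counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}
```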
Alternative Approach: TD(0)
- Again suppose we have a single long trajectory
\[ s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ...\]
in a continuing task
\(\text{TD(0): for } t = 1, 2, 3, ..., V(s_t) \gets V(s_t) + \alpha (r_t + \gamma V(s_{t+1}) - V(s_t))\)
TD = temporal difference
\(r_t + \gamma V(s_{t + 1}) - V(s_t):\) TD-error
The same structure as the MC update rule, except that we are using a different target here: \(r_t + \gamma V(s_{t + 1})\)
Often called the bootstrapped target: the target value depends on our current estimated value function \(V\)
Conditioned on \(s_t\), what is the expected value of the target (taking expectation over the randomness of \(r_t, s_{t + 1}\))?
It’s \((\mathcal{T}^\pi V)(s_t)\)
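A minimal TD(0) sketch, assuming the data arrive as a stream of `(s_t, r_t, s_{t+1})` transitions generated by \(\pi\); the stream representation is an assumption for illustration.

```python
from collections import defaultdict

def td0(transitions, gamma, alpha):
    """TD(0) on a stream of (s_t, r_t, s_{t+1}) transitions generated by pi.

    Each update moves V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1}).
    """
    V = defaultdict(float)
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]   # TD-error
        V[s] += alpha * td_error
    return V
```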
Understanding TD(0)
\(V(s_t) \gets V(s_t) + \alpha (r_t + \gamma V(s_{t + 1}) - V(s_t))\)
Imagine a slightly different procedure
Initialize \(\textcolor{red}{V}\) and \(\textcolor{blue}{V^\prime}\) arbitrarily
Keep running \(\textcolor{blue}{V^\prime}\)\((s_t) \gets \textcolor{blue}{V^\prime}(s_t) + \alpha (r_t + \gamma \textcolor{red}{V}(s_{t + 1}) - \textcolor{blue}{V^\prime}(s_t))\)
Note that only \(\textcolor{blue}{V^\prime}\) is being updated; \(\textcolor{red}{V}\) doesn’t change
What’s the relationship between \(\textcolor{red}{V}\) and \(\textcolor{blue}{V^\prime}\) \(\text{after long enough?}\)
\(\textcolor{blue}{V^\prime} = \mathcal{T}^\pi \textcolor{red}{V}\): we’ve completed one iteration of value iteration for solving \(V^\pi\)
Copy \(\textcolor{blue}{V^\prime}\) to \(\textcolor{red}{V}\), and repeat this procedure again and again
TD(0): almost the same, except that we don’t wait. Copy \(\textcolor{blue}{V^\prime}\) to \(\textcolor{red}{V}\) after every update!
Algorithms that wait have actually made a comeback in deep RL
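A sketch of the two-copy procedure above (the data layout and sweep structure are assumptions for illustration): \(\textcolor{blue}{V^\prime}\) is updated against a frozen \(\textcolor{red}{V}\), which is only overwritten at the end of each sweep; copying after every update instead recovers plain TD(0).

```python
from collections import defaultdict

def td0_with_frozen_copy(transitions, gamma, alpha, n_sweeps):
    """The 'two-copy' variant: update V' against a frozen V, then copy.

    Each sweep over the transitions drives V' toward (T^pi V); copying V'
    into V and repeating approximates iterating the Bellman operator T^pi.
    """
    V = defaultdict(float)       # frozen copy used inside the target
    Vp = defaultdict(float)      # the copy being updated
    for _ in range(n_sweeps):
        for s, r, s_next in transitions:
            target = r + gamma * V[s_next]          # bootstraps on the frozen V
            Vp[s] += alpha * (target - Vp[s])
        V = Vp.copy()            # copy V' into V, then repeat
    return V
```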
TD(0) vs MC
TD(0) target: \(r_t + \gamma V(s_{t+1})\)
MC target: \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
MC target is unbiased: the expectation of the target is \(V^\pi(s_t)\)
TD(0) target is biased (w.r.t. \(V^\pi(s_t)\)): the expected target is \((\mathcal{T}^\pi V)(s_t)\)
Although the expected target is not \(V^\pi\), it’s closer to \(V^\pi\) than where we are now (recall that \(\mathcal{T}^\pi\) is a contraction)
On the other hand, TD(0) has lower variance than MC
Bias vs variance trade-off
Also a practical concern: when the time-step interval is too small (e.g., in robotics), \(V(s_t)\) and \(V(s_{t+1})\) can be very close, and their difference can be buried by errors
\(\text{TD}(\lambda)\): Unifying TD(0) and MC
1-step bootstrap (= TD(0)): \(r_t + \gamma V(s_{t+1})\)
2-step bootstrap: \(r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})\)
3-step bootstrap: \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3})\)
\(\qquad\qquad\qquad\qquad\qquad \vdots\)
\(\infty\)-step bootstrap (=MC=TD(1)): \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
Exercise: What’s the expected target in the n-step bootstrap? \((\mathcal{T}^\pi)^n V\)
- \(G^n_{t} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\)
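A small helper computing this n-step target under the indexing convention of these notes; it assumes the stored trajectory extends at least to step \(t + n\).

```python
def n_step_target(rewards, states, V, t, n, gamma):
    """n-step bootstrapped target starting at time t:
       r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n})
    rewards[t], states[t] hold r_t and s_t; V maps states to current estimates.
    """
    G, discount = 0.0, 1.0
    for k in range(n):
        G += discount * rewards[t + k]
        discount *= gamma
    return G + discount * V[states[t + n]]
```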
\(\text{TD}(\lambda)\): a weighted combination of the n-step bootstrapped targets, with weighting scheme \((1-\lambda) \lambda^{n - 1}\)
\(\lambda = 0\): only \(n = 1\) gets full weight, recovering TD(0)
limit \(\lambda \rightarrow 1\): (almost) MC
Why the choice of \((1 - \lambda) \lambda^{n-1}\)?
Enables efficient online implementation
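A forward-view sketch of the \(\text{TD}(\lambda)\) target as this weighted combination, truncated at `n_max` steps (the leftover weight \(\lambda^{n_\text{max}-1}\) goes to the longest target); it reuses the `n_step_target` helper sketched above, and the truncation scheme is an assumption for illustration.

```python
def lambda_return(rewards, states, V, t, gamma, lam, n_max):
    """Forward-view TD(lambda) target: weighted combination of n-step targets
    with weights (1 - lambda) * lambda^(n-1), truncated at n_max steps.
    The remaining weight mass, lambda^(n_max - 1), is put on the n_max-step target.
    """
    total, weight_left = 0.0, 1.0
    for n in range(1, n_max):
        w = (1.0 - lam) * lam ** (n - 1)
        total += w * n_step_target(rewards, states, V, t, n, gamma)
        weight_left -= w
    total += weight_left * n_step_target(rewards, states, V, t, n_max, gamma)
    return total
```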
Backward view of \(\text{TD}(\lambda)\)