Major limitation of tabular RL: it does not scale to large state spaces
Most methods require that we run into the same state multiple times
When the state space is large, you might not see the same state even twice!
In other words: sample complexity scales with \(|S|\)
Need generalization
For the value prediction problem, generalization requires some prior knowledge about the form of the value function
- Linear function approximation: design features \(\phi (s) \in \mathbb{R}^d\) (featurizing states), and approximate \(V^\pi (s) \approx \theta^\top \phi(s)\)
- Only unknown is \(\theta\): \(d\) unknowns vs \(|S|\) unknowns!
Draw a starting state \(s_i\) from the exploratory initial distribution, roll out a trajectory using \(\pi\) from \(s_i\), and let \(G_i\) be the (random) discounted return
Collect \(\{(s_i, G_i)\}\) pairs
Least-squares regression: \(\min_\theta \frac{1}{n} \sum_{i=1}^n (\theta^\top \phi (s_i) - G_i)^2\)
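A minimal sketch of this procedure for the linear case, assuming a hypothetical Gymnasium-style `env` (with `reset`/`step`), a `policy` function, and a feature map `phi`; the regression is solved in closed form with least squares:

```python
import numpy as np

def mc_linear_value_prediction(env, policy, phi, gamma=0.99, n_episodes=1000):
    """Monte Carlo value prediction with linear least-squares regression.

    Assumes a hypothetical Gymnasium-style env (reset/step), a policy s -> a,
    and a feature map phi(s) -> np.ndarray of shape (d,).
    """
    features, returns = [], []
    for _ in range(n_episodes):
        s, _ = env.reset()                 # s_i drawn from the exploratory initial distribution
        start_features = phi(s)
        rewards, done = [], False
        while not done:                    # roll out a trajectory using pi from s_i
            s, r, terminated, truncated, _ = env.step(policy(s))
            rewards.append(r)
            done = terminated or truncated
        g = 0.0                            # discounted return G_i of the rollout
        for r in reversed(rewards):
            g = r + gamma * g
        features.append(start_features)
        returns.append(g)
    Phi, G = np.array(features), np.array(returns)
    # least-squares regression: min_theta (1/n) sum_i (theta^T phi(s_i) - G_i)^2
    theta, *_ = np.linalg.lstsq(Phi, G, rcond=None)
    return theta
```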
Why does this work?
- Assume \(\{(s_i, G_i)\}\) are i.i.d., and let \((s, G)\) be a pair equal to them in distribution
- The expected version of the objective: \(\min_\theta \mathbb{E}_{s, G} \left[(\theta^\top \phi(s) - G)^2\right]\)
- If we do not restrict ourselves to linear functions, the function that minimizes this objective is \(s \mapsto \mathbb{E} \left[G \mid s\right] (= V^\pi (s))\)
- If the true \(V^\pi\) happens to take a linear form, the regression will find it in the limit (of infinite data)
- Finite sample regime: bias & variance trade-off
The same idea applies to non-linear value function approximation
More generally and abstractly, think of function approximation as searching over a restricted function space, i.e., a set whose members are functions that map states to real values
Function space of linear value function approximation: \(\mathcal{F} = \{V_\theta : \theta \in \mathbb{R}^d\}\), where \(V_\theta (s) = \theta^\top \phi(s)\)
- Typically only a small subset of all possible functions
- Using all possible functions = tabular!
- Equivalently, tabular MC value prediction can be recovered by choosing \(\phi\) as the identity features \(\phi (s) = \{\mathbb{I} [s = s^\prime]\}_{s^\prime \in S}\) (see the short derivation below)
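With identity (one-hot) features, \(\theta^\top \phi(s_i) = \theta_{s_i}\), so the least-squares objective decouples across states:
\[ \min_\theta \frac{1}{n} \sum_{i=1}^n (\theta_{s_i} - G_i)^2 = \frac{1}{n} \sum_{s^\prime \in S} \min_{\theta_{s^\prime}} \sum_{i: s_i = s^\prime} (\theta_{s^\prime} - G_i)^2, \qquad \theta^\star_{s^\prime} = \frac{\sum_{i: s_i = s^\prime} G_i}{|\{i: s_i = s^\prime\}|} \]
i.e., each \(\theta^\star_{s^\prime}\) is the average return observed from \(s^\prime\), which is exactly what tabular MC value prediction computes.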
\(\min_{V_\theta \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (V_\theta (s_i) - G_i)^2\)
Plug in any function approximator of your choice
SGD: uniformly sample \(i\) and \(\theta \gets \theta - \alpha \cdot (V_\theta (s_i) - G_i) \cdot \nabla V_\theta (s_i)\)
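For the linear case \(\nabla V_\theta(s_i) = \phi(s_i)\), so one such step is just a couple of lines (illustrative names):

```python
import numpy as np

def sgd_mc_step(theta, phi_si, G_i, alpha=0.01):
    """One SGD step on (theta^T phi(s_i) - G_i)^2; for a linear
    approximator the gradient of V_theta(s_i) is simply phi(s_i)."""
    error = theta @ phi_si - G_i       # V_theta(s_i) - G_i
    return theta - alpha * error * phi_si
```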
Tabular: \(V(s_t) \gets V(s_t) + \alpha (r_t + \gamma V(s_{t+1}) - V(s_t))\)
When we update \(V(s_t)\), the target is \(r_t + \gamma V(s_{t + 1})\)
Batch version of the algorithm: one Bellman update can be approximated (using all data) as \[ V_{k+1} \gets \text{arg}\min_{V_\theta \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T (V_\theta (s_t) - r_t - \gamma V_k(s_{t + 1}))^2\]
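A sketch of this batch iteration for the linear case; `Phi`, `R`, `Phi_next` are assumed to hold the featurized transitions \((s_t, r_t, s_{t+1})\), and terminal-state handling is omitted:

```python
import numpy as np

def fitted_value_prediction(Phi, R, Phi_next, gamma=0.99, n_iters=50):
    """Batch approximate Bellman updates with a linear approximator.

    Phi, Phi_next: (T, d) feature matrices for s_t and s_{t+1}; R: (T,) rewards.
    Each iteration regresses V_theta onto targets r_t + gamma * V_k(s_{t+1}).
    """
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        targets = R + gamma * (Phi_next @ theta)   # built from the previous iterate V_k
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta
```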
SGD + no-wait: \(\theta \gets \theta - \alpha \cdot (V_\theta (s_t) - r_t - \gamma V_\theta (s_{t + 1})) \cdot \nabla V_\theta (s_t)\)
When using a linear function approximator \(V_\theta (s) = \phi(s)^\top \theta\), we have \(\theta \gets \theta - \alpha \cdot (\phi(s_t)^\top \theta - r_t - \gamma \phi (s_{t+1})^\top \theta) \cdot \phi (s_t)\)
When applying the chain rule, we only take the gradient of \(V_\theta(s_t)\) and ignore \(V_\theta (s_{t+1})\); the latter is treated as a constant (it plays the role of \(V_k\))
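A sketch of this online update for the linear case (illustrative names); the bootstrapped target is held fixed, so only \(\phi(s_t)\) appears as the gradient factor:

```python
import numpy as np

def td0_linear_step(theta, phi_s, r, phi_next, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) step with V_theta(s) = theta^T phi(s)."""
    target = r + gamma * (theta @ phi_next)   # bootstrapped target, treated as a constant
    td_error = theta @ phi_s - target
    return theta - alpha * td_error * phi_s
```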
What happens if we also differentiate \(V_\theta(s_{t+1})\)?
This corresponds to \({\text{arg}\min}_{V_\theta \in \mathcal{F}} \sum_{(s, r, s^\prime)} (V_\theta (s) - r - \gamma V_\theta (s^\prime))^2\)
- no iteration anymore; a clean optimization objective
- (most RL algorithms with bootstrapped target do not have a fixed optimization objective; objective changes over time)
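Differentiating both \(V_\theta(s)\) and \(V_\theta(s^\prime)\) yields the per-sample update (often called the residual-gradient update): \[\theta \gets \theta - \alpha \cdot (V_\theta (s) - r - \gamma V_\theta (s^\prime)) \cdot \left(\nabla V_\theta (s) - \gamma \nabla V_\theta (s^\prime)\right)\]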
Assume for simplicity that each data point is generated by (1) sampling \(s\) i.i.d. from some exploratory distribution, and (2) generating \(r\) and \(s^\prime\) conditioned on \((s, \pi(s))\)
Replacing the empirical objective with its population version, the objective becomes \(\mathbb{E}_{s, r, s^\prime} \left[(V_\theta (s) - r - \gamma V_\theta (s^\prime))^2\right]\)