Tabular RL for Value Prediction

The Value Prediction Problem
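For concreteness, the quantity being estimated is the discounted value of the evaluated policy \(\pi\); this is the standard definition, with indexing chosen to match the trajectories \(s_1, a_1, r_1, s_2, \ldots\) used below:

\[ V^\pi(s) = \mathbb{E}\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\middle|\; s_1 = s,\; a_t \sim \pi(\cdot \mid s_t) \right]. \]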


Monte-Carlo Value Prediction


Implementing MC in an online manner


\[\begin{align*} \text{For } & i = 1, 2, \ldots \\ & \text{Draw a starting state } s_i \text{ from the exploratory initial distribution,} \\ & \text{roll out a trajectory using } \pi \text{ from } s_i \text{, and let } G_i \text{ be the (random)} \\ & \text{discounted return.} \\ & \text{Let } n(s_i) \text{ be the number of times } s_i \text{ has appeared as an initial} \\ & \text{state. If $n(s_i) = 1$ (first time seeing this state), let $V(s_i) \gets G_i$.} \\ & \text{Otherwise, $V(s_i) \gets \frac{n(s_i) - 1}{n(s_i)} V(s_i) + \frac{1}{n(s_i)}G_i$.} \\ & \text{Verify: at any point, $V(s)$ is always the MC estimate computed from the} \\ & \text{trajectories starting from $s$ seen so far.} \end{align*}\]
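A minimal sketch of this online procedure in Python. The environment interface (`env.reset`, `env.step`), the `policy` callable, and the `horizon` cutoff are illustrative assumptions; only the running-average update mirrors the pseudocode above.

```python
from collections import defaultdict

def mc_value_prediction(env, policy, gamma, num_episodes, horizon=1000):
    """Online Monte-Carlo value prediction for a fixed policy.

    `env` is assumed to expose reset() -> state and step(action) ->
    (next_state, reward, done); `policy(state)` returns an action.
    These interfaces are illustrative, not part of the notes.
    """
    V = defaultdict(float)  # running MC estimate per starting state
    n = defaultdict(int)    # number of trajectories started from each state

    for _ in range(num_episodes):
        # Draw a starting state from the exploratory initial distribution.
        s0 = env.reset()
        state = s0

        # Roll out a trajectory with pi and accumulate the discounted return G_i.
        G, discount = 0.0, 1.0
        for _ in range(horizon):
            state, reward, done = env.step(policy(state))
            G += discount * reward
            discount *= gamma
            if done:
                break

        # Running-average update: V(s0) <- (n-1)/n * V(s0) + 1/n * G,
        # so V(s0) is always the mean of the returns observed from s0 so far.
        n[s0] += 1
        V[s0] = (n[s0] - 1) / n[s0] * V[s0] + G / n[s0]

    return V
```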

\[ \min_v \frac{1}{2n} \sum_{i=1}^n (v - v_i)^2\]
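One way to see the connection to the running-average update: this objective is minimized by the sample mean, so keeping \(V(s)\) equal to the average of the observed returns is exactly solving the least-squares problem. Setting the derivative to zero,

\[ \frac{d}{dv}\,\frac{1}{2n}\sum_{i=1}^n (v - v_i)^2 = \frac{1}{n}\sum_{i=1}^n (v - v_i) = 0 \quad\Longrightarrow\quad v^\star = \frac{1}{n}\sum_{i=1}^n v_i. \]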

Every-visit Monte-Carlo


\[ s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ... \]

Every-visit Monte-Carlo
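A sketch of the every-visit variant: every occurrence of a state inside a trajectory contributes the return from that time step onward, not just occurrences as the starting state. The trajectory format follows the \((s_t, a_t, r_t)\) sequence above; the dict-based API and the toy trajectory are illustrative assumptions.

```python
from collections import defaultdict

def every_visit_mc_update(trajectory, V, n, gamma):
    """Every-visit Monte-Carlo update from one trajectory.

    `trajectory` is a list of (state, action, reward) tuples, i.e. the
    (s_1, a_1, r_1), (s_2, a_2, r_2), ... sequence above. `V` and `n`
    hold the current estimates and visit counts (illustrative API).
    """
    G = 0.0
    # Walk the trajectory backwards so the return from each time step
    # can be accumulated in one pass: G_t = r_t + gamma * G_{t+1}.
    for state, _action, reward in reversed(trajectory):
        G = reward + gamma * G
        # Every visit of `state` contributes its own return to the average.
        n[state] += 1
        V[state] += (G - V[state]) / n[state]
    return V, n

if __name__ == "__main__":
    # Tiny hypothetical trajectory: s1 -> s2 -> s1 -> terminal.
    traj = [("s1", "a", 1.0), ("s2", "a", 0.0), ("s1", "a", 2.0)]
    V, n = defaultdict(float), defaultdict(int)
    every_visit_mc_update(traj, V, n, gamma=0.9)
    print(dict(V))  # "s1" is averaged over both of its visits
```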


Alternative Approach: TD(0)


\[ s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ...\]

in a continuing task
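A sketch of TD(0) applied to such a stream of transitions: each observed \((s_t, r_t, s_{t+1})\) triggers a bootstrapped update toward \(r_t + \gamma V(s_{t+1})\), so no complete return is needed and the method works even in a continuing task. The transition format and the constant step size `alpha` are illustrative assumptions.

```python
from collections import defaultdict

def td0_value_prediction(transitions, gamma, alpha):
    """TD(0) on a stream of (state, reward, next_state) transitions
    generated by the evaluated policy.

    The transition format and the constant step size `alpha` are
    illustrative choices, not fixed by the notes.
    """
    V = defaultdict(float)
    for state, reward, next_state in transitions:
        # Bootstrapped target: one observed reward plus the current
        # estimate of the next state's value.
        td_target = reward + gamma * V[next_state]
        td_error = td_target - V[state]
        V[state] += alpha * td_error
    return V
```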

Understanding TD(0)
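A standard observation behind the update: \(V^\pi\) satisfies the Bellman equation for \(\pi\), so the expected TD(0) error vanishes exactly when the estimate equals \(V^\pi\):

\[ \mathbb{E}\big[\, r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \;\big|\; s_t = s \,\big] = 0 \quad \text{for all } s. \]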


TD(0) vs MC


\(\text{TD}(\lambda)\): Unifying TD(0) and MC


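The unification can be summarized by the \(\lambda\)-return, a geometrically weighted average of \(n\)-step returns (standard definition; the notation here may differ from the original slide). With \(\lambda = 0\) the target reduces to the one-step TD(0) target, and as \(\lambda \to 1\) it approaches the full Monte-Carlo return:

\[ G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}), \qquad G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}. \]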