Tabular RL for Value Prediction

The Value Prediction Problem


Monte-Carlo Value Prediction


Implementing MC in an online manner


For i = 1, 2, ...:
- Draw a starting state s_i from the exploratory initial distribution, roll out a trajectory using π from s_i, and let G_i be the (random) discounted return.
- Let n(s_i) be the number of times s_i has appeared as an initial state.
- If n(s_i) = 1 (first time seeing this state), let V(s_i) ← G_i.
- Otherwise, V(s_i) ← ((n(s_i) − 1) / n(s_i)) · V(s_i) + (1 / n(s_i)) · G_i.

Verify: at any point, V(s) is always the MC estimate using the trajectories starting from s available so far.
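A minimal sketch of this incremental-average procedure. The `sample_start` and `rollout` callables are assumed interfaces (not part of the original text): one draws a start state from the exploratory initial distribution, the other rolls out π and returns the discounted return G.

```python
from collections import defaultdict

def online_mc(sample_start, rollout, num_iters):
    """Online Monte-Carlo value prediction via incremental averaging.

    sample_start: draws a start state from the exploratory distribution.
    rollout:      rolls out pi from a state and returns the discounted return G.
    (Both are assumed interfaces, introduced here for illustration.)
    """
    V = {}                   # current MC estimate per start state
    n = defaultdict(int)     # times each state has appeared as an initial state
    for _ in range(num_iters):
        s = sample_start()
        G = rollout(s)
        n[s] += 1
        if n[s] == 1:        # first time seeing this state
            V[s] = G
        else:                # V <- (n-1)/n * V + 1/n * G, written incrementally
            V[s] += (G - V[s]) / n[s]
    return V
```

The incremental form `V += (G - V)/n` is algebraically identical to the weighted update on the slide, so V(s) is always the average of the returns observed from s so far.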

The running average is exactly the solution of the least-squares problem

min_v (1 / 2n) Σ_{i=1}^{n} (v − v_i)²

whose minimizer is the sample mean of v_1, ..., v_n.
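A quick numeric check (a sketch, not from the original) that the sample mean minimizes this objective:

```python
# Check that the sample mean minimizes (1/2n) * sum_i (v - v_i)^2.
vs = [2.0, 4.0, 9.0]
mean = sum(vs) / len(vs)

def objective(v, samples):
    n = len(samples)
    return sum((v - vi) ** 2 for vi in samples) / (2 * n)

# The mean achieves a lower objective than nearby perturbations of it.
assert all(objective(mean, vs) <= objective(mean + d, vs)
           for d in (-1.0, -0.1, 0.1, 1.0))
```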

Every-visit Monte-Carlo


s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ...

Every-visit Monte-Carlo
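A sketch of an every-visit Monte-Carlo update on one episode. The episode format (a list of (state, reward) pairs) and the function name are illustrative assumptions; the point is that V(s) is updated at every occurrence of s, not only the first.

```python
from collections import defaultdict

def every_visit_mc_update(episode, gamma, V, counts):
    """Every-visit MC: update V at *every* occurrence of a state.

    episode: list of (s_t, r_t) pairs (illustrative format).
    V, counts: running estimates and per-state visit counts, updated in place.
    """
    G = 0.0
    # Walk backwards so G always holds the discounted return from this step on.
    for s, r in reversed(episode):
        G = r + gamma * G
        counts[s] += 1
        V[s] += (G - V[s]) / counts[s]   # incremental average over all visits
    return V
```

Walking the episode backwards lets each step reuse the return already accumulated for the steps after it, instead of recomputing a discounted sum per visit.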


Alternative Approach: TD(0)


s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, ...

in a continuing task
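A minimal TD(0) sketch, updating online from a single transition (s, r, s'); the step size `alpha` is an illustrative parameter. Because TD(0) bootstraps off V(s') instead of waiting for a full return, it applies directly to continuing tasks where episodes never terminate.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma*V(s')."""
    td_target = r + gamma * V.get(s_next, 0.0)   # bootstrap from current estimate
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```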

Understanding TD(0)


TD(0) vs MC


TD(λ): Unifying TD(0) and MC
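TD(λ) interpolates between the two extremes: λ = 0 recovers TD(0), and λ → 1 approaches Monte-Carlo. A sketch of one update using accumulating eligibility traces (the trace variant and all names here are assumptions, not from the original text):

```python
from collections import defaultdict

def td_lambda_update(V, z, s, r, s_next, alpha, gamma, lam):
    """One TD(lambda) step with accumulating eligibility traces z.

    lam = 0 recovers TD(0); lam -> 1 approaches Monte-Carlo.
    """
    delta = r + gamma * V[s_next] - V[s]   # the usual TD error
    z[s] += 1.0                            # bump the trace for the visited state
    for state in list(z):
        V[state] += alpha * delta * z[state]   # credit all recently visited states
        z[state] *= gamma * lam                # decay every trace
    return V, z
```

With λ = 0 every trace dies immediately after its update, so only V(s) moves and the step coincides with TD(0).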