5 questions to test your understanding
An agent in state s takes an action, receives reward r = -5, and transitions to state s'. The current value estimates are V(s) = 10 and V(s') = 8, with discount γ = 0.9 and learning rate α = 0.1. What is the TD(0) update to V(s)?
Compared to Monte Carlo methods, TD(0) has lower variance but higher bias in its value estimates. What is the source of this bias?
True or false: TD learning can update value estimates after every single step, without waiting for an episode to complete.
True or false: TD(λ) with λ = 1 is equivalent to TD(0) because eligibility traces decay to zero after one step when λ = 1.
What is bootstrapping in the context of TD learning, and why does it allow TD methods to learn online while also introducing bias?
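
To check your arithmetic on the first question, the TD(0) update can be sketched as a small function. This is a minimal illustration, not taken from the source: the function name td0_update and its parameter names are hypothetical, and the numbers in the usage lines are deliberately different from those in the question so the answer is not given away.

```python
def td0_update(v_s, v_next, reward, gamma, alpha):
    """One tabular TD(0) update for a single observed transition.

    Hypothetical helper for illustration only.
    Returns the new estimate of V(s) and the TD error.
    """
    # Bootstrapped target: the observed reward plus the discounted
    # *current estimate* of the next state's value.
    td_target = reward + gamma * v_next
    # TD error: how far the current estimate is from that target.
    td_error = td_target - v_s
    # Move V(s) a fraction alpha of the way toward the target.
    return v_s + alpha * td_error, td_error


# Example with made-up numbers (not the ones from the question).
new_v, delta = td0_update(v_s=0.0, v_next=1.0, reward=1.0, gamma=0.5, alpha=0.5)
print(new_v, delta)
```

The target uses the estimate V(s') rather than the full observed return. That is what lets the update run after a single step, and it is also where the bias asked about in the questions above comes from.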