Questions: Temporal Difference Learning

5 questions to test your understanding

Question 1 Multiple Choice

An agent in state s takes an action, receives reward r = -5, and transitions to state s'. The current value estimates are V(s) = 10 and V(s') = 8, with discount γ = 0.9 and learning rate α = 0.1. What is the TD(0) update to V(s)?

A. V(s) ← 10 + 0.1 × (−5 + 0.9 × 8 − 10) = 10 + 0.1 × (−7.8) = 9.22

B. V(s) ← 10 + 0.1 × (−5 − 10) = 8.5

C. V(s) ← −5 + 0.9 × 8 = 2.2

D. V(s) ← 10 − 0.1 × 8 = 9.2

Question 2 Multiple Choice

Compared to Monte Carlo methods, TD(0) has lower variance but higher bias in its value estimates. What is the source of this bias?

A. TD(0) uses a smaller learning rate, which causes estimates to converge to a slightly different value

B. TD(0) updates use V(s'), itself just an estimate, rather than a true observed return, so errors in V(s') propagate into the update for V(s)

C. TD(0) discounts future rewards with γ < 1, which systematically undervalues long-horizon states

D. TD(0) only updates the immediately visited state, missing contributions from earlier states in the trajectory

Question 3 True / False

TD learning can update value estimates after every single step, without waiting for an episode to complete.

T. True

F. False

Question 4 True / False

TD(λ) with λ = 1 is equivalent to TD(0) because eligibility traces decay to zero after one step when λ = 1.

T. True

F. False

Question 5 Short Answer

What is bootstrapping in the context of TD learning, and why does it allow TD methods to learn online while also introducing bias?

Think about your answer before checking it.
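
For checking Question 1 after attempting it, here is a minimal Python sketch of a single TD(0) update. The function name td0_update and its parameter names are illustrative assumptions, not part of the quiz. The values passed in are the ones given in Question 1.

```python
def td0_update(v_s, v_next, reward, gamma, alpha):
    """One TD(0) step: move V(s) toward the target r + gamma * V(s')."""
    # The target bootstraps from V(s'), the current estimate of the next
    # state's value, instead of waiting for a full observed return.
    td_target = reward + gamma * v_next
    td_error = td_target - v_s
    return v_s + alpha * td_error


# Values from Question 1: V(s) = 10, V(s') = 8, r = -5, gamma = 0.9, alpha = 0.1
new_v = td0_update(v_s=10.0, v_next=8.0, reward=-5.0, gamma=0.9, alpha=0.1)
print(round(new_v, 2))
```

The update needs only the current transition (s, r, s'), so it can be applied after every step. The same dependence on V(s') is what carries any error in that estimate into V(s).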