Question 1 Multiple Choice
An agent using REINFORCE takes a mediocre action in a given state, but the rest of the episode happens to go very well (due to luck), resulting in a high return. What does the unmodified REINFORCE algorithm do, and why is this a problem?
A. It correctly ignores the high return because it knows the initial action was mediocre
B. It strongly reinforces the mediocre action because it scales the gradient update by the high return, regardless of whether the action deserved credit
C. It skips the update for that episode because the variance is too high
D. It penalizes the action because the return was unusually high compared to the baseline
Answer: B
REINFORCE scales the log-probability gradient by the raw episode return Gₜ. It has no way to distinguish whether the high return resulted from the action taken or from lucky subsequent events — it simply reinforces whatever action was taken, in proportion to the total return. This is the variance problem: returns from individual episodes are noisy estimators of an action's true value. A mediocre action that happens to precede a lucky sequence gets reinforced as if it were excellent, slowing and destabilizing learning. A baseline that estimates the expected return from that state provides a reference point so only actions that beat expectations get reinforced.
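A minimal sketch of this effect, assuming a tabular softmax policy over three hypothetical actions in a single state. The function name `reinforce_step`, the learning rate, and the return value of 10 are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(logits, action, G, lr=0.1):
    """One unmodified REINFORCE update for a softmax policy in a single state.

    Hypothetical helper: the gradient of log pi(action) with respect to the
    logits is onehot(action) - probs, and the update scales it by the raw return G.
    """
    probs = softmax(logits)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    return logits + lr * G * grad_log_pi

logits = np.zeros(3)      # uniform policy over 3 hypothetical actions
mediocre_action = 1
lucky_return = 10.0       # assumed: high return caused by later luck, not by the action

before = softmax(logits)[mediocre_action]
after = softmax(reinforce_step(logits, mediocre_action, lucky_return))[mediocre_action]
print(f"pi(mediocre action) before: {before:.3f}, after: {after:.3f}")
```

The update never asks whether the action earned the return; the probability of the action that was taken rises simply because G is large and positive.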
Question 2 Multiple Choice
In REINFORCE with a baseline, the policy gradient is scaled by (Gₜ − b(sₜ)) instead of Gₜ. What does this change accomplish, and what property makes it mathematically valid?
A. It biases the gradient toward higher returns, making the algorithm converge faster at the cost of accuracy
B. It eliminates variance entirely by normalizing all returns to zero mean
C. It reduces variance while keeping the gradient estimate unbiased, because the expected value of the baseline term ∇θ log π(a|s; θ) × b(s) over actions is zero
D. It converts the policy gradient into a value-based update, making the algorithm equivalent to Q-learning
Answer: C
The mathematical key is that Eₐ[∇θ log π(a|s; θ) × b(s)] = b(s) × Σₐ π(a|s; θ) ∇θ log π(a|s; θ) = b(s) × Σₐ ∇θ π(a|s; θ) = b(s) × ∇θ Σₐ π(a|s; θ) = b(s) × ∇θ 1 = 0, because the action probabilities sum to one for every θ (b(s) factors out of the expectation because it does not depend on the action). This means subtracting any state-dependent baseline leaves the expected gradient unchanged, so the estimate is still unbiased. Individual samples, however, now score each action relative to b(s), which can reduce the variance of the estimate substantially when b(s) is close to the expected return from that state. The quantity Gₜ − b(sₜ) serves as an estimate of the 'advantage': positive when this action was better than expected, negative when worse. It is typically a much lower-variance signal than the raw return.
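The zero-expectation step can be written out line by line. The derivation below assumes a discrete action space; for continuous actions the sum becomes an integral and the argument is otherwise the same.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi}\big[\nabla_\theta \log \pi(a \mid s; \theta)\, b(s)\big]
&= b(s) \sum_a \pi(a \mid s; \theta)\, \nabla_\theta \log \pi(a \mid s; \theta) \\
&= b(s) \sum_a \nabla_\theta \pi(a \mid s; \theta) \\
&= b(s)\, \nabla_\theta \sum_a \pi(a \mid s; \theta) \\
&= b(s)\, \nabla_\theta 1 \\
&= 0
\end{aligned}
```

The second line uses the identity π ∇θ log π = ∇θ π, and the fourth uses the fact that the probabilities sum to one regardless of θ.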
Question 3 True / False
Subtracting a state-dependent baseline from the return in REINFORCE introduces bias into the policy gradient estimate.
T. True
F. False
Answer: False
This is the central mathematical insight of the baseline technique. A state-dependent baseline b(sₜ) does not bias the gradient because b(sₜ) does not depend on the action taken, so in expectation over actions the baseline term is zero. The expected gradient with the baseline is identical to the expected gradient without it. What changes is the variance: with a well-chosen baseline, such as an estimate of the expected return from that state, individual gradient estimates become much less noisy because actions are scored relative to what was expected in that state, not by their absolute returns. An unbiased estimator with lower variance generally gives faster, more stable learning.
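Both claims can be checked exactly for a toy case. The sketch below assumes one state, a softmax policy over three hypothetical actions, and a fixed return per action; all numbers are invented for illustration. It enumerates every action and does not sample, so the mean and variance are exact.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.3])        # assumed pi(a|s) for 3 hypothetical actions
returns = np.array([10.0, 12.0, 11.0])   # assumed deterministic return for each action

# Score function with respect to the softmax logits:
# row a holds grad log pi(a|s) = onehot(a) - probs.
scores = np.eye(3) - probs

def mean_and_variance(baseline):
    """Exact mean and total variance of the single-sample gradient estimate."""
    samples = scores * (returns - baseline)[:, None]   # one gradient sample per action
    mean = probs @ samples                             # expectation under the policy
    var = probs @ ((samples - mean) ** 2)              # per-component variance
    return mean, var.sum()

b = probs @ returns   # baseline set to the expected return in this state
mean_raw, var_raw = mean_and_variance(0.0)
mean_base, var_base = mean_and_variance(b)

print("same expected gradient:", np.allclose(mean_raw, mean_base))
print("variance without baseline:", var_raw)
print("variance with baseline:", var_base)
```

In this toy setup the expected gradient is identical in both cases, while the variance drops sharply once the baseline is subtracted. A poorly chosen baseline would still be unbiased but would not necessarily reduce variance.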
Question 4 True / False
Policy networks are better suited than value-based methods (like Q-learning) for tasks with continuous action spaces.
T. True
F. False
Answer: True
Value-based methods like Q-learning require computing or approximating Q(s, a) for all actions, then selecting the action that maximizes this value. In continuous action spaces, that maximization over infinitely many actions is generally intractable. Policy networks sidestep this by directly outputting a probability distribution over actions — for instance, the mean and variance of a Gaussian for continuous control — and can be trained purely through gradient ascent on expected return. This makes policy-based methods the natural choice for robotics, locomotion, and any domain where the action space cannot be discretized without losing important precision.
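As a rough illustration of the Gaussian case, the sketch below samples a continuous action and computes its log-probability, which is the quantity a policy gradient update differentiates. The function name `gaussian_policy_sample` is hypothetical, and the `mean` and `log_std` arrays stand in for the outputs of a policy network that is not shown.

```python
import numpy as np

def gaussian_policy_sample(mean, log_std, rng):
    """Sample from a diagonal Gaussian policy and return (action, log_prob).

    Hypothetical helper: `mean` and `log_std` are assumed to be the outputs
    of a policy network for the current state.
    """
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(mean.shape)
    # Log-density of a diagonal Gaussian, summed over action dimensions.
    z = (action - mean) / std
    log_prob = -0.5 * np.sum(z ** 2 + 2.0 * log_std + np.log(2.0 * np.pi))
    return action, log_prob

rng = np.random.default_rng(0)
mean = np.array([0.1, -0.3])       # stand-in for a network's mean output (2-D action)
log_std = np.array([-1.0, -1.0])   # stand-in for a network's log standard deviation
action, log_prob = gaussian_policy_sample(mean, log_std, rng)
print("action:", action, "log_prob:", log_prob)
```

No maximization over actions is needed at any point: acting is a single sample, and learning only requires the gradient of `log_prob` with respect to the network parameters.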
Question 5 Short Answer
Why does the REINFORCE algorithm suffer from high variance, and how does introducing an advantage function (return minus baseline) address this problem?
Model answer: REINFORCE is high-variance because each gradient update uses the full episode return, which conflates the quality of the specific action taken with all subsequent luck and chance outcomes. A single episode's return is a noisy sample of the true expected return, and this noise directly scales the gradient update. The advantage function (Gₜ − b(sₜ)) addresses this by comparing the actual return to an estimate of what was expected from that state. If b(sₜ) ≈ V(sₜ), then the advantage is near zero on average and varies only with whether the specific action was better or worse than average — a much lower-variance signal that isolates the true contribution of the action.
The key insight is that raw returns measure everything that happened in an episode, while advantages measure only whether this particular action made things better or worse than expected. The latter is what you actually want to learn from, and it can be estimated with far less noise. This is why virtually every practical policy gradient method uses some form of advantage estimation rather than raw returns.
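A small sketch of how the two signals differ in practice, assuming a three-step episode with made-up rewards and a hypothetical baseline array standing in for b(sₜ), for example the output of a learned value function. The discount factor `gamma` is an added assumption and defaults to 1, which gives plain undiscounted returns.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Compute G_t for every step as the (optionally discounted) sum of rewards from t onward."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [1.0, 0.0, 2.0]                # assumed rewards for one short episode
baseline = np.array([2.5, 2.0, 1.5])     # hypothetical b(s_t) for each visited state

G = returns_to_go(rewards)
advantages = G - baseline                # the signal that scales each log-prob gradient

print("returns G_t:      ", G)
print("advantages G_t - b:", advantages)
```

Here every raw return is positive, so unmodified REINFORCE would reinforce all three actions, whereas the advantages single out only the steps where the outcome beat the baseline.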