An actor-critic agent in state s_t takes action a_t and receives a total return G_t = 10. The critic's value estimate for that state is V(s_t) = 9.5. How does the actor update its policy?
A. It strongly increases the probability of a_t, because G_t = 10 is a large positive return
B. It slightly increases the probability of a_t, because the advantage A_t = G_t − V(s_t) = 0.5 is small but positive
C. It decreases the probability of a_t, because V(s_t) = 9.5 indicates the state is already high-value and this action underperformed
D. It does not update, because a_t produced a return above the value estimate and no correction is needed
Answer: B
The advantage A_t = G_t − V(s_t) = 10 − 9.5 = 0.5 measures how much *better than expected* the action performed. A small positive advantage yields a small upward adjustment to the action's probability. The key insight is that the update is relative to the baseline, not absolute: a return of 10 in a state that typically yields 9.5 is only marginally good, not exceptional. Without the baseline (raw REINFORCE), the update would be scaled by the raw return of 10, producing a much larger and noisier step regardless of context; subtracting the baseline reduces that variance.
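The difference between the two update sizes can be made concrete with a small sketch. It assumes a softmax policy over three actions with logits as the only parameters, a learning rate of 0.1, and a uniform starting policy; none of these details come from the question itself, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, action, weight, lr=0.1):
    """One gradient-ascent step on weight * log pi(action) w.r.t. the logits."""
    probs = softmax(logits)
    # For a softmax policy: d log pi(action) / d logit_k = 1[k == action] - pi(k)
    grad = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return [z + lr * weight * g for z, g in zip(logits, grad)]

# Assumed setup: three actions, uniform initial policy, action 0 was taken.
logits = [0.0, 0.0, 0.0]
action = 0
G_t, V_s = 10.0, 9.5
advantage = G_t - V_s  # 0.5

p_before = softmax(logits)[action]
p_advantage = softmax(policy_gradient_step(logits, action, advantage))[action]
p_raw_return = softmax(policy_gradient_step(logits, action, G_t))[action]

print(f"before:            {p_before:.3f}")
print(f"advantage-weighted: {p_advantage:.3f}")
print(f"raw-return-weighted: {p_raw_return:.3f}")
```

Under these assumptions the advantage-weighted step moves the probability of a_t from about 0.333 to about 0.345, while weighting by the raw return would move it to about 0.576 in a single step.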
Question 2 Multiple Choice
Why are policy gradient methods generally preferred over value-based methods like Q-learning for tasks with continuous action spaces?
A. Policy gradient methods are guaranteed to converge to the globally optimal policy, while Q-learning may converge to suboptimal policies
B. Value-based methods require enumerating all possible actions to select the maximum Q-value, which is infeasible when actions are continuous
C. Policy gradient methods do not require a reward signal, making them more versatile
D. Q-learning cannot handle stochastic environments, while policy gradients can
Answer: B
In Q-learning, the greedy policy requires argmax_a Q(s,a), that is, finding the action that maximizes Q. When actions are discrete and finite, you enumerate them. When actions are real-valued (e.g., continuous torques for a robotic arm), enumeration is impossible, and even running an optimizer over the action space at every step is expensive. Policy gradient methods sidestep this entirely: a parameterized policy directly outputs an action or a distribution over actions, and gradient ascent updates its parameters. Continuous action spaces are handled naturally, for example with a Gaussian policy.
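The contrast can be sketched in a few lines. The functions below are illustrative only: `greedy_discrete` stands for the enumeration step of a value-based method, and the Gaussian policy is assumed to be one-dimensional with a mean and log standard deviation supplied directly rather than produced by a network.

```python
import math
import random

def greedy_discrete(q_values):
    """Value-based action selection: enumerate every action and take the argmax."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def gaussian_policy_sample(mean, log_std, rng):
    """Policy-based action selection: draw a real-valued action, no enumeration."""
    return rng.gauss(mean, math.exp(log_std))

def gaussian_log_prob(action, mean, log_std):
    """log pi(action) for a 1-D Gaussian policy; this is what gets differentiated."""
    std = math.exp(log_std)
    return -0.5 * ((action - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)

# Discrete case: three actions with assumed Q-values.
best_action = greedy_discrete([0.1, 0.7, 0.3])

# Continuous case: e.g. a torque sampled around an assumed mean of 0.2.
rng = random.Random(0)
torque = gaussian_policy_sample(mean=0.2, log_std=-1.0, rng=rng)
log_prob = gaussian_log_prob(torque, mean=0.2, log_std=-1.0)
```

In a real agent the mean and log standard deviation would be functions of the state, and the policy gradient would flow through `gaussian_log_prob` into their parameters.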
Question 3 True / False
REINFORCE is considered a biased gradient estimator because the return G_t is computed from a single sampled trajectory rather than the true expected return.
T. True
F. False
Answer: False
REINFORCE is actually *unbiased*: in expectation, the gradient estimate ∇_θ log π_θ(a_t|s_t) · G_t (summed over the time steps of a trajectory) equals the true policy gradient ∇_θ J(θ). The problem with REINFORCE is not bias but *high variance*: G_t depends on everything that happens after time t, and a single trajectory is a noisy sample of the expected return. This variance makes learning slow and unstable. The actor-critic remedy, subtracting a value baseline, reduces variance without introducing bias, because the expected value of a state-dependent baseline multiplied by the log-gradient is zero.
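Both claims can be checked numerically on a toy problem. The sketch below assumes a single state, two actions with mean returns 9 and 10, unit-variance Gaussian noise on the return, and a fixed uniform softmax policy; these numbers are illustrative and not taken from the question. It estimates the gradient with respect to the logit of action 0, with and without a baseline equal to the state's expected return.

```python
import random
import statistics

rng = random.Random(0)

pi = [0.5, 0.5]                # fixed softmax policy (assumed)
mean_return = [9.0, 10.0]      # assumed mean return of each action
expected_return = sum(p * m for p, m in zip(pi, mean_return))  # V(s) = 9.5

# Analytic gradient of J w.r.t. the logit of action 0 for a softmax policy:
# dJ/dz_0 = pi_0 * (mean_return_0 - J)
true_grad = pi[0] * (mean_return[0] - expected_return)

raw, centered = [], []
for _ in range(20000):
    action = 0 if rng.random() < pi[0] else 1
    G = rng.gauss(mean_return[action], 1.0)
    # Score function: d log pi(action) / d z_0 = 1[action == 0] - pi_0
    score = (1.0 if action == 0 else 0.0) - pi[0]
    raw.append(score * G)                            # REINFORCE sample
    centered.append(score * (G - expected_return))   # with value baseline

mean_raw, var_raw = statistics.fmean(raw), statistics.pvariance(raw)
mean_centered, var_centered = statistics.fmean(centered), statistics.pvariance(centered)

print(f"true gradient:    {true_grad:.3f}")
print(f"raw estimate:     mean {mean_raw:.3f}, variance {var_raw:.2f}")
print(f"with baseline:    mean {mean_centered:.3f}, variance {var_centered:.2f}")
```

Both sample means should land near the analytic value of -0.25, while the variance of the baseline-corrected samples should be far smaller than that of the raw samples.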
Question 4 True / False
Subtracting a learned value baseline V(s_t) from the return G_t in a policy gradient update reduces variance in the gradient estimate without changing the expected (average) direction of the update.
T. True
F. False
Answer: True
This is a key theoretical property of baselines. The expected value of ∇_θ log π_θ(a_t|s_t) · b(s_t) is zero for any function b that depends only on the state (not the action), because E[∇_θ log π_θ(a|s)] = 0 by the log-derivative trick and normalization of the policy. Therefore, subtracting b(s_t) = V(s_t) from G_t leaves the expected gradient unchanged — no bias is introduced. But the variance is reduced because the advantage A_t = G_t − V(s_t) has smaller fluctuations than G_t alone: the baseline absorbs the 'background level' of return, leaving only the surprising component.
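The zero-expectation step can be written out explicitly. The derivation below assumes a discrete action set for readability; for continuous actions the sum becomes an integral over the action density.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
   && \text{(since } \pi \nabla \log \pi = \nabla \pi \text{)} \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta 1 \;=\; 0 .
\end{aligned}
```

The baseline b(s) can be pulled out of the sum only because it does not depend on the action a, which is why the result holds for V(s_t) but not for an action-dependent quantity.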
Question 5 Short Answer
Explain in your own words what the advantage A_t = G_t − V(s_t) measures, and why using it instead of the raw return G_t makes policy gradient updates more informative and stable.
Model answer: The advantage measures how much better (or worse) the actual return from action a_t was compared to what the agent would typically expect from state s_t. A positive advantage means a_t led to a better-than-average outcome; a negative advantage means it led to a worse-than-average outcome. Using raw G_t is noisy because even mediocre actions get large positive updates in high-reward environments. The advantage centers the signal: an action that achieves the expected return gets nearly zero update, while only surprisingly good or bad actions produce strong updates. This reduces variance in the gradient estimate, leading to more stable and efficient learning — the policy changes meaningfully only when an action is genuinely above or below expectations.
The advantage concept also reveals why the actor-critic architecture is powerful: the critic learns the value function V(s) from experience, essentially building a model of 'what's normal' for each state, which the actor then uses to calibrate whether its actions are exceptional. The separation of policy (actor) from value function (critic), two function approximators with different objectives that may be separate neural networks or two output heads on a shared network, is central to many modern deep RL algorithms, including PPO and A3C.
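The interplay between actor and critic can be sketched on a deliberately tiny problem. The example assumes a single state, two actions with mean returns 9 and 10 and unit-variance noise, a tabular critic (one number), and a softmax actor over two logits; the function name, learning rates, and step count are hypothetical choices, not part of any particular algorithm such as PPO or A3C.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor_critic(steps=5000, actor_lr=0.05, critic_lr=0.1, seed=0):
    """Toy single-state actor-critic with Monte Carlo returns (illustrative)."""
    rng = random.Random(seed)
    mean_return = [9.0, 10.0]  # assumed: action 1 is better on average
    logits = [0.0, 0.0]        # actor parameters
    value = 0.0                # critic's estimate V(s) for the single state

    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        G = rng.gauss(mean_return[action], 1.0)  # sampled return

        # The critic tells the actor what is "normal" for this state.
        advantage = G - value

        # Actor update: gradient ascent on advantage * log pi(action).
        for k in range(2):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += actor_lr * advantage * grad_log_pi

        # Critic update: move V(s) toward the observed return.
        value += critic_lr * (G - value)

    return softmax(logits), value

probs, value = train_actor_critic()
print(f"P(better action) = {probs[1]:.3f}, V(s) = {value:.2f}")
```

The two learners have different objectives: the critic minimizes its prediction error on returns, while the actor only ever sees the critic's output through the advantage. Here they are kept as fully separate parameters for clarity.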