Questions: Policy Gradient Methods

5 questions to test your understanding

Question 1 Multiple Choice

An actor-critic agent in state s_t takes action a_t and receives a total return G_t = 10. The critic's value estimate for that state is V(s_t) = 9.5. How does the actor update its policy?

A. It strongly increases the probability of a_t, because G_t = 10 is a large positive return
B. It slightly increases the probability of a_t, because the advantage A_t = G_t − V(s_t) = 0.5 is small but positive
C. It decreases the probability of a_t, because V(s_t) = 9.5 indicates the state is already high-value and this action underperformed
D. It does not update, because a_t produced a return above the value estimate and no correction is needed

Question 2 Multiple Choice

Why are policy gradient methods generally preferred over value-based methods like Q-learning for tasks with continuous action spaces?

A. Policy gradient methods are guaranteed to converge to the globally optimal policy, while Q-learning may converge to suboptimal policies
B. Value-based methods require enumerating all possible actions to select the maximum Q-value, which is infeasible when actions are continuous
C. Policy gradient methods do not require a reward signal, making them more versatile
D. Q-learning cannot handle stochastic environments, while policy gradients can

Question 3 True / False

REINFORCE is considered a biased gradient estimator because the return G_t is computed from a single sampled trajectory rather than the true expected return.

True
False

Question 4 True / False

Subtracting a learned value baseline V(s_t) from the return G_t in a policy gradient update reduces variance in the gradient estimate without changing the expected (average) direction of the update.

True
False

Question 5 Short Answer

Explain in your own words what the advantage A_t = G_t − V(s_t) measures, and why using it instead of the raw return G_t makes policy gradient updates more informative and stable.

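For Question 5, a small numerical experiment may help you check your reasoning once you have written an answer. The sketch below assumes a hypothetical single-state problem with two actions, a fixed uniform softmax policy, and made-up expected returns of 10 and 9 with unit Gaussian noise. The function name `gradient_samples` and all of these numbers are illustrative assumptions, not part of the quiz. It draws many single-sample policy gradient estimates, once weighted by the raw return G_t and once by the advantage G_t − V(s_t), so that you can compare their means and variances yourself.

```python
import numpy as np


def gradient_samples(n_samples=20000, seed=0):
    """Draw per-sample policy gradient estimates for a toy two-action problem.

    Returns two arrays of shape (n_samples, 2): estimates weighted by the
    raw return, and estimates weighted by the advantage (return - baseline).
    """
    rng = np.random.default_rng(seed)

    # Assumed setup: softmax policy with equal logits, so both actions
    # are chosen with probability 0.5.
    probs = np.array([0.5, 0.5])
    # Hypothetical expected returns for each action.
    mean_return = np.array([10.0, 9.0])
    # Baseline V(s): the expected return under the current policy.
    baseline = probs @ mean_return

    actions = rng.choice(2, size=n_samples, p=probs)
    returns = mean_return[actions] + rng.normal(0.0, 1.0, size=n_samples)

    # Gradient of log softmax(logits)[a] with respect to the logits
    # is one_hot(a) - probs.
    score = np.eye(2)[actions] - probs

    raw = score * returns[:, None]                    # weighted by G_t
    adv = score * (returns - baseline)[:, None]       # weighted by G_t - V(s_t)
    return raw, adv


raw, adv = gradient_samples()
print("mean (raw return):", raw.mean(axis=0))
print("mean (advantage): ", adv.mean(axis=0))
print("var  (raw return):", raw.var(axis=0))
print("var  (advantage): ", adv.var(axis=0))
```

Try changing `mean_return` or the noise scale and watch which of the printed quantities moves. Here the baseline is computed exactly from the assumed returns; in an actor-critic method it would be a learned estimate, so treat this only as an idealized illustration.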