Questions — Policy Gradient Methods

Question 1 Multiple Choice

An actor-critic agent in state s_t takes action a_t and receives a total return G_t = 10. The critic's value estimate for that state is V(s_t) = 9.5. How does the actor update its policy?

A. It strongly increases the probability of a_t, because G_t = 10 is a large positive return

B. It slightly increases the probability of a_t, because the advantage A_t = G_t − V(s_t) = 0.5 is small but positive

C. It decreases the probability of a_t, because V(s_t) = 9.5 indicates the state is already high-value and this action underperformed

D. It does not update, because a_t produced a return above the value estimate and no correction is needed

Question 2 Multiple Choice

Why are policy gradient methods generally preferred over value-based methods like Q-learning for tasks with continuous action spaces?

A. Policy gradient methods are guaranteed to converge to the globally optimal policy, while Q-learning may converge to suboptimal policies

B. Value-based methods require enumerating all possible actions to select the maximum Q-value, which is infeasible when actions are continuous

C. Policy gradient methods do not require a reward signal, making them more versatile

D. Q-learning cannot handle stochastic environments, while policy gradients can

Question 3 True / False

REINFORCE is considered a biased gradient estimator because the return G_t is computed from a single sampled trajectory rather than the true expected return.

T. True

F. False

Question 4 True / False

Subtracting a learned value baseline V(s_t) from the return G_t in a policy gradient update reduces variance in the gradient estimate without changing the expected (average) direction of the update.

T. True

F. False

Question 5 Short Answer

Explain in your own words what the advantage A_t = G_t − V(s_t) measures, and why using it instead of the raw return G_t makes policy gradient updates more informative and stable.
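
For reference while answering, the formula in the question can be evaluated mechanically. The sketch below only does the subtraction and leaves the interpretation to you. The helper name `advantages` and the sample numbers are hypothetical, chosen for illustration, and are not taken from the questions above.

```python
def advantages(returns, values):
    """Compute A_t = G_t - V(s_t) for each time step.

    returns: observed returns G_t, one per time step
    values:  baseline estimates V(s_t), one per time step
    """
    if len(returns) != len(values):
        raise ValueError("returns and values must have the same length")
    return [g - v for g, v in zip(returns, values)]


# Hypothetical sample data (assumed for illustration only).
sample_returns = [4.0, 2.0, 7.0]
sample_values = [3.0, 2.5, 7.0]

print(advantages(sample_returns, sample_values))  # prints [1.0, -0.5, 0.0]
```

Each entry is the gap between what was observed and what the baseline predicted for that state.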

