Questions — Policy Networks and Policy Gradients

Question 1 Multiple Choice

An agent using REINFORCE takes a mediocre action in a given state, but the rest of the episode happens to go very well (due to luck), resulting in a high return. What does the unmodified REINFORCE algorithm do, and why is this a problem?

A. It correctly ignores the high return because it knows the initial action was mediocre

B. It strongly reinforces the mediocre action because it scales the gradient update by the high return, regardless of whether the action deserved credit

C. It skips the update for that episode because the variance is too high

D. It penalizes the action because the return was unusually high compared to the baseline

Question 2 Multiple Choice

In REINFORCE with a baseline, the policy gradient is scaled by (Gₜ − b(sₜ)) instead of Gₜ. What does this change accomplish, and what property makes it mathematically valid?

A. It biases the gradient toward higher returns, making the algorithm converge faster at the cost of accuracy

B. It eliminates variance entirely by normalizing all returns to zero mean

C. It reduces variance while keeping the gradient estimate unbiased, because a baseline that depends only on the state contributes zero to the gradient in expectation over actions

D. It converts the policy gradient into a value-based update, making the algorithm equivalent to Q-learning

Question 3 True / False

Subtracting a state-dependent baseline from the return in REINFORCE introduces bias into the policy gradient estimate.

T. True

F. False

Question 4 True / False

Policy networks are better suited than value-based methods (like Q-learning) for tasks with continuous action spaces.

T. True

F. False

Question 5 Short Answer

Why does the REINFORCE algorithm suffer from high variance, and how does introducing an advantage function (return minus baseline) address this problem?
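For reference, the two quantities named in this question can be computed as in the following minimal sketch. It only illustrates the definitions and is not a full REINFORCE implementation; the function names `discounted_returns` and `advantages`, the discount factor, and the constant baseline values are assumptions chosen for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(returns, baselines):
    """Return minus baseline, G_t - b(s_t), computed element-wise."""
    return np.asarray(returns) - np.asarray(baselines)

# Hypothetical three-step episode with an assumed constant baseline of 1.0.
rewards = [1.0, 0.0, 2.0]
G = discounted_returns(rewards, gamma=0.5)
A = advantages(G, baselines=[1.0, 1.0, 1.0])
print(G)  # returns for each step
print(A)  # advantages for each step
```

In practice the baseline would typically be a learned estimate of the state value rather than a constant; the constant here is only a placeholder.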

