Questions: Policy Networks and Policy Gradients

5 questions to test your understanding

Question 1 Multiple Choice

An agent using REINFORCE takes a mediocre action in a given state, but the rest of the episode happens to go very well (due to luck), resulting in a high return. What does the unmodified REINFORCE algorithm do, and why is this a problem?

A. It correctly ignores the high return because it knows the initial action was mediocre
B. It strongly reinforces the mediocre action because it scales the gradient update by the high return, regardless of whether the action deserved credit
C. It skips the update for that episode because the variance is too high
D. It penalizes the action because the return was unusually high compared to the baseline
Question 2 Multiple Choice

In REINFORCE with a baseline, the policy gradient is scaled by (Gₜ − b(sₜ)) instead of Gₜ. What does this change accomplish, and what property makes it mathematically valid?

A. It biases the gradient toward higher returns, making the algorithm converge faster at the cost of accuracy
B. It eliminates variance entirely by normalizing all returns to zero mean
C. It reduces variance while keeping the gradient estimate unbiased, because the baseline term, b(sₜ) multiplied by the gradient of log π(a | sₜ), has an expected value of zero when averaged over actions
D. It converts the policy gradient into a value-based update, making the algorithm equivalent to Q-learning
Question 3 True / False

Subtracting a state-dependent baseline from the return in REINFORCE introduces bias into the policy gradient estimate.

T. True
F. False
Question 4 True / False

Policy networks are better suited than value-based methods (like Q-learning) for tasks with continuous action spaces.

T. True
F. False
Question 5 Short Answer

Why does the REINFORCE algorithm suffer from high variance, and how does introducing an advantage function (return minus baseline) address this problem?

Think about your answer, then reveal below.
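
One way to check your reasoning for Question 5 is a small numerical experiment. The sketch below is not part of the original quiz. It assumes a hypothetical single-state problem with two actions, a softmax policy over two logits, and made-up reward means of 10 and 11 with unit Gaussian noise. It draws many single-sample REINFORCE gradient estimates, once scaled by the raw return and once scaled by the return minus a constant baseline, and compares their mean and variance.

```python
import numpy as np

def reinforce_gradient_samples(theta, baseline, n_samples=100_000, seed=0):
    """Draw single-sample REINFORCE gradient estimates for a softmax policy.

    Assumed toy setup (hypothetical, for illustration only): one state,
    two actions, rewards drawn around fixed means with unit Gaussian noise.
    """
    rng = np.random.default_rng(seed)

    # Softmax policy over the logits theta.
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()

    # Made-up reward means; both are large, so raw returns are far from zero.
    mean_rewards = np.array([10.0, 11.0])

    actions = rng.choice(len(theta), size=n_samples, p=probs)
    returns = mean_rewards[actions] + rng.normal(0.0, 1.0, size=n_samples)

    # Gradient of log softmax w.r.t. the logits: one_hot(action) - probs.
    score = np.eye(len(theta))[actions] - probs

    # Each row is one gradient estimate: score * (return - baseline).
    return score * (returns - baseline)[:, None]

theta = np.zeros(2)  # uniform policy

# Same seed for both runs, so only the baseline differs.
plain = reinforce_gradient_samples(theta, baseline=0.0)
centered = reinforce_gradient_samples(theta, baseline=10.5)

mean_plain, var_plain = plain[:, 0].mean(), plain[:, 0].var()
mean_centered, var_centered = centered[:, 0].mean(), centered[:, 0].var()

print("mean without baseline:", mean_plain)
print("mean with baseline:   ", mean_centered)
print("variance without baseline:", var_plain)
print("variance with baseline:   ", var_centered)
```

Under these assumptions, the two sample means should agree up to sampling noise, while the variance with the baseline should be much smaller. The baseline value 10.5 is simply the average reward under the uniform policy in this toy setup; in practice the baseline is usually a learned estimate of the state value.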