Questions: Actor-Critic Methods

5 questions to test your understanding

Question 1 Multiple Choice

Why does using the critic's value estimate as a baseline reduce variance in the actor's gradient updates compared to using the full episode return?

A. The critic filters out noisy rewards by averaging them before passing them to the actor
B. The critic's state-value estimate is a learned function of the current state, while the full episode return varies randomly based on all future actions and transitions
C. The critic reduces variance because it uses a larger batch of samples to estimate the gradient
D. The actor's gradient is inherently lower variance because the critic replaces the policy gradient entirely
Question 2 Multiple Choice

An agent is learning to play a video game where episodes last 50,000 steps. Which scenario makes actor-critic most beneficial compared to a pure Monte Carlo policy gradient?

A. The game has a sparse reward (only +1 at victory after 50,000 steps), making it impractical to wait for full episode returns before updating
B. The game has dense rewards every step, so the full episode return is easy to compute and the critic adds unnecessary complexity
C. The action space is discrete, so value-based methods like Q-learning are always preferred
D. The game has a deterministic transition function, eliminating stochasticity in the return
Question 3 True / False

In an actor-critic system, the actor can be updated after every individual time step rather than waiting for a complete episode to end.

T. True
F. False
Question 4 True / False

In an actor-critic architecture, both the actor and the critic are updated using Monte Carlo estimates from complete episode returns.

T. True
F. False
Question 5 Short Answer

What does the 'advantage' signal measure in actor-critic, and why is it used instead of the raw reward to update the actor?
