Questions — Actor-Critic Methods — Open Knowledge Graph

Question 1 Multiple Choice

Why does using the critic's value estimate as a baseline reduce variance in the actor's gradient updates compared to using the full episode return?

A. The critic filters out noisy rewards by averaging them before passing them to the actor

B. The critic's state-value estimate is a learned function of the current state, while the full episode return varies randomly based on all future actions and transitions

C. The critic reduces variance because it uses a larger batch of samples to estimate the gradient

D. The actor's gradient is inherently lower variance because the critic replaces the policy gradient entirely

Question 2 Multiple Choice

An agent is learning to play a video game where episodes last 50,000 steps. Which scenario makes actor-critic most beneficial compared to a pure Monte Carlo policy gradient?

A. The game has a sparse reward (only +1 at victory after 50,000 steps), making it impractical to wait for full episode returns before updating

B. The game has dense rewards every step, so the full episode return is easy to compute and the critic adds unnecessary complexity

C. The action space is discrete, so value-based methods like Q-learning are always preferred

D. The game has a deterministic transition function, eliminating stochasticity in the return

Question 3 True / False

In an actor-critic system, the actor can be updated after every individual time step rather than waiting for a complete episode to end.

T. True

F. False

Question 4 True / False

In an actor-critic architecture, both the actor and the critic are updated using Monte Carlo estimates from complete episode returns.

T. True

F. False

Question 5 Short Answer

What does the 'advantage' signal measure in actor-critic, and why is it used instead of the raw reward to update the actor?

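As a hint for this question, one common way to estimate the advantage is the one-step TD error, r + gamma * V(s') - V(s): how much better the observed outcome was than the critic expected from the current state. The sketch below is illustrative only. The function name `td_advantage`, the discount factor, and the numbers are assumptions made for this example, not part of the quiz.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD-error estimate of the advantage: r + gamma * V(s') - V(s).

    value_s and value_next stand in for the critic's estimates V(s) and V(s').
    """
    # At a terminal step there is no next state to bootstrap from.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# Hypothetical numbers: the critic expected 2.0 from this state, but the step
# yielded reward 1.0 and led to a state the critic values at 3.0.
adv = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5)
print(adv)  # 1.0 + 0.5 * 3.0 - 2.0 = 0.5
```

A positive value suggests the action did better than the critic's baseline and should be made more likely. A negative value suggests the opposite.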
