Questions: Monte Carlo Methods in Reinforcement Learning

5 questions to test your understanding

Question 1 Multiple Choice

An agent uses Monte Carlo value estimation. After 10 episodes, its estimate for a state fluctuates wildly. After 10,000 episodes, the estimate converges. This pattern is best explained by:

A. Monte Carlo uses bootstrapping, which introduces bias that corrects itself with more data
B. Monte Carlo estimates are unbiased but have high variance; averaging many complete returns is needed for convergence
C. Early episodes use a discount factor of γ=1, causing instability that is corrected in later training
D. The reward signal was incorrectly calibrated in the first 10 episodes
Question 2 Multiple Choice

You have logs from an old policy and want to evaluate a new policy without collecting new data. Which Monte Carlo approach makes this possible?

A. Every-visit Monte Carlo, which averages over all visits to each state within an episode
B. Off-policy Monte Carlo with importance sampling, which reweights each return by the ratio of target to behavior policy probabilities
C. First-visit Monte Carlo, restricted to only the first visit to each state per episode
D. Model-based Monte Carlo, which builds an explicit transition model from the logged trajectories
Question 3 True / False

Monte Carlo methods in RL bootstrap (they update value estimates using other estimated values), which is why they can update from partial episodes.

T. True
F. False
Question 4 True / False

Ordinary importance sampling in off-policy Monte Carlo produces unbiased estimates but can have extremely high variance when the target and behavior policies differ substantially.

T. True
F. False
Question 5 Short Answer

What does it mean for Monte Carlo value estimates to be 'unbiased but high variance,' and why does this tradeoff arise from the method's design?

Think through your answer before checking it.
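
To check an answer empirically, the variance behavior these questions ask about can be reproduced with a small Python sketch. Nothing in it comes from the quiz itself: the toy task (a fixed 10-step episode with noisy Gaussian rewards of mean 1), the constants GAMMA and HORIZON, and the helper names sample_return, mc_estimate, and spread are all assumptions made for illustration. It estimates the value of a single start state by averaging complete sampled returns, then measures how much that estimate varies across repeated runs.

```python
import random
import statistics

# Hypothetical toy task, assumed for illustration only.
GAMMA = 0.9     # discount factor
HORIZON = 10    # every episode lasts exactly 10 steps

# Each reward has mean 1, so the true value of the start state is known.
TRUE_VALUE = sum(GAMMA ** t for t in range(HORIZON))


def sample_return(rng):
    """Roll out one complete episode and return its discounted return G."""
    g = 0.0
    for t in range(HORIZON):
        reward = rng.gauss(1.0, 5.0)  # noisy reward: mean 1, std 5
        g += (GAMMA ** t) * reward
    return g


def mc_estimate(num_episodes, rng):
    """Monte Carlo value estimate: the plain average of complete returns.

    No other value estimate is used, so there is no bootstrapping.
    """
    returns = [sample_return(rng) for _ in range(num_episodes)]
    return statistics.fmean(returns)


def spread(num_episodes, trials=30, seed=0):
    """Standard deviation of the estimate across independent repetitions."""
    rng = random.Random(seed)
    estimates = [mc_estimate(num_episodes, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    print("true value:", TRUE_VALUE)
    for n in (10, 1000):
        print(n, "episodes -> spread of estimate:", spread(n))
```

Under these assumptions, spread(10) should come out much larger than spread(1000), while an estimate built from many episodes should land close to TRUE_VALUE. The estimate is centered on the true value at any sample size; only the scatter around it shrinks as more complete returns are averaged.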