Questions — Monte Carlo Methods in Reinforcement Learning

Question 1 Multiple Choice

An agent uses Monte Carlo value estimation. After 10 episodes, its estimate for a state fluctuates wildly. After 10,000 episodes, the estimate converges. This pattern is best explained by:

A. Monte Carlo uses bootstrapping, which introduces bias that corrects itself with more data

B. Monte Carlo estimates are unbiased but have high variance; averaging many complete returns is needed for convergence

C. Early episodes use a discount factor of γ=1, causing instability that is corrected in later training

D. The reward signal was incorrectly calibrated in the first 10 episodes
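
The scenario in Question 1 can be explored empirically before committing to an answer. The following is a minimal sketch, assuming a hypothetical one-state episodic task: the reward distribution, the continuation probability `p_continue`, and the function names `sample_return`, `mc_estimate`, and `spread` are all illustrative and not part of the quiz. It averages complete-episode returns and measures how much the resulting estimate varies across independent runs.

```python
import random
import statistics


def sample_return(rng, gamma=1.0, p_continue=0.9):
    """Return of one complete episode from a hypothetical one-state task."""
    g, discount = 0.0, 1.0
    while True:
        g += discount * rng.gauss(1.0, 2.0)  # noisy reward with mean 1 (assumed)
        discount *= gamma
        if rng.random() > p_continue:  # episode terminates with probability 0.1
            return g


def mc_estimate(num_episodes, seed):
    """Monte Carlo value estimate: the plain average of complete returns."""
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(num_episodes)]
    return sum(returns) / num_episodes


def spread(num_episodes, trials=10):
    """Standard deviation of the estimate across independent runs."""
    estimates = [mc_estimate(num_episodes, seed) for seed in range(trials)]
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for n in (10, 10_000):
        print(n, "episodes: spread of estimate =", spread(n))
```

Under these assumptions the expected episode length is 10 steps, so the true value of the state is 10. Comparing the printed spreads for 10 and 10,000 episodes shows how the estimate behaves as more returns are averaged.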

Question 2 Multiple Choice

You have logs from an old policy and want to evaluate a new policy without collecting new data. Which Monte Carlo approach makes this possible?

A. Every-visit Monte Carlo, which averages over all visits to each state within an episode

B. Off-policy Monte Carlo with importance sampling, which reweights each return by the ratio of target to behavior policy probabilities

C. First-visit Monte Carlo, restricted to only the first visit to each state per episode

D. Model-based Monte Carlo, which builds an explicit transition model from the logged trajectories

Question 3 True / False

Monte Carlo methods in RL bootstrap (that is, they update value estimates using other estimated values), which is why they need only partial episodes to update.

True

False

Question 4 True / False

Ordinary importance sampling in off-policy Monte Carlo produces unbiased estimates but can have extremely high variance when the target and behavior policies differ substantially.

True

False
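
The claim in Question 4 can be checked numerically. The following is a minimal sketch, assuming a hypothetical single-step task in which action 1 yields a return of 1 and action 0 yields a return of 0, so the true value under the target policy equals its probability of choosing action 1. The names `ordinary_is_estimate` and `summarize` and all probabilities are illustrative assumptions. Each logged return is multiplied by the ratio of target to behavior action probabilities, and the weighted returns are averaged over the number of episodes.

```python
import random
import statistics


def ordinary_is_estimate(num_episodes, target_p, behavior_p, seed):
    """Ordinary importance sampling estimate of the target policy's value.

    target_p and behavior_p are each policy's probability of taking action 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        # Data are generated by the behavior policy only.
        action = 1 if rng.random() < behavior_p else 0
        g = float(action)  # return of this one-step episode (assumed task)
        if action == 1:
            rho = target_p / behavior_p
        else:
            rho = (1 - target_p) / (1 - behavior_p)
        total += rho * g  # reweight the return by the importance ratio
    return total / num_episodes  # ordinary IS divides by the episode count


def summarize(target_p, behavior_p, num_episodes=100, trials=200):
    """Mean and standard deviation of the estimate across independent runs."""
    estimates = [
        ordinary_is_estimate(num_episodes, target_p, behavior_p, seed)
        for seed in range(trials)
    ]
    return statistics.mean(estimates), statistics.pstdev(estimates)


if __name__ == "__main__":
    print("similar policies:  ", summarize(target_p=0.9, behavior_p=0.8))
    print("dissimilar policies:", summarize(target_p=0.9, behavior_p=0.05))
```

Comparing the mean and the spread for the two settings gives a direct way to test both halves of the statement. Multi-step episodes would use the product of per-step ratios, which this one-step sketch does not cover.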

Question 5 Short Answer

What does it mean for Monte Carlo value estimates to be 'unbiased but high variance,' and why does this tradeoff arise from the method's design?

