Questions: Monte Carlo Methods in Reinforcement Learning
5 questions to test your understanding
Question 1 Multiple Choice
An agent uses Monte Carlo value estimation. After 10 episodes, its estimate for a state fluctuates wildly. After 10,000 episodes, the estimate converges. This pattern is best explained by:
A. Monte Carlo uses bootstrapping, which introduces bias that corrects itself with more data
B. Monte Carlo estimates are unbiased but have high variance; averaging many complete returns is needed for convergence
C. Early episodes use a discount factor of γ=1, causing instability that is corrected in later training
D. The reward signal was incorrectly calibrated in the first 10 episodes
Answer: B
Monte Carlo methods do not bootstrap: they use the actual return from each complete episode, so the estimates are unbiased. However, each episode's return is a single noisy sample of the true expected return, because every random event in the episode adds variance. With only 10 samples the average is unreliable; with 10,000, the law of large numbers drives the sample mean toward the true value. This high-variance, low-bias tradeoff is intrinsic to MC's design of using full episode returns.
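The effect of sample size can be seen in a small simulation. This is a minimal sketch under assumed conditions, not part of the quiz: the environment (a fixed 10-step episode with a +1 or -1 reward at each step), the names `P_WIN`, `HORIZON`, `sample_return`, and `mc_estimate` are all hypothetical and chosen only for illustration.

```python
import random
import statistics

P_WIN = 0.6      # hypothetical chance of a +1 reward at each step
HORIZON = 10     # hypothetical fixed episode length
TRUE_VALUE = HORIZON * (P_WIN * 1.0 + (1.0 - P_WIN) * -1.0)  # 10 * 0.2 = 2.0


def sample_return(rng):
    """Undiscounted return (gamma = 1) of one complete episode."""
    return sum(1.0 if rng.random() < P_WIN else -1.0 for _ in range(HORIZON))


def mc_estimate(num_episodes, seed):
    """Monte Carlo estimate: the average return over complete episodes."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(num_episodes)) / num_episodes


# Repeat each experiment with 20 seeds to see how much the estimate moves.
small_runs = [mc_estimate(10, seed) for seed in range(20)]
large_runs = [mc_estimate(10_000, seed) for seed in range(20)]

spread_small = statistics.pstdev(small_runs)
spread_large = statistics.pstdev(large_runs)

print(f"true value: {TRUE_VALUE:.2f}")
print(f"spread of 10-episode estimates:     {spread_small:.3f}")
print(f"spread of 10,000-episode estimates: {spread_large:.3f}")
```

Both sets of estimates are centered on the true value, but the 10-episode estimates scatter far more widely across seeds, which is the fluctuation the question describes.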
Question 2 Multiple Choice
You have logs from an old policy and want to evaluate a new policy without collecting new data. Which Monte Carlo approach makes this possible?
A. Every-visit Monte Carlo, which averages over all visits to each state within an episode
B. Off-policy Monte Carlo with importance sampling, which reweights each return by the ratio of target to behavior policy probabilities
C. First-visit Monte Carlo, restricted to only the first visit to each state per episode
D. Model-based Monte Carlo, which builds an explicit transition model from the logged trajectories
Answer: B
Importance sampling corrects for the mismatch between the behavior policy (which generated the data) and the target policy (which we want to evaluate). Each return is multiplied by the product of probability ratios along the trajectory, which measures how likely that sequence of actions was under the target policy relative to the behavior policy. This requires the behavior policy to assign nonzero probability to every action the target policy might take. The reweighting makes the estimates valid for the target policy, enabling evaluation from historical data without new interaction.
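The product of probability ratios can be sketched in a few lines. This is an illustrative example only: the function name `importance_ratio`, the dictionary layout of the policies, and the two-state, two-action policies are hypothetical assumptions, not something the quiz specifies.

```python
def importance_ratio(trajectory, target_policy, behavior_policy):
    """Product over (state, action) pairs of target prob / behavior prob."""
    rho = 1.0
    for state, action in trajectory:
        b_prob = behavior_policy[state][action]
        if b_prob == 0.0:
            # The logged data cannot contain this action under the behavior
            # policy, so the ratio is undefined.
            raise ValueError("behavior policy assigns zero probability to a logged action")
        rho *= target_policy[state][action] / b_prob
    return rho


# Hypothetical policies: the old (behavior) policy is uniform, the new
# (target) policy strongly prefers "right".
behavior = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}
target = {"s0": {"left": 0.1, "right": 0.9}, "s1": {"left": 0.1, "right": 0.9}}

likely_under_target = [("s0", "right"), ("s1", "right")]
unlikely_under_target = [("s0", "left"), ("s1", "right")]

rho_up = importance_ratio(likely_under_target, target, behavior)      # (0.9/0.5)^2, about 3.24
rho_down = importance_ratio(unlikely_under_target, target, behavior)  # (0.1/0.5)*(0.9/0.5), about 0.36

logged_return = 5.0                          # hypothetical return observed in the log
weighted_return = rho_up * logged_return     # this episode's contribution to the estimate
```

Trajectories the target policy would favor get a ratio above 1 and count for more; trajectories it would rarely produce get a ratio below 1 and count for less.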
Question 3 True / False
Monte Carlo methods in RL bootstrap (they update value estimates using other estimated values), which is why they can update from partial episodes.
True
False
Answer: False
This describes temporal-difference (TD) learning, not Monte Carlo. Monte Carlo methods do the opposite: they wait for the complete episode to end, then use the actual observed return (not any estimated value) to update. This is why MC cannot update mid-episode and why its estimates are unbiased — there is no estimated value injected into the update, only real observed outcomes.
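A first-visit Monte Carlo evaluation loop makes the "wait for the episode to end" point concrete. The sketch below is a minimal illustration under assumptions: the function name `first_visit_mc`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state), and the sample episodes are all hypothetical.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate state values by averaging first-visit returns of complete episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backward from the end of the episode so g is the actual
        # observed return from each step; no estimated value is used.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Overwriting while moving backward keeps the earliest visit's return.
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Two hypothetical complete episodes.
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],  # first-visit returns: A = 3.0, B = 2.0
    [("A", 0.0), ("B", 4.0)],              # first-visit returns: A = 4.0, B = 4.0
]
values = first_visit_mc(episodes)
```

Nothing in the update reads from the current value table, which is what distinguishes it from a bootstrapping method.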
Question 4 True / False
Ordinary importance sampling in off-policy Monte Carlo produces unbiased estimates but can have extremely high variance when the target and behavior policies differ substantially.
True
False
Answer: True
When the target policy assigns high probability to actions that the behavior policy rarely took, the importance sampling ratio becomes very large, so individual weighted returns can be enormous and the variance of the estimate inflates dramatically. Weighted importance sampling addresses this by normalizing by the sum of the ratios rather than the number of episodes, which reduces variance at the cost of a small bias that shrinks as more episodes are collected. This bias-variance tradeoff is a core practical consideration when choosing between the two variants.
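The difference between the two estimators comes down to the denominator. The following sketch assumes each logged episode has already been reduced to a hypothetical `(rho, G)` pair (importance ratio and return); the function names and the sample numbers are made up for illustration.

```python
def ordinary_is(samples):
    """Ordinary importance sampling: divide by the number of episodes."""
    return sum(rho * g for rho, g in samples) / len(samples)


def weighted_is(samples):
    """Weighted importance sampling: divide by the sum of the ratios."""
    total_weight = sum(rho for rho, _ in samples)
    if total_weight == 0.0:
        return 0.0  # convention assumed here when no episode has any weight
    return sum(rho * g for rho, g in samples) / total_weight


# Hypothetical data: every episode returned 1.0, but one episode has a very
# large ratio because the behavior policy rarely took its actions.
samples = [(0.1, 1.0), (0.1, 1.0), (20.0, 1.0)]

ordinary_estimate = ordinary_is(samples)   # (0.1 + 0.1 + 20.0) / 3, about 6.73
weighted_estimate = weighted_is(samples)   # 20.2 / 20.2 = 1.0
```

With these numbers the ordinary estimate lands far outside the range of observed returns because a single ratio dominates, while the weighted estimate is a weighted average and so always stays within that range.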
Question 5 Short Answer
What does it mean for Monte Carlo value estimates to be 'unbiased but high variance,' and why does this tradeoff arise from the method's design?
Model answer: Unbiased means the expected value of the estimate equals the true value, so there is no systematic error, and with enough data the average of Monte Carlo returns converges to the correct answer. High variance means individual estimates can differ wildly from the true value because each estimate is based on a single episode's return, which depends on every random event from that state to the end of the episode. The tradeoff is fundamental: using complete actual returns (no bootstrapping) guarantees no bias from incorrect value estimates, but it also means each sample carries the full noise of an entire episode rather than a one-step correction.
This contrasts with TD learning, which bootstraps (uses estimated values), introducing bias but dramatically reducing variance by updating based on a single step rather than a full episode. The MC/TD tradeoff is one of the core tensions in RL: pure MC is unbiased but slow to converge due to variance; pure TD is biased but lower-variance; methods like TD(λ) interpolate between them.
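The contrast between the two update targets can be written side by side. This is a hedged sketch using the standard incremental forms of the updates; the function names, the `value` dictionary, and the step size `alpha` are illustrative assumptions rather than anything defined in the quiz.

```python
def mc_update(value, state, observed_return, alpha):
    """Monte Carlo: move toward the full observed return of the episode."""
    value[state] += alpha * (observed_return - value[state])


def td0_update(value, state, reward, next_state, alpha, gamma=1.0):
    """TD(0): move toward one real reward plus the current estimate of the next state."""
    target = reward + gamma * value[next_state]  # bootstraps on value[next_state]
    value[state] += alpha * (target - value[state])


# Hypothetical value tables; "T" stands for a terminal state with value 0.
mc_values = {"A": 0.0, "B": 10.0, "T": 0.0}
td_values = {"A": 0.0, "B": 10.0, "T": 0.0}

# MC needs the whole episode's return (here 3.0) before it can update A.
mc_update(mc_values, "A", observed_return=3.0, alpha=0.5)

# TD(0) updates A after a single step, trusting the current estimate for B.
td0_update(td_values, "A", reward=1.0, next_state="B", alpha=0.5)
```

If the estimate for B is wrong, the TD target inherits that error (bias), but it depends on only one random step; the MC target uses no estimate at all but absorbs the randomness of every remaining step in the episode.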