A student claims MCTS evaluates board positions using a heuristic function, just like minimax. What actually happens during the simulation (rollout) phase of MCTS?
AMCTS calls a neural network to evaluate the position and assign a score
BMCTS uses the same alpha-beta pruning as minimax but in a randomized order
CMCTS plays the game out to completion using a random or lightly guided policy, then uses the win/loss result as the evaluation
DMCTS applies a domain-specific heuristic to estimate the probability of winning from the position
The key innovation of MCTS is replacing the evaluation function with random playouts. Instead of needing a heuristic that scores a board position, MCTS simulates complete games from a position — randomly or with a simple policy — and uses the win/loss outcome directly. If you win 700 out of 1000 random games from a position, that position is estimated as strong. This is why MCTS works so well in Go and other domains where good evaluation functions do not exist: it requires no domain-specific scoring knowledge.
Question 2 Multiple Choice
After running MCTS for a fixed time budget, how do you select which move to actually play?
AChoose the child of the root with the highest UCB1 score, since that balances value and exploration
BChoose the child of the root that was visited most often, as visit count reflects accumulated confidence
CChoose the child of the root with the highest average reward, regardless of visit count
DChoose a random child, weighted by each child's average reward
At decision time, you choose the most-visited child of the root — not the highest UCB1 score and not the highest average reward. The UCB1 formula is used during tree traversal to balance exploration and exploitation while building the tree. But at the end, visit count is the most reliable signal: a move that has been visited many times has had its average reward estimate refined by many simulations. A move with high average reward but few visits might be a statistical fluke. The most-visited child represents the algorithm's most confident recommendation.
Question 3 True / False
MCTS requires that the game be played to a terminal state before any useful information is obtained, meaning it can seldom return a move recommendation until the search is complete.
TTrue
FFalse
Answer: False
MCTS is an 'anytime' algorithm — at any point during the search, the most-visited child of the root is a valid move recommendation. You can stop after 10 iterations or 10 million, and the algorithm will give you its best current estimate. This is a significant practical advantage: in time-constrained settings (like competitive game play), you run MCTS for the available time budget and then immediately use the current recommendation. The recommendation simply improves with more iterations.
Question 4 True / False
The UCB1 formula in MCTS selects moves with high average reward but also adds an exploration bonus for moves that have been visited less often.
TTrue
FFalse
Answer: True
UCB1 = (average reward) + C × √(ln(parent visits) / child visits). The first term exploits what is currently known — prefer moves that have won often. The second term explores — the bonus grows as a child is visited less relative to its parent, ensuring that underexplored moves are eventually tried. This mirrors the exploration-exploitation tradeoff from multi-armed bandit problems. Without the exploration bonus, MCTS would greedily concentrate on the first move that looks good and never discover better alternatives.
Question 5 Short Answer
Explain how the UCB formula in MCTS prevents the algorithm from permanently ignoring a move that happened to lose in its first few simulations.
Think about your answer, then reveal below.
Model answer: The UCB1 formula includes an exploration bonus that grows as a node's visit count falls behind its siblings. When a move has been tried only a few times, the exploration term is large, making UCB1 artificially inflate its apparent value and forcing the algorithm to revisit it. Only after a move accumulates enough visits to give its average reward a statistically reliable estimate will the exploration bonus shrink relative to the exploitation term. This guarantees that every legal move is eventually explored, and clearly bad moves are only 'abandoned' gradually — they still get occasional visits, but far fewer than promising moves.
This is the core contribution of UCB to MCTS. Pure exploitation (always pick the highest average reward) gets stuck on local optima. Pure exploration (visit everything equally) wastes effort on clearly bad moves. UCB1 — adapted from the multi-armed bandit literature — is provably near-optimal for balancing these concerns, ensuring the total regret grows only logarithmically with the number of iterations.