B. The next state depends only on the current state and action, not on history
C. The optimal policy must be deterministic
D. Transition probabilities are uniform across all actions
The Markov property states that the future (next state and reward) depends only on the current state and action, not on the full history of states and actions. This makes MDPs tractable — you don't need to store or remember the entire sequence of past events to make optimal decisions.
Question 2 True / False
In an MDP with a discount factor γ = 1 and an infinite horizon, value iteration is expected to converge to the optimal value function in a finite number of iterations.
T. True
F. False
Answer: False
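When γ = 1 the Bellman optimality operator is no longer guaranteed to be a contraction mapping, so value iteration may fail to converge in an infinite-horizon setting; the value estimates can grow without bound. With γ < 1, the operator is a γ-contraction in the max norm, so the error shrinks by at least a factor of γ per iteration, although even then convergence is generally asymptotic rather than reached in a finite number of iterations. γ = 1 is safe only in finite-horizon problems or in episodic MDPs where every episode is guaranteed to terminate.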
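The effect of the discount factor can be seen in a minimal sketch. The two-state MDP below is hypothetical, as are the `P[s][a]` and `R[s][a]` layouts and the `value_iteration` helper name; every action pays +1 and nothing terminates, so undiscounted returns are infinite.

```python
def value_iteration(P, R, gamma, sweeps):
    """Run a fixed number of synchronous Bellman optimality backups.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    immediate reward; both layouts are assumptions made for this sketch.
    Returns the final value estimates and the max-norm change of each sweep.
    """
    V = [0.0] * len(P)
    deltas = []
    for _ in range(sweeps):
        new_V = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
        deltas.append(max(abs(n - o) for n, o in zip(new_V, V)))
        V = new_V
    return V, deltas


# Hypothetical two-state MDP with no terminal state: every action pays +1.
P = [
    [[(1.0, 0)], [(0.5, 0), (0.5, 1)]],  # state 0: stay, or move at random
    [[(1.0, 1)], [(1.0, 0)]],            # state 1: stay, or go back to state 0
]
R = [[1.0, 1.0], [1.0, 1.0]]

V_disc, deltas_disc = value_iteration(P, R, gamma=0.9, sweeps=50)
V_undisc, deltas_undisc = value_iteration(P, R, gamma=1.0, sweeps=50)

print("gamma=0.9: last change", deltas_disc[-1], "values", V_disc)
print("gamma=1.0: last change", deltas_undisc[-1], "values", V_undisc)
```

With γ = 0.9 the per-sweep change shrinks geometrically and the values approach 1 / (1 − γ) = 10, while with γ = 1 each sweep adds exactly 1 to every state's value and the estimates never settle.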
Question 3 Short Answer
What is the difference between a policy and a value function in an MDP, and how are they related?
Model answer: A policy maps states to actions (or probability distributions over actions) and specifies the agent's behavior. A value function assigns each state a scalar representing the expected cumulative discounted reward when following a particular policy from that state. They are related through the Bellman equations: given a policy, its value function can be computed by policy evaluation; given a value function, an improved policy can be extracted by choosing the action that maximizes expected future value.
This relationship is the engine of both policy iteration and value iteration. Policy iteration alternates between evaluating the current policy's value function and improving the policy greedily with respect to it; value iteration collapses both steps into one Bellman backup. Understanding that a policy and a value function are complementary representations of agent behavior is the conceptual foundation of reinforcement learning.
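The two algorithms can be compared side by side in a short sketch. The MDP, the discount factor, the stopping tolerance, and all function names below are assumptions chosen for illustration, not taken from any particular library.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[s][a] lists (probability, next_state) pairs,
# R[s][a] is the immediate reward for taking action a in state s.
P = [
    [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],  # state 0: stay, or try to reach state 1
    [[(1.0, 0)], [(1.0, 1)]],            # state 1: leave, or stay and collect reward
]
R = [[0.0, 0.0], [0.0, 1.0]]


def q_value(V, s, a):
    """One-step lookahead: immediate reward plus discounted expected next value."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: repeat Bellman expectation backups for a fixed policy."""
    V = [0.0] * len(P)
    while True:
        new_V = [q_value(V, s, policy[s]) for s in range(len(P))]
        if max(abs(n - o) for n, o in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def greedy(V):
    """Policy improvement: pick the action with the highest lookahead value."""
    return [
        max(range(len(P[s])), key=lambda a: q_value(V, s, a))
        for s in range(len(P))
    ]


def policy_iteration():
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)
    while True:
        V = evaluate(policy)
        improved = greedy(V)
        if improved == policy:
            return policy, V
        policy = improved


def value_iteration(tol=1e-10):
    """Fold evaluation and improvement into a single max-backup per sweep."""
    V = [0.0] * len(P)
    while True:
        new_V = [
            max(q_value(V, s, a) for a in range(len(P[s])))
            for s in range(len(P))
        ]
        if max(abs(n - o) for n, o in zip(new_V, V)) < tol:
            return greedy(new_V), new_V
        V = new_V


pi_policy, pi_V = policy_iteration()
vi_policy, vi_V = value_iteration()
print("policy iteration:", pi_policy, pi_V)
print("value iteration: ", vi_policy, vi_V)
```

On this small example both routines should arrive at the same policy and nearly identical values, which illustrates that they are two schedules for the same evaluate-and-improve loop rather than different objectives.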