Markov Decision Processes

Research Depth 71 in the knowledge graph I know this Set as goal
Unlocks 19 downstream topics
decision-making markov-models optimization

Core Idea

MDPs extend Markov chains with actions and rewards, modeling sequential decision-making under uncertainty. States, actions, transition probabilities, and reward functions define an MDP. Value iteration and policy iteration compute optimal policies maximizing expected cumulative reward.

Explainer

A Markov chain describes a system that hops between states according to fixed transition probabilities — you have no control over what happens next. A Markov Decision Process (MDP) adds two new ingredients: *actions* and *rewards*. At each step, an agent observes the current state, chooses an action, receives a reward, and transitions to a new state. The key insight is that the agent's goal is not to respond to a single event but to maximize the *total* reward accumulated over time — which means current decisions must account for their downstream consequences.

The MDP is defined by four objects: a set of states S, a set of actions A, a transition function T(s, a, s') giving the probability of reaching s' when taking action a in state s, and a reward function R(s, a) giving the immediate payoff. The Markov property — that T depends only on the current state and action, not on history — is what makes this tractable. Without it, an agent would need to track the entire sequence of past states to plan optimally.

To find the best behavior, we define a *value function* V(s): the maximum expected cumulative reward achievable from state s. The Bellman optimality equation expresses V(s) recursively — the value of a state is the best immediate reward plus the discounted value of the best next state. Value iteration repeatedly applies this equation until V converges, then extracts the optimal policy by choosing, in each state, the action that achieves that maximum. Policy iteration takes a different route: start with any policy, evaluate it (compute its value function), improve it greedily, and repeat — provably converging to the optimal policy in finite steps.

The discount factor γ ∈ [0, 1) controls how much the agent values future rewards relative to immediate ones. γ close to 1 means the agent is patient and plans far ahead; γ close to 0 means it is myopic. Mathematically, γ < 1 also guarantees that the infinite sum of rewards converges to a finite number, and it makes the Bellman operator a contraction — the property that makes value iteration converge.

MDPs are the theoretical backbone of reinforcement learning. Real-world RL algorithms like Q-learning and policy gradient methods can be understood as solving MDPs when the transition and reward functions are unknown and must be estimated from experience. Mastering the MDP framework — its structure, its Bellman equations, and its solution algorithms — gives you the conceptual tools to reason about any sequential decision problem under uncertainty.

Practice Questions 3 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesAngle Pairs: Complementary, Supplementary, and VerticalParallel Lines and TransversalsCorresponding AnglesAlternate Interior AnglesTriangle Angle Sum TheoremExterior Angle TheoremTriangle Inequality TheoremSimilar Triangles: AA SimilaritySimilar Triangles: SSS and SAS SimilarityProportions in Similar TrianglesRight Triangle Trigonometry IntroductionTrigonometric Ratios ReviewRadian MeasureConverting Between Degrees and RadiansThe Unit CircleGraphing Sine and CosineGraphing Tangent and Reciprocal Trigonometric FunctionsDerivatives of Trigonometric FunctionsAntiderivativesIterated Integrals and Fubini's TheoremJoint Distributions and Marginals (Rigorous)Independence of Sigma-AlgebrasConditional ExpectationMarkov ChainsMarkov Decision Processes

Longest path: 72 steps · 428 total prerequisite topics

Prerequisites (6)

Leads To (4)