A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Markov Decision Processes

Research · Depth 96 in the knowledge graph

27 topics build on this

612 prerequisites beneath it

Core Idea

MDPs extend Markov chains with actions and rewards, modeling sequential decision-making under uncertainty. States, actions, transition probabilities, and reward functions define an MDP. Value iteration and policy iteration compute optimal policies maximizing expected cumulative reward.

Explainer

A Markov chain describes a system that hops between states according to fixed transition probabilities — you have no control over what happens next. A Markov Decision Process (MDP) adds two new ingredients: *actions* and *rewards*. At each step, an agent observes the current state, chooses an action, receives a reward, and transitions to a new state. The key insight is that the agent's goal is not to respond to a single event but to maximize the *total* reward accumulated over time — which means current decisions must account for their downstream consequences.

The MDP is defined by four objects: a set of states S, a set of actions A, a transition function T(s, a, s') giving the probability of reaching s' when taking action a in state s, and a reward function R(s, a) giving the immediate payoff. The Markov property — that T depends only on the current state and action, not on history — is what makes this tractable. Without it, an agent would need to track the entire sequence of past states to plan optimally.
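
As a rough illustration of the four objects above, the sketch below writes out a small MDP as plain Python dictionaries. The two states ("low", "high"), the two actions ("stay", "move"), and all probabilities and rewards are hypothetical values chosen for this example; they do not come from any particular problem.

```python
import random

# Hypothetical toy MDP: names and numbers are assumptions for illustration.
STATES = ["low", "high"]
ACTIONS = ["stay", "move"]

# T[(s, a)] maps each next state s' to its probability T(s, a, s').
T = {
    ("low", "stay"): {"low": 0.9, "high": 0.1},
    ("low", "move"): {"low": 0.2, "high": 0.8},
    ("high", "stay"): {"low": 0.1, "high": 0.9},
    ("high", "move"): {"low": 0.8, "high": 0.2},
}

# R[(s, a)] is the immediate reward R(s, a).
R = {
    ("low", "stay"): 0.0,
    ("low", "move"): 0.0,
    ("high", "stay"): 1.0,
    ("high", "move"): 0.0,
}


def is_valid_mdp():
    """Check that every (state, action) pair has a proper distribution."""
    for s in STATES:
        for a in ACTIONS:
            probs = T[(s, a)]
            if set(probs) != set(STATES):
                return False
            if any(p < 0 for p in probs.values()):
                return False
            if abs(sum(probs.values()) - 1.0) > 1e-9:
                return False
    return True


def step(state, action, rng):
    """Sample one transition; the outcome depends only on (state, action)."""
    probs = T[(state, action)]
    weights = [probs[s] for s in STATES]
    next_state = rng.choices(STATES, weights=weights, k=1)[0]
    return R[(state, action)], next_state


rng = random.Random(0)
reward, next_state = step("high", "stay", rng)
```

Note that `step` takes only the current state and action as inputs, with no history argument. That is the Markov property expressed as a function signature.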

To find the best behavior, we define a *value function* V(s): the maximum expected cumulative discounted reward achievable from state s. The Bellman optimality equation expresses V(s) recursively: the value of a state is the maximum, over actions, of the immediate reward plus the discounted *expected* value of the next state, V(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ], where γ is the discount factor. Value iteration repeatedly applies this equation until V converges, then extracts the optimal policy by choosing, in each state, the action that achieves that maximum. Policy iteration takes a different route: start with any policy, evaluate it (compute its value function), improve it greedily, and repeat. For an MDP with finitely many states and actions, this provably converges to an optimal policy in a finite number of iterations.
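
The two algorithms described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the two-state MDP, the discount of 0.9, the tolerances, and all function names are hypothetical, and policy evaluation is done by simple iteration rather than by solving a linear system.

```python
# Hypothetical two-state, two-action MDP (values are assumptions).
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]  # 0 = "stay", 1 = "move"

# T[s][a][s2] = probability of reaching s2 after taking a in s.
T = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.2]],
]
# R[s][a] = immediate reward for taking a in s.
R = [
    [0.0, 0.0],
    [1.0, 0.0],
]


def q_value(V, s, a):
    """One-step lookahead: R(s, a) + gamma * sum_s' T(s, a, s') V(s')."""
    return R[s][a] + GAMMA * sum(T[s][a][s2] * V[s2] for s2 in STATES)


def greedy_policy(V):
    """In each state, pick the action with the best one-step lookahead."""
    return [max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES]


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until V stops changing."""
    V = [0.0 for _ in STATES]
    while True:
        new_V = [max(q_value(V, s, a) for a in ACTIONS) for s in STATES]
        delta = max(abs(n - o) for n, o in zip(new_V, V))
        V = new_V
        if delta < tol:
            return V, greedy_policy(V)


def evaluate_policy(policy, tol=1e-10):
    """Iteratively compute the value function of a fixed policy."""
    V = [0.0 for _ in STATES]
    while True:
        new_V = [q_value(V, s, policy[s]) for s in STATES]
        delta = max(abs(n - o) for n, o in zip(new_V, V))
        V = new_V
        if delta < tol:
            return V


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0 for _ in STATES]
    while True:
        V = evaluate_policy(policy)
        improved = greedy_policy(V)
        if improved == policy:
            return V, policy
        policy = improved


V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration()
```

On this toy problem both routines should agree on the same policy ("move" in state 0, "stay" in state 1) and on the same values up to the chosen tolerance.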

The discount factor γ ∈ [0, 1) controls how much the agent values future rewards relative to immediate ones. γ close to 1 means the agent is patient and plans far ahead; γ close to 0 means it is myopic. Mathematically, when rewards are bounded, γ < 1 also guarantees that the infinite discounted sum of rewards converges to a finite number, and it makes the Bellman operator a contraction, which is the property that makes value iteration converge.
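
To make the effect of γ concrete, the sketch below computes a discounted return for a finite list of rewards. The helper name `discounted_return` and the reward sequences are assumptions for illustration. With a constant reward of 1 and γ = 0.9, the sum approaches the geometric-series limit 1 / (1 − γ) = 10, which is the finite bound that discounting provides.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite list."""
    total = 0.0
    # Work backwards so each step is a single multiply-add (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A long run of reward 1 per step gets close to 1 / (1 - gamma).
patient = discounted_return([1.0] * 1000, 0.9)

# With gamma = 0 the agent is fully myopic: only the first reward counts.
myopic = discounted_return([3.0, 5.0], 0.0)
```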

MDPs are the theoretical backbone of reinforcement learning. RL algorithms like Q-learning and policy gradient methods can be understood as solving MDPs when the transition and reward functions are unknown, so the agent must learn from sampled experience rather than from a given model. Mastering the MDP framework (its structure, its Bellman equations, and its solution algorithms) gives you the conceptual tools to reason about sequential decision problems under uncertainty.
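
As a hedged illustration of learning from experience, here is a minimal tabular Q-learning sketch. The simulator, the two-state MDP behind it, the step size `ALPHA`, the exploration rate `EPSILON`, and the step count are all hypothetical choices for this example. The learner never reads `T` or `R` directly; it only sees the sampled rewards and next states returned by `sample_step`.

```python
import random

GAMMA = 0.9    # discount factor
ALPHA = 0.1    # learning rate (assumed value)
EPSILON = 0.1  # exploration probability (assumed value)

# Hidden environment model (hypothetical); used only inside sample_step.
T = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.2]],
]
R = [
    [0.0, 0.0],
    [1.0, 0.0],
]


def sample_step(s, a, rng):
    """Simulate one transition and return (reward, next_state)."""
    s2 = rng.choices([0, 1], weights=T[s][a], k=1)[0]
    return R[s][a], s2


def q_learning(num_steps=20000, seed=0):
    """Learn Q(s, a) from sampled transitions with epsilon-greedy actions."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(2)]
    s = 0
    for _ in range(num_steps):
        if rng.random() < EPSILON:
            a = rng.randrange(2)  # explore
        else:
            a = max(range(2), key=lambda act: Q[s][act])  # exploit
        r, s2 = sample_step(s, a, rng)
        # Move Q(s, a) toward the sampled Bellman target.
        target = r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2
    return Q


Q = q_learning()
greedy = [max(range(2), key=lambda act: Q[s][act]) for s in range(2)]
```

Because rewards here lie in [0, 1] and the table starts at zero, every learned value stays between 0 and 1 / (1 − γ) = 10. With enough steps the greedy policy read off from `Q` typically matches the one found by planning with the known model, though a fixed learning rate means the estimates keep fluctuating.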

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Conditional Distributions → Conditional Expectation → Markov Chains → Markov Decision Processes

Longest path: 97 steps · 612 total prerequisite topics

Prerequisites (6)

Markov Chains (hard) · Dynamic Programming (hard) · Probability Axioms and Rules (soft) · Probability Axioms (soft) · Expected Value (soft) · Expected Value and Variance (soft)

Leads To (4)

Introduction to Reinforcement Learning (hard) · Model-Based Reinforcement Learning (hard) · Q-Learning Algorithm (hard) · Temporal Difference Learning (hard)