A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Policy Networks and Policy Gradients

Research · Depth 99 in the knowledge graph

21 topics build on this

761 prerequisites beneath it

Neural Network Fundamentals, Policy Gradient Methods → Policy Networks and Policy Gradients → Actor-Critic Methods

Core Idea

Policy networks directly parameterize the policy π(a|s) with a neural network, which makes it straightforward to learn stochastic policies and to act in continuous action spaces. Policy gradient algorithms estimate the gradient of expected return with respect to the policy parameters from sampled trajectories; REINFORCE weights each step by the sampled return, while more sophisticated methods reduce variance through baselines and advantage functions.

How It's Best Learned

Implement REINFORCE and train a policy network on a continuous control task, then add a baseline to reduce variance and observe faster convergence.

Explainer

From your work on policy gradient methods, you know the core idea: adjust the policy parameters so that actions leading to higher returns become more probable. From neural networks, you know how to build flexible function approximators that map inputs to outputs through layers of learned transformations. A policy network combines these two ideas — it is a neural network that takes a state as input and outputs a probability distribution over actions, directly representing the policy π(a|s; θ) where θ are the network weights.
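As a rough illustration of the paragraph above, here is a minimal sketch of a policy network for a discrete action space, written with NumPy only. The layer sizes, the tanh hidden layer, and the helper names (`init_policy`, `action_probs`, `sample_action`) are assumptions made for this example, not something the topic prescribes.

```python
import numpy as np

def init_policy(state_dim, hidden_dim, n_actions, rng):
    """Create the weights theta of a small two-layer policy network."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(state_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, n_actions)),
        "b2": np.zeros(n_actions),
    }

def action_probs(theta, state):
    """Forward pass: state -> pi(. | s; theta) as a softmax over actions."""
    hidden = np.tanh(state @ theta["W1"] + theta["b1"])
    logits = hidden @ theta["W2"] + theta["b2"]
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def sample_action(theta, state, rng):
    """Draw an action from the distribution rather than taking an argmax."""
    probs = action_probs(theta, state)
    return int(rng.choice(len(probs), p=probs)), probs

# Hypothetical 4-dimensional state and 2 actions, chosen only for the demo.
rng = np.random.default_rng(0)
theta = init_policy(state_dim=4, hidden_dim=16, n_actions=2, rng=rng)
state = np.array([0.1, -0.2, 0.05, 0.0])
action, probs = sample_action(theta, state, rng)
```

Sampling from `probs` instead of picking the most likely action is what makes the policy stochastic; the same forward pass also supplies the log-probability that policy gradient updates differentiate.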

The simplest policy gradient algorithm is REINFORCE. After the agent completes an episode, REINFORCE computes the return Gₜ (the cumulative discounted reward from time step t onward) for each time step, then updates the network weights to make actions with higher returns more likely. Each step contributes a term with an intuitive form, ∇θ log π(aₜ|sₜ; θ) × Gₜ, and the terms are summed over the episode. The log-probability gradient points in the direction that would increase the probability of action aₜ, and the return Gₜ scales how far you step in that direction. Actions followed by high returns are reinforced strongly; actions followed by low or negative returns are reinforced weakly or suppressed. Because the network outputs a full probability distribution (for example a softmax over discrete actions, or the mean and standard deviation of a Gaussian for continuous actions), this approach naturally handles stochastic policies and continuous action spaces that value-based methods struggle with.
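The update described above can be sketched in a few lines. To keep the gradient easy to verify by hand, this example assumes a linear-softmax policy (logits = s @ W) instead of a deep network; the function names, the learning rate, and the toy two-step episode are all illustrative assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(W, state, action):
    """Gradient of log pi(a | s; W) for a linear-softmax policy."""
    probs = softmax(state @ W)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(state, one_hot - probs)

def reinforce_update(W, states, actions, rewards, gamma=0.99, lr=0.01):
    """One REINFORCE step computed from a single finished episode."""
    returns = discounted_returns(rewards, gamma)
    grad = np.zeros_like(W)
    for s, a, G in zip(states, actions, returns):
        grad += grad_log_pi(W, s, a) * G  # log-prob gradient scaled by G_t
    return W + lr * grad  # gradient ascent on expected return

# Toy two-step episode (made-up data, only to exercise the functions).
W = np.zeros((2, 2))
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
actions = [0, 1]
rewards = [1.0, 1.0]
W_new = reinforce_update(W, states, actions, rewards, gamma=0.5, lr=0.1)
```

With these inputs the returns are 1.5 and 1.0, so the first action is pushed up more strongly than the second. In a real implementation an automatic differentiation library would normally replace the hand-written `grad_log_pi`.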

The central challenge with REINFORCE is high variance. Returns from individual episodes fluctuate widely: a lucky rollout might give a high return to a mediocre action, and an unlucky one might penalize a good action. This noise makes learning slow and unstable. The standard fix is to subtract a baseline from the return: instead of scaling the gradient by Gₜ, you scale by Gₜ − b(sₜ), where b is typically an estimate of the expected return from state sₜ. Because the baseline depends only on the state and not on the action taken, subtracting it leaves the expected gradient unchanged (the estimator stays unbiased), yet a well-chosen baseline can reduce variance substantially. When b(sₜ) approximates the state value V(sₜ), the quantity Gₜ − b(sₜ) is an estimate of the advantage: it tells you whether this action was better or worse than average for this state, which is a much cleaner learning signal than the raw return.
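The claim that a baseline leaves the expected gradient unchanged can be checked directly. The short derivation below assumes a discrete action space and a baseline b(s) that does not depend on the action; for continuous actions the sum becomes an integral and the steps are the same.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi(\cdot \mid s;\theta)}\big[\nabla_\theta \log \pi(a \mid s;\theta)\, b(s)\big]
  &= b(s) \sum_a \pi(a \mid s;\theta)\, \nabla_\theta \log \pi(a \mid s;\theta) \\
  &= b(s) \sum_a \nabla_\theta \pi(a \mid s;\theta) \\
  &= b(s)\, \nabla_\theta \sum_a \pi(a \mid s;\theta) \\
  &= b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

The second line uses the identity π ∇θ log π = ∇θ π, and the last line uses the fact that the action probabilities sum to one for any θ. The subtracted term therefore contributes nothing in expectation and only affects the variance.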

In practice, the baseline is often a separate neural network, a value network V(s; φ), trained alongside the policy network. This leads naturally to actor-critic architectures, where the "actor" (policy network) decides what to do and the "critic" (value network) evaluates how good the decision was. Policy networks are widely used for complex control tasks such as robotic locomotion and game playing, and more generally in domains where the action space is continuous or the optimal behavior is inherently stochastic. Their ability to directly optimize the quantity you care about (expected return) without needing to enumerate all possible actions makes them a cornerstone of modern reinforcement learning.
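To make the critic's role concrete, here is a small sketch of a learned baseline. It assumes a linear value function V(s; φ) = s · φ fitted by regression toward observed returns, which stands in for the value network; the names `value`, `critic_update`, and `advantages` and the toy data are hypothetical.

```python
import numpy as np

def value(phi, state):
    """Critic V(s; phi): a linear function of the state in this sketch."""
    return float(state @ phi)

def critic_update(phi, states, returns, lr=0.1):
    """One gradient step on the mean of 0.5 * (G_t - V(s_t; phi))^2."""
    grad = np.zeros_like(phi)
    for s, G in zip(states, returns):
        grad += (G - value(phi, s)) * s
    return phi + lr * grad / len(states)

def advantages(phi, states, returns):
    """A_t is approximated by G_t - V(s_t; phi); the actor scales by this."""
    return np.array([G - value(phi, s) for s, G in zip(states, returns)])

# Two one-hot states whose typical returns are 2.0 and 1.0 (made-up data).
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
typical_returns = [2.0, 1.0]

phi = np.zeros(2)
for _ in range(200):
    phi = critic_update(phi, states, typical_returns)

# A new episode that did better than usual in state 0 and worse in state 1.
adv = advantages(phi, states, [3.0, 0.5])
```

After fitting, `phi` is close to the typical returns, so the advantages for the new episode come out near +1.0 and -0.5: positive where the outcome beat the critic's expectation, negative where it fell short. The actor would use these values in place of the raw returns.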

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Conditional Distributions → Conditional Expectation → Markov Chains → Markov Decision Processes → Introduction to Reinforcement Learning → Policy Gradient Methods → Policy Networks and Policy Gradients

Longest path: 100 steps · 761 total prerequisite topics

Prerequisites (2)

Policy Gradient Methods (hard), Neural Network Fundamentals (hard)

Leads To (1)

Actor-Critic Methods (hard)