Policy Networks and Policy Gradients

Research Depth 74 in the knowledge graph
reinforcement-learning policy-based actor-methods policy-gradient

Core Idea

Policy networks directly parameterize the policy π(a|s) with a neural network, which makes it possible to learn stochastic policies and to act in continuous action spaces. Policy gradient algorithms estimate the gradient of expected return with respect to the policy parameters from sampled trajectories; REINFORCE weights each log-probability gradient by the Monte Carlo return, while more sophisticated methods reduce variance through baselines and advantage functions.

How It's Best Learned

Implement REINFORCE and train a policy network on a continuous control task, then add a baseline to reduce variance and observe faster convergence.

Explainer

From your work on policy gradient methods, you know the core idea: adjust the policy parameters so that actions leading to higher returns become more probable. From neural networks, you know how to build flexible function approximators that map inputs to outputs through layers of learned transformations. A policy network combines these two ideas — it is a neural network that takes a state as input and outputs a probability distribution over actions, directly representing the policy π(a|s; θ) where θ are the network weights.
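As a concrete picture of the mapping from state to action distribution, here is a minimal sketch of a policy network's forward pass. It assumes NumPy, a discrete action space, and a single tanh hidden layer; the names `init_policy` and `action_probs` and all layer sizes are illustrative choices, not part of any library.

```python
import numpy as np

def init_policy(state_dim, hidden_dim, n_actions, rng):
    """Randomly initialise the weights theta of a one-hidden-layer policy network."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(state_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, n_actions)),
        "b2": np.zeros(n_actions),
    }

def action_probs(theta, state):
    """Forward pass: state -> hidden layer -> softmax, i.e. pi(.|s; theta)."""
    hidden = np.tanh(state @ theta["W1"] + theta["b1"])
    logits = hidden @ theta["W2"] + theta["b2"]
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
theta = init_policy(state_dim=4, hidden_dim=16, n_actions=3, rng=rng)
state = np.array([0.1, -0.2, 0.05, 0.0])  # hypothetical 4-dimensional state
probs = action_probs(theta, state)
action = rng.choice(len(probs), p=probs)  # sample a ~ pi(.|s; theta)
```

For a continuous action space, the same network body could instead output the mean (and possibly the log standard deviation) of a Gaussian, with the action sampled from that distribution.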

The simplest policy gradient algorithm is REINFORCE. After the agent completes an episode, REINFORCE computes the return (cumulative discounted reward from that step onward) for each time step, then updates the network weights to make actions with higher returns more likely. The gradient estimate contributed by each time step has an intuitive form: ∇θ log π(aₜ|sₜ; θ) × Gₜ. The log-probability gradient points in the direction that would increase the probability of action aₜ, and the return Gₜ scales how far you step in that direction. Actions followed by high returns are reinforced strongly; actions followed by low or negative returns are reinforced weakly or suppressed. Because the network outputs a full probability distribution (for example, a softmax over discrete actions or the parameters of a Gaussian for continuous actions), this approach naturally handles stochastic policies and continuous action spaces, both of which value-based methods struggle with.
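The update just described can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a linear-softmax policy (logits = s @ W) so that the log-probability gradient can be written by hand; the function names, the three-step episode, and the step size are all hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_gradient(W, states, actions, rewards, gamma):
    """Sum over t of grad_W log pi(a_t|s_t; W) * G_t for a linear-softmax policy.

    W has shape (state_dim, n_actions); logits = s @ W.
    """
    returns = discounted_returns(rewards, gamma)
    grad = np.zeros_like(W)
    for s, a, G in zip(states, actions, returns):
        dlogits = -softmax(s @ W)
        dlogits[a] += 1.0  # grad of log pi(a|s) w.r.t. logits: onehot(a) - probs
        grad += np.outer(s, dlogits) * G
    return grad

# Hypothetical three-step episode with reward 1 at every step.
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
actions = [0, 1, 0]
rewards = [1.0, 1.0, 1.0]
W = np.zeros((2, 2))
grad = reinforce_gradient(W, states, actions, rewards, gamma=0.5)
W_new = W + 0.1 * grad  # gradient ascent on expected return
```

With a deep network you would normally let an automatic differentiation library compute the log-probability gradient rather than deriving it by hand as this sketch does.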

The central challenge with REINFORCE is high variance. Returns from individual episodes fluctuate wildly: a lucky rollout might give a high return to a mediocre action, and an unlucky one might penalize a good action. This noise makes learning slow and unstable. The standard fix is to subtract a baseline from the return: instead of scaling the gradient by Gₜ, you scale by Gₜ − b(sₜ), where b is typically an estimate of the expected return from state sₜ. Because b(sₜ) does not depend on the action taken, the subtracted term has zero expectation, so the gradient estimate stays unbiased while its variance usually drops substantially. When b(sₜ) approximates the state value V(sₜ), the quantity Gₜ − b(sₜ) is an estimate of the advantage: it tells you whether this action turned out better or worse than average for this state, which is a much cleaner learning signal than the raw return.
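The effect of a baseline can be checked exactly in a toy setting. The sketch below assumes NumPy and a one-step problem with two actions and made-up deterministic returns, so there is a single state and the baseline is just a constant; `estimator_stats` is a hypothetical helper that enumerates both actions instead of sampling them.

```python
import numpy as np

def estimator_stats(probs, rewards, baseline):
    """Exact mean and total variance of grad log pi(a) * (R(a) - b) under a ~ probs.

    The policy is a softmax over logits, so grad_logits log pi(a) = onehot(a) - probs.
    """
    n = len(probs)
    samples = np.array(
        [(np.eye(n)[a] - probs) * (rewards[a] - baseline) for a in range(n)]
    )
    mean = probs @ samples  # expectation over actions
    variance = probs @ ((samples - mean) ** 2).sum(axis=1)
    return mean, variance

probs = np.array([0.5, 0.5])      # current policy
rewards = np.array([10.0, 11.0])  # hypothetical return of each action

mean_raw, var_raw = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=probs @ rewards)
```

In this example both estimators share the same expected gradient, but the one using the expected return as its baseline has far lower variance, because the large common offset in the returns no longer multiplies the log-probability gradient.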

In practice, the baseline is often a separate neural network, a value network V(s; φ), trained alongside the policy network. This leads naturally to actor-critic architectures, where the "actor" (policy network) decides what to do and the "critic" (value network) evaluates how good the decision was. Policy networks are widely used for complex control tasks such as robotic locomotion and game playing, and more generally in domains where the action space is continuous or the optimal behavior is inherently stochastic. Their ability to directly optimize the quantity you care about (expected return) without needing to enumerate or maximize over all possible actions makes them a cornerstone of modern reinforcement learning.
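A minimal sketch of the actor and critic being trained side by side follows. It assumes NumPy, linear function approximators in place of neural networks (actor logits = s @ W, critic V(s) = s @ w), and Monte Carlo returns as the critic's regression target; many actor-critic methods use bootstrapped temporal-difference targets instead. The function name, learning rates, and episode data are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def actor_critic_update(W, w, states, actions, returns, actor_lr=0.05, critic_lr=0.1):
    """One pass over an episode, using the critic's value estimate as the baseline.

    W: actor weights, shape (state_dim, n_actions), logits = s @ W.
    w: critic weights, shape (state_dim,), V(s) = s @ w.
    """
    W, w = W.copy(), w.copy()
    for s, a, G in zip(states, actions, returns):
        advantage = G - s @ w  # G_t - V(s_t)
        dlogits = -softmax(s @ W)
        dlogits[a] += 1.0      # grad of log pi(a|s) w.r.t. logits
        W += actor_lr * advantage * np.outer(s, dlogits)  # actor: gradient ascent
        w += critic_lr * advantage * s                    # critic: descend (G - V)^2 / 2
    return W, w

# Hypothetical two-step episode with precomputed returns.
states = np.array([[1.0, 0.0], [0.0, 1.0]])
actions = [0, 1]
returns = [2.0, 1.0]
W_new, w_new = actor_critic_update(np.zeros((2, 2)), np.zeros(2), states, actions, returns)
```

As the critic's estimates approach the observed returns, the advantages shrink toward zero and the actor's updates become correspondingly smaller and less noisy.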

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → One-Sided Limits → Continuity Definition → Limit Definition of the Derivative → Power Rule → Constant Multiple and Sum/Difference Rules → Product Rule → Chain Rule → Higher-Order Derivatives → Concavity and Inflection Points → Second Derivative Test → Curve Sketching → Optimization Problems → Critical Points of Multivariable Functions → Critical Points and Classification of Extrema → Second Partial Test for Local Extrema (Hessian) → The Hessian Matrix and Second Derivative Test → Unconstrained Optimization: Finding Extrema → Optimization in Multiple Variables → Introduction to Reinforcement Learning → Policy Gradient Methods → Policy Networks and Policy Gradients

Longest path: 75 steps · 538 total prerequisite topics

Prerequisites (2)

Leads To (0)

No topics depend on this one yet.