Multilayer Perceptrons (MLPs)

Graduate · Depth 66 in the knowledge graph
Unlocks 62 downstream topics
neural-networks deep-learning supervised-learning universal-approximation

Core Idea

Multilayer perceptrons stack fully connected layers with nonlinear activations (ReLU, tanh, sigmoid) to learn complex nonlinear functions. The universal approximation theorem guarantees that an MLP with one sufficiently wide hidden layer can approximate any continuous function on a closed, bounded domain to arbitrary accuracy. In practice, deep networks tend to learn hierarchical features and can often represent the same function with fewer parameters than a shallow network.

How It's Best Learned

Train MLPs on XOR and other nonlinear problems to understand why hidden layers are necessary, then observe how depth affects learning efficiency.

Explainer

From your study of basic neural networks and backpropagation, you know that a single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. A single layer of such neurons can only learn linear decision boundaries — it literally draws straight lines (or hyperplanes) through the input space. The XOR problem is the classic demonstration of this limitation: no single straight line can separate the inputs (0,0) and (1,1) from (0,1) and (1,0). A multilayer perceptron solves this by stacking layers, where the output of one layer becomes the input to the next.
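As a minimal sketch of how one hidden layer resolves XOR, the NumPy snippet below wires up a 2-2-1 network with ReLU hidden units. The weights are hand-picked for illustration rather than learned, and the names `W1`, `b1`, and `w2` are assumptions of this example, not part of the original text.

```python
import numpy as np

def relu(z):
    # ReLU: zero for negative inputs, identity for positive ones
    return np.maximum(z, 0.0)

# The four XOR inputs (one per row) and their targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_true = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked (not learned) weights for a 2-2-1 network
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # both hidden units compute x1 + x2
b1 = np.array([0.0, -1.0])    # the second unit is active only when both inputs are 1
w2 = np.array([1.0, -2.0])    # output = h1 - 2 * h2

hidden = relu(X @ W1.T + b1)  # hidden-layer activations, shape (4, 2)
y_pred = hidden @ w2          # network output, shape (4,)

print(y_pred)
```

The second hidden unit switches on only for the input (1,1) and cancels the contribution of the first, which is the "bend" in the decision boundary that a single linear layer cannot produce.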

The key insight is what happens in the hidden layers — the layers between input and output. Each neuron in a hidden layer applies a nonlinear activation function (such as ReLU, which outputs zero for negative inputs and the input itself for positive ones, or sigmoid, which squashes values to the range 0–1). Without nonlinearity, stacking layers would be pointless: a composition of linear functions is still linear, so ten layers would have no more representational power than one. The nonlinearity allows each layer to carve the input space into increasingly complex regions. The first hidden layer might learn simple features (edges in an image, individual word patterns in text), and subsequent layers combine those features into higher-level abstractions (shapes, phrases, objects).
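The claim that stacked linear layers collapse into one can be checked numerically. The sketch below assumes arbitrary layer sizes (3 inputs, 5 hidden units, 2 outputs) and random weights; none of these values come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with arbitrary (illustrative) sizes: 3 -> 5 -> 2
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers with no activation in between
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))

# Inserting ReLU between the layers breaks this equivalence in general,
# because the hidden layer is no longer a linear function of x.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
print(np.allclose(with_relu, one_layer))
```

Algebraically, W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), so the two-layer linear network is exactly one linear layer with weight matrix W2 W1 and bias W2 b1 + b2.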

The universal approximation theorem guarantees that an MLP with even a single hidden layer can approximate any continuous function on a closed, bounded region of the input space to arbitrary precision, provided the layer contains enough neurons and the activation is a suitable nonlinearity (any non-polynomial activation, such as sigmoid, tanh, or ReLU, qualifies). This may seem to make depth unnecessary, but "enough neurons" can mean an astronomically large number, and the theorem says nothing about whether training will actually find the right weights. In practice, deep networks (those with multiple hidden layers) can often represent the same functions with far fewer total parameters because they compose simple features hierarchically. Think of it like building with LEGO: you could theoretically construct any shape from a single layer of tiny bricks laid flat, but it is vastly more efficient to stack layers and build upward. Each layer of a deep MLP reuses features learned by the previous layer rather than learning everything from scratch.
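To make "total parameters" concrete, the helper below counts weights and biases for a fully connected network. The function name `count_parameters` and the layer widths are hypothetical choices for illustration; the comparison shows only how the counts scale, not that the two networks would fit a given function equally well.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    Each layer contributes n_in * n_out weights and n_out biases.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Illustrative widths only: one wide hidden layer vs. two narrow ones
shallow = count_parameters([10, 1000, 1])
deep = count_parameters([10, 32, 32, 1])

print(shallow, deep)
```

With these widths the shallow network has 12001 parameters and the deep one has 1441. Whether the smaller network reaches comparable accuracy depends on the target function and has to be checked empirically.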

Training an MLP means using backpropagation to compute how much each weight contributed to the error, then adjusting the weights via gradient descent. The matrix multiplication you know from linear algebra is central here: the forward pass through each layer is a matrix-vector product (weights times inputs) plus a bias vector, followed by the activation function, and the backward pass propagates gradients through the transposes of those same weight matrices. The architecture choices (how many hidden layers, how many neurons per layer, which activation function) determine the network's capacity and training dynamics. With too few neurons the network underfits; with too many it may overfit. Adding depth brings its own difficulty: gradients can shrink as they pass backward through many layers, especially with saturating activations such as sigmoid and tanh, which is the vanishing gradient problem you will encounter next.
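The forward and backward passes can be sketched in a few lines of NumPy by training a small MLP on XOR. Everything specific here is an assumption made for illustration: 8 tanh hidden units, a sigmoid output with cross-entropy loss, a learning rate of 0.5, 5000 full-batch steps, and a fixed random seed. Inputs are stored as rows, so the products appear as `X @ W1` and the transposes as `W2.T` and `X.T`.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # shape (4, 2)
y = np.array([[0], [1], [1], [0]], dtype=float)              # shape (4, 1)

# Illustrative hyperparameters, not prescribed by the text
n_hidden, lr, steps, eps = 8, 0.5, 5000, 1e-12

W1 = rng.normal(size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(steps):
    # Forward pass: matrix product, plus bias, then activation
    h = np.tanh(X @ W1 + b1)       # hidden activations, shape (4, n_hidden)
    p = sigmoid(h @ W2 + b2)       # predicted probabilities, shape (4, 1)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    losses.append(loss)

    # Backward pass: chain rule, layer by layer
    dz2 = (p - y) / len(X)         # gradient of mean cross-entropy w.r.t. output pre-activation
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0)
    dh = dz2 @ W2.T                # gradient flows back through the transpose of W2
    dz1 = dh * (1.0 - h ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

predictions = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2).round().ravel()
print(losses[0], losses[-1])
print(predictions)
```

With settings like these the loss should fall well below its starting value and the rounded predictions usually match the XOR targets, but convergence depends on the initialization, so it is worth trying several seeds and hidden-layer sizes.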

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → Step Functions → Composition of Functions → Inverse Functions → Radical Functions and Graphs → Rational Exponents → Exponential Functions and Graphs → Geometric Sequences and Series → Sigma Notation → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs)

Longest path: 67 steps · 407 total prerequisite topics
