A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Multilayer Perceptrons (MLPs)

Graduate · Depth 93 in the knowledge graph

72 topics build on this

643 prerequisites beneath it

Core Idea

Multilayer perceptrons stack fully connected layers with nonlinear activations (ReLU, tanh, sigmoid) to learn complex nonlinear functions. The universal approximation theorem guarantees that an MLP with one sufficiently wide hidden layer can approximate any continuous function on a bounded (compact) input domain, but deep networks can often represent hierarchical features more efficiently, with fewer parameters than a shallow network would need.

How It's Best Learned

Train MLPs on XOR and other nonlinear problems to understand why hidden layers are necessary, then observe how depth affects learning efficiency.

Explainer

From your study of basic neural networks and backpropagation, you know that a single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. A single layer of such neurons can only learn linear decision boundaries — it literally draws straight lines (or hyperplanes) through the input space. The XOR problem is the classic demonstration of this limitation: no single straight line can separate the inputs (0,0) and (1,1) from (0,1) and (1,0). A multilayer perceptron solves this by stacking layers, where the output of one layer becomes the input to the next.
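A minimal NumPy sketch can make the XOR limitation concrete. It first fits the best linear model by least squares, then evaluates a tiny 2-2-1 network. The weights `W1`, `b1`, `w2` and the helper `xor_mlp` are hand-picked for illustration and are not learned or taken from the text above.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear model: least squares on the inputs plus a bias column.
X_bias = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
linear_pred = X_bias @ coef  # the fit is flat: it cannot tell the classes apart

# Hand-picked (illustrative) weights for a 2-2-1 MLP with ReLU hidden units.
W1 = np.array([[1.0, 1.0],   # hidden unit 1 computes x1 + x2
               [1.0, 1.0]])  # hidden unit 2 computes x1 + x2 - 1 (via its bias)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])   # output = h1 - 2 * h2

def relu(z):
    return np.maximum(z, 0.0)

def xor_mlp(inputs):
    hidden = relu(inputs @ W1.T + b1)  # nonlinear hidden layer
    return hidden @ w2                 # linear output layer

mlp_pred = xor_mlp(X)
print("linear:", linear_pred)
print("mlp:   ", mlp_pred)
```

The linear fit gives the same value for all four points, while the two hidden units bend the input space enough for a linear output layer to reproduce the XOR targets.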

The key insight is what happens in the hidden layers — the layers between input and output. Each neuron in a hidden layer applies a nonlinear activation function (such as ReLU, which outputs zero for negative inputs and the input itself for positive ones, or sigmoid, which squashes values to the range 0–1). Without nonlinearity, stacking layers would be pointless: a composition of linear functions is still linear, so ten layers would have no more representational power than one. The nonlinearity allows each layer to carve the input space into increasingly complex regions. The first hidden layer might learn simple features (edges in an image, individual word patterns in text), and subsequent layers combine those features into higher-level abstractions (shapes, phrases, objects).
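The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses arbitrary random weights (the shapes and seed are assumptions chosen only for illustration) and also defines ReLU and sigmoid as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2: 4 units -> 2 outputs
x = rng.normal(size=3)

# Two linear layers applied one after the other, with no activation between them.
stacked = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
single = W @ x + b

def relu(z):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(np.allclose(stacked, single))         # the two computations agree
print(relu(1.0) + relu(-1.0), relu(0.0))    # ReLU is not additive, so it is not linear
```

Because ReLU does not satisfy f(a) + f(b) = f(a + b), inserting it between the layers prevents the collapse into a single matrix.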

The universal approximation theorem guarantees that an MLP with even a single hidden layer containing enough neurons can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary precision. It is an existence result: it does not say how many neurons are needed, or whether gradient descent will find the right weights. This might suggest that depth is unnecessary, but "enough neurons" can mean an astronomically large number. In practice, deep networks (those with multiple hidden layers) can often represent the same functions with far fewer total parameters because they compose simple features hierarchically. Think of it like building with LEGO: you could theoretically construct any shape from a single layer of tiny bricks laid flat, but it is vastly more efficient to stack layers and build upward. Each layer of a deep MLP reuses features learned by the previous layer rather than learning everything from scratch.
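For reference, one common way to state the single-hidden-layer result is sketched below. The symbols are notation introduced here, not taken from the text: K is a compact subset of R^n, f is continuous on K, σ is a continuous non-polynomial activation (sigmoid, tanh and ReLU all qualify), and N is the number of hidden neurons. For every ε > 0 there exist N and parameters α_i, w_i, b_i such that:

```latex
\[
\sup_{x \in K} \left| \, f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \right| < \varepsilon
\]
```

The statement places no bound on N, which is why the guarantee says nothing about how efficient a shallow network will be.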

Training an MLP means using backpropagation to compute how much each weight contributed to the error, then adjusting the weights via gradient descent. The matrix multiplication you know from linear algebra is central here: the forward pass through each layer is a matrix-vector product (weights times inputs) plus a bias, followed by the activation function, and the backward pass propagates gradients through the transpose of those same weight matrices. The architecture choices (how many hidden layers, how many neurons per layer, which activation function) determine the network's capacity and training dynamics. Too few neurons and the network underfits; too many and it may overfit. Adding layers brings its own difficulty: gradients can shrink as they are propagated backward through many layers, which is the vanishing gradient problem you will encounter next.
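The forward and backward passes described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the one-hidden-layer tanh architecture, the mean-squared-error loss, the hidden size, learning rate, step count and all function names are assumptions made for the example.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs, shape (4, 2)
Y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets, shape (4, 1)

def init_params(n_hidden=4, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.5, size=(n_hidden, 2)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.5, size=(1, n_hidden)),
        "b2": np.zeros(1),
    }

def forward(p, X):
    H = np.tanh(X @ p["W1"].T + p["b1"])  # hidden activations, shape (4, n_hidden)
    out = H @ p["W2"].T + p["b2"]         # linear output, shape (4, 1)
    return H, out

def loss_fn(p, X, Y):
    _, out = forward(p, X)
    return 0.5 * np.mean((out - Y) ** 2)

def backward(p, X, Y):
    H, out = forward(p, X)
    d_out = (out - Y) / X.shape[0]        # dL/d(out)
    d_H = d_out @ p["W2"]                 # back through W2 (W2^T in column-vector form)
    d_Z1 = d_H * (1.0 - H ** 2)           # tanh'(z) = 1 - tanh(z)^2
    return {
        "W1": d_Z1.T @ X, "b1": d_Z1.sum(axis=0),
        "W2": d_out.T @ H, "b2": d_out.sum(axis=0),
    }

def train(p, X, Y, lr=0.1, steps=5000):
    for _ in range(steps):
        grads = backward(p, X, Y)
        for name in p:
            p[name] -= lr * grads[name]   # plain full-batch gradient descent step
    return p

params = init_params()
initial_loss = loss_fn(params, X, Y)
params = train(params, X, Y)
final_loss = loss_fn(params, X, Y)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
print(forward(params, X)[1].ravel())      # predictions for the four inputs
```

How closely the predictions match the XOR targets depends on the initialization and hyperparameters, so it is worth varying `n_hidden`, `seed` and `lr` to see when training succeeds or stalls.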

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs)

Longest path: 94 steps · 643 total prerequisite topics

Prerequisites (4)

Neural Network Fundamentals (hard) · Backpropagation Algorithm (hard) · Matrix Multiplication (soft) · Vectors in R^n (soft)

Leads To (2)

Activation Functions in Neural Networks (hard) · Vanishing Gradient Problem (hard)