Activation Functions in Neural Networks

Graduate · Depth 67 in the knowledge graph
Unlocks 20 downstream topics
activation nonlinearity neural-networks

Core Idea

Activation functions introduce nonlinearity into neural networks, enabling them to learn complex patterns beyond linear transformations. ReLU dominates hidden layers in modern networks because it is cheap to compute and less prone to vanishing gradients. Sigmoid and tanh are historically important. The output-layer activation depends on the task: softmax for multi-class classification, sigmoid for binary classification, and a linear (identity) output for regression.

Explainer

From your study of multilayer perceptrons, you know that a neural network is built from layers of neurons, each computing a weighted sum of its inputs plus a bias. Without activation functions, stacking layers would be pointless: a composition of linear (strictly, affine) transformations is just another transformation of the same kind. No matter how many layers you add, the network could only learn linear decision boundaries. The nonlinear activation function applied after each neuron's weighted sum is what breaks this linearity and gives deep networks their power to approximate arbitrarily complex functions.
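The collapse described above can be checked numerically. The following is a minimal pure-Python sketch, not a reference implementation: the weights, biases, input, and helper names (`matvec`, `matmul`, `affine`) are made-up illustrative choices. It compares two stacked layers without an activation against the single equivalent layer with W = W2 W1 and b = W2 b1 + b2, and then shows that inserting ReLU between the layers changes the result.

```python
# Pure-Python sketch: two stacked affine layers collapse into one.
# All weights, biases, and inputs are made-up illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def affine(W, b, x):
    """Compute W x + b."""
    return [y + bi for y, bi in zip(matvec(W, x), b)]

def relu(v):
    """Apply max(0, x) to every element of a vector."""
    return [max(0.0, vi) for vi in v]

W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, -1.0], [1.0, 3.0]], [1.0, 0.0]
x = [3.0, -2.0]

# Two layers with no activation in between.
stacked = affine(W2, b2, affine(W1, b1, x))

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)
collapsed = affine(W, b, x)

# The same two layers with ReLU applied between them.
with_relu = affine(W2, b2, relu(affine(W1, b1, x)))

print("stacked:  ", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```

The first two results agree for any input, while the ReLU version differs whenever a hidden pre-activation is negative, as it is for this input.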

The sigmoid function σ(x) = 1/(1 + e^(−x)) was the original workhorse activation. It squashes any input to the range (0, 1), which has a natural probabilistic interpretation, and it is smooth and differentiable everywhere. The closely related tanh function maps inputs to (−1, 1), centering outputs around zero, which often helps training converge faster. However, both functions saturate: for large positive or negative inputs, the derivative approaches zero, and even at its peak (x = 0) the sigmoid derivative is only 0.25. During backpropagation, the gradient is multiplied by one such derivative per layer, so the gradient signal shrinks rapidly with depth. This is the vanishing gradient problem. It makes deep networks with sigmoid or tanh very difficult to train, because early layers receive almost no learning signal.
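To make the saturation concrete, here is a small sketch using only Python's standard `math` module. It prints the sigmoid and tanh derivatives at a few sample inputs and then multiplies the largest possible sigmoid derivative across a hypothetical 10-layer stack. The depth of 10 and the sample inputs are arbitrary assumptions, and the calculation deliberately ignores the weight matrices, which also scale gradients in a real network.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Derivatives shrink toward zero as the input moves away from 0.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")

# Best case for sigmoid: every layer sits at x = 0, where the
# derivative peaks at 0.25. Weights are ignored (an assumption).
depth = 10
best_case = sigmoid_grad(0.0) ** depth
print(f"best-case sigmoid factor over {depth} layers: {best_case:.2e}")
```

Even in this best case, the activation derivatives alone scale the gradient by less than one millionth after 10 layers.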

The Rectified Linear Unit (ReLU), defined as f(x) = max(0, x), largely avoids this problem with elegant simplicity. For positive inputs, the derivative is exactly 1, so the activation itself does not shrink the gradient however deep the network is (the weights can still scale it). For negative inputs, the output and derivative are both 0, which creates sparsity (many neurons output zero at any given time) and reduces computation. ReLU's combination of computational cheapness, gradient-friendly behavior, and empirical effectiveness made it the default choice for hidden layers in modern deep learning. Its main weakness is the dying ReLU problem: if a neuron's weights drift so that its input is negative for every training example, it outputs zero for all inputs, receives zero gradient, and so cannot recover through gradient descent. Variants like Leaky ReLU (which allows a small slope for negative inputs instead of zero) and ELU (which uses a smooth exponential curve for negative inputs) address this.
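The definitions above translate almost directly into code. This is a minimal sketch in plain Python, with some assumptions: the negative slope `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are common illustrative defaults rather than values taken from this page, the derivative at exactly 0 is set to the negative-side value by convention, and the list of pre-activations for the "dead" neuron is made up.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative at x == 0 is taken as 0 by convention.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed illustrative slope for negative inputs.
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs, bounded below by -alpha.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# A "dead" neuron: its pre-activation is negative for every example.
pre_activations = [-3.0, -0.5, -1.2]

# Total local gradient passing through the activation across the examples.
relu_signal = sum(relu_grad(z) for z in pre_activations)
leaky_signal = sum(leaky_relu_grad(z) for z in pre_activations)

print("ReLU gradient signal:      ", relu_signal)
print("Leaky ReLU gradient signal:", leaky_signal)
```

With plain ReLU the neuron receives no gradient at all, so its weights never update. The leaky variant keeps a small nonzero signal that gives the weights a chance to move back toward the active region.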

Choosing the right activation for the output layer is a separate decision driven by the task, not by gradient flow. For binary classification, a sigmoid output gives a probability between 0 and 1. For multi-class classification, softmax converts a vector of raw scores into a probability distribution that sums to 1. For regression, a linear (identity) activation is standard because the output should be an unconstrained real number. Getting the output activation wrong can fail silently. For example, using ReLU for regression where targets can be negative clips every negative prediction to zero and degrades performance without raising any error, which makes it an easy mistake for beginners to miss.
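The three output choices, and the ReLU clipping pitfall, can be sketched in a few lines of plain Python. The raw scores and predictions below are made-up values, and subtracting the maximum score inside `softmax` is a standard numerical-stability step added here, not something the text above prescribes.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def identity(x):
    return x

def relu(x):
    return max(0.0, x)

# Multi-class classification: one probability per class.
class_probs = softmax([2.0, 1.0, 0.1])

# Binary classification: a single probability.
binary_prob = sigmoid(0.8)

# Regression: raw network outputs, some of them negative.
raw_predictions = [-1.5, 0.3, 2.0]
linear_out = [identity(p) for p in raw_predictions]
clipped_out = [relu(p) for p in raw_predictions]

print("softmax:       ", class_probs, "sum =", sum(class_probs))
print("sigmoid:       ", binary_prob)
print("identity output:", linear_out)
print("ReLU output:    ", clipped_out)
```

The ReLU output replaces the negative prediction with 0.0 and raises no error, which is the silent clipping described above.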

Practice Questions 5 questions

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → Step Functions → Composition of Functions → Inverse Functions → Radical Functions and Graphs → Rational Exponents → Exponential Functions and Graphs → Geometric Sequences and Series → Sigma Notation → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks

Longest path: 68 steps · 408 total prerequisite topics

Prerequisites (5)

Leads To (2)