A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Activation Functions in Neural Networks

Graduate · Depth 94 in the knowledge graph

71 topics build on this

644 prerequisites beneath it


Core Idea

Activation functions introduce nonlinearity into neural networks, enabling them to learn patterns that no purely linear model can represent. ReLU is the usual default for hidden layers because it is cheap to compute and far less prone to vanishing gradients than saturating functions. Sigmoid and tanh are historically important but saturate for large inputs. The output-layer activation depends on the task: softmax for multi-class classification, sigmoid for binary classification, and identity for regression.

Explainer

From your study of multilayer perceptrons, you know that a neural network is built from layers of neurons, each computing a weighted sum of its inputs plus a bias. Without activation functions, stacking layers would be pointless: a composition of linear (strictly, affine) transformations is just another transformation of the same kind. No matter how many layers you add, the network could only learn linear decision boundaries. The activation function applied after each neuron's weighted sum is what breaks this linearity and gives deep networks their power to approximate arbitrarily complex functions.
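The collapse of stacked linear layers can be checked numerically. The sketch below uses only the Python standard library; the weights, biases, and helper names (`matvec`, `matmul`, `add`) are illustrative assumptions, not part of any particular framework.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Apply max(0, x) to each element."""
    return [max(0.0, a) for a in v]

# Hypothetical weights and biases for a two-layer network (illustrative values).
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, -1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 3.0]], [1.0, 0.0]

def two_linear_layers(x):
    """Layer 2 applied to layer 1, with no activation in between."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map written as ONE layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

def two_layers_with_relu(x):
    """Same weights, but with ReLU between the layers."""
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

x = [1.0, 2.0]
print(two_linear_layers(x))     # [-3.5, 7.5]
print(one_layer(x))             # [-3.5, 7.5], identical to the stacked version
print(two_layers_with_relu(x))  # [2.5, 4.5], no single layer reproduces this map
```

With the activation removed, the second layer adds no expressive power; inserting ReLU between the layers is what makes the composite map nonlinear.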

The sigmoid function σ(x) = 1/(1 + e^(−x)) was the original workhorse activation. It squashes any input to the range (0, 1), which has a natural probabilistic interpretation, and it is smooth and differentiable everywhere. The closely related tanh function maps inputs to (−1, 1), centering outputs around zero, which often helps training converge faster. However, both functions saturate: for large positive or negative inputs, the derivative approaches zero. Even at its peak, the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) is only 0.25, reached at x = 0. During backpropagation, the chain rule multiplies these derivatives across layers, so the gradient signal shrinks roughly geometrically with depth. This is the vanishing gradient problem. It makes deep networks with sigmoid or tanh very difficult to train, because early layers receive almost no learning signal.
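To make the shrinkage concrete, the following sketch multiplies activation derivatives along a single path through several layers. It deliberately ignores the weight terms that also appear in the real backpropagated gradient, so treat it as an illustration of the activation's contribution only. The helper `chained_activation_grad` is a hypothetical name introduced here.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    t = math.tanh(x)
    return 1.0 - t * t

def chained_activation_grad(grad_fn, pre_activations):
    """Product of activation derivatives along one path (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

# Best case for sigmoid: every pre-activation sits exactly at 0.
print(chained_activation_grad(sigmoid_grad, [0.0] * 10))  # 0.25 ** 10, about 9.5e-07

# Saturated units are far worse: one unit at x = 10 contributes a factor below 1e-4.
print(sigmoid_grad(10.0))
```

Even in the most favourable case, ten sigmoid layers scale the gradient by roughly one millionth, which is why the earliest layers barely update.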

The Rectified Linear Unit (ReLU), defined as f(x) = max(0, x), largely sidesteps this problem with elegant simplicity. For positive inputs, the derivative is exactly 1, so the activation itself never shrinks the gradient, however deep the network (the weight matrices can still scale it). For negative inputs, the output and derivative are both 0, which creates sparsity (many neurons output zero at any given time) and reduces computation. ReLU's combination of computational cheapness, gradient-friendly behavior, and empirical effectiveness made it the default choice for hidden layers in modern deep learning. Its main weakness is the dying ReLU problem: if a neuron's weights drift so that its pre-activation is negative for every training input, it outputs zero, receives zero gradient, and therefore typically cannot recover. Variants like Leaky ReLU (which uses a small nonzero slope for negative inputs) and ELU (which decays smoothly toward a negative constant) address this.
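A minimal sketch of ReLU and the two variants named above, with their derivatives, followed by a toy "dead neuron". The default slopes (`alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU) are common choices assumed here, and the single-weight neuron with its values for `w`, `b`, and `inputs` is hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # The derivative is undefined at exactly 0; using 0 there is a common convention.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu(x, alpha=1.0):
    # Smoothly approaches -alpha as x goes to negative infinity.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def weight_grad(grad_fn, w, b, x, upstream=1.0):
    """Gradient of the loss w.r.t. w for one neuron z = w*x + b, given an upstream gradient."""
    z = w * x + b
    return upstream * grad_fn(z) * x

# Hypothetical neuron whose bias has drifted so far negative that z < 0 for every input.
w, b = 0.5, -3.0
inputs = [0.1, 0.5, 1.0]

print([weight_grad(relu_grad, w, b, x) for x in inputs])        # all zeros: no update possible
print([weight_grad(leaky_relu_grad, w, b, x) for x in inputs])  # small but nonzero
print([weight_grad(elu_grad, w, b, x) for x in inputs])         # small but nonzero
```

With plain ReLU the weight receives no gradient on any of these inputs, so gradient descent cannot move it back into the active region. The variants keep a small gradient alive on the negative side.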

Choosing the right activation for the output layer is a separate decision driven by the task, not by gradient flow. For binary classification, a sigmoid output gives a probability between 0 and 1. For multi-class classification, softmax converts a vector of raw scores into a probability distribution that sums to 1. For regression, a linear (identity) activation is standard because the output should be an unconstrained real number. Getting the output activation wrong (say, using ReLU for regression where targets can be negative) silently clips your predictions at zero and degrades performance without raising any error, which makes it an easy mistake for beginners to miss.
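The output-layer choices above can be sketched in a few lines. This is an illustrative standard-library version, not any framework's implementation; the raw scores and the regression value are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def identity(x):
    """Linear output for regression: the raw value passes through unchanged."""
    return x

def relu(x):
    return max(0.0, x)

# Multi-class classification: probabilities that sum to 1.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Binary classification: a single probability in (0, 1).
print(sigmoid(0.8))

# Regression with a negative raw output: identity keeps it, ReLU clips it to zero.
raw_output = -2.5
print(identity(raw_output))  # -2.5
print(relu(raw_output))      # 0.0
```

The last two lines show the silent failure described above: with a ReLU output the network can never predict a negative target, yet nothing in the code raises an error.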


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks

Longest path: 95 steps · 644 total prerequisite topics

Prerequisites (5)

Neural Network Fundamentals (hard) · Multilayer Perceptrons (MLPs) (hard) · Derivatives of Exponential Functions (soft) · Exponential Functions and Graphs (soft) · Chain Rule (soft)

Leads To (3)

Convolutional Neural Networks (soft) · Recurrent Neural Networks (soft) · Vanishing Gradient Problem (hard)