The expected value E[X] = Σ x · p(x) is the long-run average value of a discrete random variable, representing its center. Variance Var(X) = E[(X - E[X])²] measures the spread of the distribution around its mean. Standard deviation σ = √Var(X) is that spread expressed in the original units of X. Together these quantities summarize where a distribution is located and how widely it is spread.
Compute expected value and variance for simple distributions (fair die, coin flip). Verify that variance increases when probability mass spreads away from the mean.
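Assuming E[X] is always the most likely value. Interpreting variance as if it were standard deviation (variance is in squared units; standard deviation is in the units of X). Forgetting that E[aX + b] = aE[X] + b but Var(aX + b) = a²Var(X): the constant b drops out of the variance and a is squared.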
The expected value E[X] is the mathematical formalization of "long-run average." If you roll a fair die thousands of times and track the running average, that average will converge toward 3.5 — even though 3.5 is never actually rolled. The formula E[X] = Σ x · p(x) weights each possible outcome by its probability and sums the products. Geometrically, the expected value is the balance point, or center of mass, of the probability distribution: if you placed physical weights proportional to each probability on a number line, the distribution would balance at E[X].
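The die example can be checked numerically. The following is a minimal sketch rather than a prescribed method: the helper name `expected_value`, the seed, and the number of rolls are illustrative choices, not part of the original material. It computes E[X] exactly from the formula and then compares it with the average of simulated rolls.

```python
import random
from fractions import Fraction

def expected_value(pmf):
    """Return E[X] = sum of x * p(x) for a pmf given as {outcome: probability}."""
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(die)  # exactly 7/2

# Simulate many rolls and compare the sample average with E[X].
rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

print(f"E[X] from the formula: {float(die_mean)}")
print(f"average of {len(rolls)} simulated rolls: {sample_mean:.3f}")
```

The simulated average should land close to 3.5, though the exact digits depend on the seed and the number of rolls.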
A critical misconception: the expected value is not the most likely value. The most likely value is the mode. For symmetric distributions with a single peak the two coincide, but symmetry alone is not enough (a fair die is symmetric, yet its mean of 3.5 is not even a possible outcome), and for skewed distributions they can be far apart. If X takes value 0 with probability 0.9 and value 100 with probability 0.1, then E[X] = 0 · 0.9 + 100 · 0.1 = 10, yet the most common outcome is 0. Income distributions are a real-world example: average income is pulled upward by high earners, while median and modal income are typically lower. The expected value can even be a value that X can never take (3.5 for a die; 10 in the example above).
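The two-outcome example above is small enough to verify directly. This sketch assumes the distribution is stored as a dictionary mapping outcomes to probabilities (an illustrative representation); it computes the mean and the mode and checks whether the mean is itself a possible outcome.

```python
from fractions import Fraction

# Skewed distribution from the text: 0 with probability 0.9, 100 with probability 0.1.
pmf = {0: Fraction(9, 10), 100: Fraction(1, 10)}

mean = sum(x * p for x, p in pmf.items())  # 0 * 0.9 + 100 * 0.1
mode = max(pmf, key=pmf.get)               # outcome with the largest probability

print(f"E[X] = {mean}, mode = {mode}")
print(f"Is E[X] a possible outcome? {mean in pmf}")
```

Exact fractions are used here so that the comparison is not affected by floating-point rounding.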
Variance Var(X) = E[(X − E[X])²] measures spread. It asks: on average, how far does X deviate from its mean, in squared terms? Squaring the deviation serves two purposes: it makes every deviation nonnegative (so negative and positive deviations don't cancel), and it penalizes large deviations more heavily than small ones. The price is that variance is expressed in squared units of X. The standard deviation σ = √Var(X) brings the units back in line with X, making it more interpretable as a "typical distance from the mean."
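To make the definition concrete, the sketch below computes variance and standard deviation for a fair die and a fair coin, then compares the die with a distribution that has the same mean but all of its mass at the extremes. The helper names `mean` and `variance` and the `extreme` distribution are illustrative assumptions, not taken from the original text.

```python
import math
from fractions import Fraction

def mean(pmf):
    """E[X] for a pmf given as {outcome: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2], computed directly from the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # tails = 0, heads = 1

die_var = variance(die)    # 35/12
coin_var = variance(coin)  # 1/4

# Same mean as the die (3.5), but all mass moved out to the faces 1 and 6.
extreme = {1: Fraction(1, 2), 6: Fraction(1, 2)}
extreme_var = variance(extreme)  # 25/4

for name, var in [("die", die_var), ("coin", coin_var), ("extreme", extreme_var)]:
    print(f"{name}: Var = {var}, sigma = {math.sqrt(var):.3f}")
```

The `extreme` distribution has the same center as the die but a larger variance, which illustrates how variance grows as probability mass moves away from the mean.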
The transformation rules for mean and variance follow directly from the definitions. For the mean, E[aX + b] = aE[X] + b: shifting every outcome by b shifts the average by b, and scaling by a scales the average by a. For variance, Var(aX + b) = a²Var(X). Adding a constant b moves every value and the mean by the same amount, so all deviations from the mean are unchanged and variance is unaffected. Multiplying by a scales every value and every deviation by a, so squared deviations scale by a². As a consequence, the standard deviation scales by |a|. This asymmetry, with E scaling linearly and Var scaling by the square, is a common source of errors and is essential to remember.
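Both rules can be checked on a small example. The sketch below applies Y = aX + b to a fair die, with a = -2 and b = 5 chosen arbitrarily for illustration; the helper `affine` is a hypothetical name and assumes a is nonzero so that distinct outcomes stay distinct.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a pmf given as {outcome: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def affine(pmf, a, b):
    """pmf of Y = a*X + b (assumes a != 0, so outcomes do not collide)."""
    return {a * x + b: p for x, p in pmf.items()}

die = {face: Fraction(1, 6) for face in range(1, 7)}
a, b = -2, 5  # arbitrary illustrative constants
y = affine(die, a, b)

print(f"E[Y] = {mean(y)},  a*E[X] + b = {a * mean(die) + b}")
print(f"Var(Y) = {variance(y)},  a^2 * Var(X) = {a ** 2 * variance(die)}")
```

A negative a is used deliberately: the mean changes sign along with a, while the variance stays nonnegative because a is squared.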
Mean and variance do not fully characterize a distribution (you need the full distribution for that: the probability mass function or density), but they capture its two most important features: where it is centered and how spread out it is. Much of statistical inference builds on them. When you study the normal distribution, the binomial, and eventually the central limit theorem, you will use E[X] and Var(X) constantly, both to characterize distributions directly and to describe how statistics computed from samples behave.