Covariance Cov(X,Y)=E[(X−μ_X)(Y−μ_Y)] measures linear association; it equals 0 when X and Y are independent, but zero covariance does not imply independence. Correlation ρ=Cov(X,Y)/(σ_X σ_Y) ∈ [−1,1] is unitless and scale-invariant. Zero correlation means no linear association.
From expected value theory, you know E[X] = ∫ x f(x) dx and you've computed expectations of functions of a single random variable. From joint distributions, you know how to describe the behavior of two random variables together through their joint density or PMF, and how to recover marginal distributions. Covariance combines these ideas: it is the expected value of the product (X − μ_X)(Y − μ_Y), which measures whether X and Y tend to deviate from their means in the same direction at the same time.
The intuition is concrete. If X tends to be above its mean when Y is above its mean — and below when Y is below — then the product (X − μ_X)(Y − μ_Y) is typically positive, and Cov(X, Y) > 0. This is positive covariance: height and weight, for instance. If X tends to be high when Y is low (like temperature and heating bills), the product is typically negative, giving Cov(X, Y) < 0. If X and Y have no systematic linear relationship, the positive and negative products cancel out and Cov(X, Y) ≈ 0. The computational shortcut Cov(X, Y) = E[XY] − E[X]E[Y] follows directly from expanding the definition and using linearity of expectation.
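As a quick numerical check of the shortcut formula, here is a minimal sketch that computes the covariance both ways on a small sample. The height and weight values are made up for illustration, the helper names are hypothetical, and the averages are taken over the sample (dividing by n) as a stand-in for the expectations in the text.

```python
def mean(values):
    """Arithmetic mean, standing in for the expectation E[.]."""
    return sum(values) / len(values)

def cov_definition(xs, ys):
    """Covariance from the definition: mean of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def cov_shortcut(xs, ys):
    """Covariance from the shortcut: E[XY] - E[X]E[Y]."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

# Made-up illustrative data (not from any real study).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 90.0]

cov_a = cov_definition(heights_cm, weights_kg)
cov_b = cov_shortcut(heights_cm, weights_kg)
print(cov_a, cov_b)  # both close to 190, up to floating-point rounding
```

The positive value reflects that taller entries in this toy sample are also heavier, so the deviation products are mostly positive.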
The problem with raw covariance as a measure of association is that it depends on scale. If you measure X in centimeters instead of meters, the covariance is multiplied by 100, even though the relationship between X and Y has not changed. This makes covariances hard to compare across different pairs of variables. Correlation ρ = Cov(X, Y) / (σ_X σ_Y) fixes this by normalizing: dividing by the product of standard deviations removes the units and the dependence on scale. The result always lies in [−1, 1], a consequence of the Cauchy-Schwarz inequality applied to the centered variables X − μ_X and Y − μ_Y under the inner product ⟨U, V⟩ = E[UV] on the space of square-integrable random variables. The extreme values ρ = ±1 occur precisely when Y = aX + b almost surely for some constants a ≠ 0 and b, that is, a perfect linear relationship, with the sign of ρ matching the sign of a.
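The effect of rescaling can be seen in a small sketch. It converts a made-up set of heights from meters to centimeters and compares covariance and correlation before and after. The data and helper names are assumptions for illustration only.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population-style covariance (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def correlation(xs, ys):
    """Covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up illustrative data.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [52.0, 58.0, 69.0, 77.0, 90.0]
heights_cm = [100.0 * h for h in heights_m]  # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
rho_m = correlation(heights_m, weights_kg)
rho_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)   # the second is 100 times the first
print(rho_m, rho_cm)   # unchanged by the change of unit
```

Only the unit of X changed, yet the covariance changed by a factor of 100 while the correlation stayed the same, which is the reason to prefer ρ when comparing strength of association.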
The most important conceptual trap is the independence–correlation relationship. If X and Y are independent, then E[XY] = E[X]E[Y] (from the joint distribution factoring), so Cov(X, Y) = 0 and ρ = 0. But the converse fails: zero correlation does not imply independence. The classic example is X ~ Uniform(−1, 1) and Y = X². Then E[XY] = E[X³] = 0 (by symmetry of X³ around zero), and E[X]E[Y] = 0, so Cov(X, Y) = 0. Yet X and Y are completely dependent — knowing X determines Y exactly. The correlation captures only linear dependence; the full dependency structure requires the joint distribution.
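The Uniform(−1, 1) example can be checked by simulation. The sketch below is a Monte Carlo approximation rather than an exact calculation. The sample size, the seed, and the 0.5 threshold used to expose the dependence are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 200_000              # arbitrary sample size

xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y = X^2 is fully determined by X

def mean(values):
    return sum(values) / len(values)

# Sample version of Cov(X, Y) = E[XY] - E[X]E[Y]; should be near 0.
cov_xy = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

# Dependence shows up once we condition on X: the average of Y is very
# different when |X| is large versus when |X| is small.
mean_far = mean([y for x, y in zip(xs, ys) if abs(x) > 0.5])
mean_near = mean([y for x, y in zip(xs, ys) if abs(x) <= 0.5])

print(cov_xy)               # close to 0
print(mean_far, mean_near)  # clearly different from each other
```

The near-zero covariance together with the clearly different conditional averages illustrates that the variables are uncorrelated yet dependent.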
Covariance and correlation underpin much of what builds on joint distributions. In simple linear regression, the slope of Y on X is β = Cov(X, Y)/Var(X), and R² equals ρ², so the squared correlation coefficient measures the fraction of variance in Y explained by a linear function of X. In the bivariate normal distribution, which you'll see next, ρ is the single parameter characterizing the dependence between the two jointly normal components. More broadly, the covariance matrix Σ with entries Σᵢⱼ = Cov(Xᵢ, Xⱼ) is the central object describing the geometry of multivariate distributions, and core techniques in multivariate statistics, such as PCA, factor analysis, and the multivariate normal, are built on manipulating it.
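The regression identities can be checked on a small sample. The sketch below assumes simple linear regression with an intercept, fits the line using β = Cov(X, Y)/Var(X), computes R² from the residuals, and compares it with ρ². The data points are made up for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Made-up illustrative data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares line: slope = Cov(X, Y) / Var(X), intercept from the means.
slope = covariance(xs, ys) / covariance(xs, xs)
intercept = mean(ys) - slope * mean(xs)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
fitted = [intercept + slope * x for x in xs]
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ss_tot = sum((y - mean(ys)) ** 2 for y in ys)
r_squared = 1.0 - ss_res / ss_tot

# Correlation coefficient from the same covariances.
rho = covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

print(slope)               # close to 1.99 for this toy data
print(r_squared, rho ** 2)  # agree up to floating-point rounding
```

R² is computed here from the residuals rather than from ρ, so the agreement between the two printed values is a genuine check of the identity rather than a restatement of it.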