The Pearson correlation coefficient r measures linear association between two variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear association. Defined as r = Cov(X,Y)/(σ_X × σ_Y), correlation is unitless and symmetric in X and Y. A correlation near 0 doesn't mean no relationship—it indicates no linear relationship; nonlinear associations may be strong but have correlation near 0.
Compute r for various datasets and compare each value to the corresponding scatterplot. Generate data with specified correlations. Show examples where r = 0 but a strong relationship exists.
Common misconceptions: thinking r = 0 implies independence or no association; confusing correlation with causation; believing |r| > 0.5 always indicates a strong relationship (what counts as strong depends on context).
When you examined scatterplots, you developed an intuitive sense for association: points that rise together suggest a positive relationship; points where one variable falls as the other rises suggest a negative one; a shapeless cloud suggests none. The Pearson correlation coefficient r turns that intuition into a single number. It measures how closely the data points cluster around a straight line, ranging from −1 (a perfect downward line) through 0 (no linear trend) to +1 (a perfect upward line).
The formula is r = Cov(X, Y) / (σ_X · σ_Y), where Cov(X, Y) is the covariance — roughly, how much X and Y vary together — and σ_X, σ_Y are the standard deviations of each variable. Dividing by the standard deviations standardizes the result, which is why r is unitless and always falls in [−1, 1]. You can swap X and Y without changing r (it is symmetric), and multiplying either variable by a positive constant leaves r unchanged. These properties make r a clean, interpretable summary of linear association.
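The formula and the properties just listed can be checked directly. This is a minimal sketch, not a library-grade implementation: `pearson_r` is a hypothetical helper written for illustration, and the five data points are made up so the arithmetic is easy to follow by hand.

```python
import math

def pearson_r(xs, ys):
    """Compute r = Cov(X, Y) / (sd_X * sd_Y) straight from the definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Made-up toy data, chosen only to keep the numbers simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                    # symmetry: swap X and Y
r_rescaled = pearson_r([100 * x for x in xs], ys)  # e.g. metres to centimetres
r_flipped = pearson_r([-x for x in xs], ys)      # negative constant

print(f"r          = {r:.3f}")
print(f"swapped    = {r_swapped:.3f}")
print(f"rescaled   = {r_rescaled:.3f}")
print(f"sign-flip  = {r_flipped:.3f}")
```

For this data the covariance is 1.6 and both standard deviations are √2, so r = 1.6 / 2 = 0.8. Swapping or rescaling leaves that value alone, while multiplying by a negative constant reverses the sign and keeps the magnitude. Dividing by n versus n − 1 makes no difference to r, as long as the same choice is used in the covariance and both standard deviations.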
The phrase *linear association* is doing heavy lifting in that definition: r detects only *straight-line* patterns. A dataset where Y = X² (a perfect parabola) has r = 0 when the X values are spread symmetrically around 0, because the two halves cancel: as X goes from 0 to 1, Y rises, and as X goes from −1 to 0, Y falls by the same amount. The scatterplot would show an obvious strong relationship; r would report none. This is why examining the scatterplot alongside r is essential: r summarizes one aspect of the relationship, not the whole picture.
The most common misuse of r is treating it as evidence of causation. Two variables can be highly correlated because one causes the other, because both are caused by a third variable, or purely by chance in a small sample. Ice cream sales and sunburn rates both spike in summer; their correlation is high, but neither causes the other. Identifying correlation is the *beginning* of causal inquiry, not the end. Establishing causation requires controlled experiments or careful causal reasoning beyond what r can provide.
Finally, what counts as a "strong" correlation depends on context. In physics experiments, r = 0.95 might be disappointing. In social science, r = 0.4 between a questionnaire score and a real-world outcome might be remarkably good. The sign of r tells you direction; the magnitude tells you how tightly the points cluster around a line; but interpreting whether that magnitude is meaningful requires knowing the domain, the sample size, and what you are trying to predict. r is a tool — its value only becomes interpretable in context.