
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Correlation Coefficient

College · Depth 60 in the knowledge graph

399 topics build on this

273 prerequisites beneath it



correlation pearson r association

Core Idea

The Pearson correlation coefficient r measures linear association between two variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear association. Defined as r = Cov(X,Y)/(σ_X × σ_Y), correlation is unitless and symmetric in X and Y. A correlation near 0 doesn't mean no relationship—it indicates no linear relationship; nonlinear associations may be strong but have correlation near 0.

How It's Best Learned

Compute r for various datasets and compare each value to its scatterplot. Generate data with specified correlations. Show examples where r = 0 but a strong relationship exists.
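One way to try the "generate data with specified correlations" exercise is sketched below. It assumes the standard construction Y = ρX + √(1 − ρ²)Z, with X and Z independent standard normal draws, which gives a population correlation of ρ. The helper names (pearson_r, correlated_sample), the seed, and the sample size are illustrative choices, not part of the original material.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson r; the (n - 1) factors in Cov and the sigmas cancel out."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlated_sample(rho, n, rng):
    """Draw n (x, y) pairs whose population correlation is rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rng.gauss(0, 1)  # independent noise
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * z)
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for target in (-0.9, -0.3, 0.0, 0.5, 0.8):
    xs, ys = correlated_sample(target, 10_000, rng)
    results[target] = pearson_r(xs, ys)
    print(f"target rho = {target:+.1f}, sample r = {results[target]:+.3f}")
```

Plotting each (xs, ys) pair as a scatterplot next to its printed r is a natural follow-up: the sample r should land close to the target, though never exactly on it.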

Common Misconceptions

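Thinking r = 0 implies independence or no association at all (it rules out only a linear association). Confusing correlation with causation. Believing a fixed cutoff such as |r| > 0.5 always indicates a strong relationship (what counts as strong depends on context).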

Explainer

When you examined scatterplots, you developed an intuitive sense for association: points that trend upward together suggest a positive relationship; points that trend in opposite directions suggest a negative one; a shapeless cloud suggests none. The Pearson correlation coefficient r turns that intuition into a single number. It measures how closely the data points cluster around a straight line, ranging from −1 (a perfect downward line) through 0 (no linear trend) to +1 (a perfect upward line).

The formula is r = Cov(X, Y) / (σ_X · σ_Y), where Cov(X, Y) is the covariance (roughly, how much X and Y vary together) and σ_X, σ_Y are the standard deviations of each variable. Dividing by the standard deviations standardizes the result, which is why r is unitless and always falls in [−1, 1]. You can swap X and Y without changing r (it is symmetric), and adding a constant to either variable or multiplying it by a positive constant leaves r unchanged; multiplying by a negative constant flips only the sign. These properties make r a clean, interpretable summary of linear association.
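The formula can be checked by hand with a short sketch like the one below. It uses sample (n − 1) estimates of the covariance and standard deviations, and the hours and score values are made-up illustrative data. It also checks that swapping the variables or rescaling one of them leaves r unchanged, and that multiplying by a negative constant only flips the sign.

```python
import math


def pearson_r(xs, ys):
    """r = Cov(X, Y) / (sigma_X * sigma_Y), using sample (n - 1) estimates."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Made-up data for illustration: hours studied and exam score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 71]

r = pearson_r(hours, score)
r_swapped = pearson_r(score, hours)                       # symmetry in X and Y
r_rescaled = pearson_r([60 * h for h in hours], score)    # hours -> minutes
r_flipped = pearson_r([-h for h in hours], score)         # negative constant

print(f"r          = {r:.4f}")
print(f"r swapped  = {r_swapped:.4f}")
print(f"r rescaled = {r_rescaled:.4f}")
print(f"r flipped  = {r_flipped:.4f}")
```

Converting hours to minutes changes the covariance and σ_X by the same factor of 60, so the ratio, and therefore r, stays the same.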

The phrase linear association is doing heavy lifting in that definition. r only detects *straight-line* patterns. A dataset where Y = X² (a perfect parabola), with X values spread symmetrically around 0, has r = 0, because the parabola is symmetric: the upward movement of Y as X goes from 0 to 1 is matched by an equal downward movement as X goes from −1 to 0, and these cancel. The scatterplot would show an obvious strong relationship; r would tell you nothing. This is why examining the scatterplot before and after computing r is essential: r summarizes one aspect of the relationship, not the whole picture.
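The parabola case is easy to verify numerically. The sketch below evaluates Y = X² on an evenly spaced grid from −1 to 1 (an assumed, illustrative choice of X values) and then on the right half only, where the curve is monotonic and r is no longer near zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson r; the (n - 1) factors in Cov and the sigmas cancel out."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# X spread symmetrically around 0: -1.0, -0.9, ..., 0.9, 1.0
xs = [k / 10 for k in range(-10, 11)]
ys = [x ** 2 for x in xs]          # Y is fully determined by X
r_symmetric = pearson_r(xs, ys)

# Same curve, but only the right half (0.0 to 1.0)
xs_right = [k / 10 for k in range(0, 11)]
ys_right = [x ** 2 for x in xs_right]
r_right_half = pearson_r(xs_right, ys_right)

print(f"r on [-1, 1]: {r_symmetric:.6f}")   # zero up to floating-point rounding
print(f"r on [ 0, 1]: {r_right_half:.6f}")  # high, though the curve is not a line
```

The same functional relationship produces very different values of r depending on which X values are sampled, which is one more reason to look at the scatterplot.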

The most common misuse of r is treating it as evidence of causation. Two variables can be highly correlated because one causes the other, because both are caused by a third variable, or purely by chance in a small sample. Ice cream sales and sunburn rates both spike in summer; their correlation is high, but neither causes the other. Identifying correlation is the *beginning* of causal inquiry, not the end. Establishing causation requires controlled experiments or careful causal reasoning beyond what r can provide.
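The third-variable case can be simulated. In the sketch below, temperature drives both ice cream sales and sunburn counts, and neither quantity enters the formula for the other. All numbers (the mean temperature, the slopes, the noise levels) are invented for illustration. Restricting the data to days with nearly the same temperature is a crude way to hold the common cause fixed.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson r; the (n - 1) factors in Cov and the sigmas cancel out."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable

temps, ice_cream, sunburn = [], [], []
for _ in range(20_000):
    t = rng.gauss(25, 5)                          # daily temperature (common cause)
    temps.append(t)
    ice_cream.append(10 * t + rng.gauss(0, 20))   # depends on temperature only
    sunburn.append(2 * t + rng.gauss(0, 5))       # depends on temperature only

r_all = pearson_r(ice_cream, sunburn)

# Hold the common cause roughly fixed: keep days between 24.5 and 25.5 degrees.
band = [i for i, t in enumerate(temps) if 24.5 <= t <= 25.5]
r_band = pearson_r([ice_cream[i] for i in band], [sunburn[i] for i in band])

print(f"all days:            r = {r_all:+.3f}")
print(f"similar-temperature: r = {r_band:+.3f} (n = {len(band)})")
```

Across all days the correlation is high; among days with similar temperature it nearly disappears, which is the pattern expected when a shared cause, not a direct link, produces the association.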

Finally, what counts as a "strong" correlation depends on context. In physics experiments, r = 0.95 might be disappointing. In social science, r = 0.4 between a questionnaire score and a real-world outcome might be remarkably good. The sign of r tells you direction; the magnitude tells you how tightly the points cluster around a line; but interpreting whether that magnitude is meaningful requires knowing the domain, the sample size, and what you are trying to predict. r is a tool — its value only becomes interpretable in context.
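The role of sample size can be illustrated with a small simulation. The sketch below repeatedly draws two independent variables (so the true correlation is 0) and counts how often the sample |r| still exceeds 0.5. The cutoff, the sample sizes, and the number of trials are arbitrary illustrative choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson r; the (n - 1) factors in Cov and the sigmas cancel out."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_large_r(n, trials, rng, cutoff=0.5):
    """Share of trials where independent samples of size n give |r| > cutoff."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # drawn independently of xs
        if abs(pearson_r(xs, ys)) > cutoff:
            hits += 1
    return hits / trials


rng = random.Random(1)  # fixed seed so the run is repeatable
fractions = {n: fraction_large_r(n, 1000, rng) for n in (5, 50, 500)}
for n, frac in fractions.items():
    print(f"n = {n:3d}: |r| > 0.5 in {frac:.1%} of trials")
```

With only five points, a seemingly sizable r turns up by chance in a large share of trials; with hundreds of points it essentially never does. The same value of r therefore carries very different weight depending on how much data stands behind it.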


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Making 10 as an Addition Strategy → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts Through 10 → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Length Comparison → Measuring Length with Non-Standard Units → Measuring Length With a Ruler → Measuring with Feet and Meters → Estimating Lengths → Line Plots with Measurement Data → Organizing and Representing Data → Creating Tally Charts → Creating and Reading Picture Graphs → Scaled Bar Graphs → Mean, Median, and Mode → Measures of Spread → Correlation Coefficient

Longest path: 61 steps · 273 total prerequisite topics

Prerequisites (2)

Scatterplots and Correlation (hard) · Measures of Spread (soft)

Leads To (16)

Beta and Systematic Risk (soft) · Capital Asset Pricing Model (CAPM) (hard) · Classical Test Theory Foundations (soft) · Correlational Research Design (soft) · Linear Regression Basics (hard) · Measurement Reliability: Types and Estimation (soft) · Mediation Analysis and Indirect Effects in Causal Pathways (hard) · Multicollinearity (hard) · Portfolio Diversification (hard) · R-Squared and Model Fit (soft) · Reliability Estimation Methods and Method Selection (soft) · Reliability in Psychological Measurement (soft) · Simple (Bivariate) OLS Regression (hard) · Simple Linear Regression (hard) · Test-Retest Reliability and Temporal Stability (soft) · Validity in Psychological Measurement (soft)