A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Random Sampling Techniques

Research · Depth 97 in the knowledge graph

3 topics build on this

605 prerequisites beneath it

Core Idea

Random sampling is a foundational technique in algorithm design where selecting elements randomly from a dataset enables efficient estimation, selection, and optimization. Reservoir sampling solves the problem of uniformly sampling k items from a stream of unknown length in O(k) space. Importance sampling reweights samples to reduce variance when estimating expectations, enabling efficient simulation of rare events. Random sampling underpins randomized selection (expected O(n) median finding), random projections (Johnson-Lindenstrauss dimensionality reduction), and the design of sublinear-time algorithms that make decisions by examining only a small fraction of the input.
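As an illustration of the randomized selection mentioned above, here is a minimal sketch of quickselect with a uniformly random pivot. The function name `quickselect` and its signature are assumptions made for this example, and the list-copying style favors clarity over in-place efficiency.

```python
import random
from typing import List, Optional


def quickselect(items: List[float], rank: int,
                rng: Optional[random.Random] = None) -> float:
    """Return the element that would sit at 0-based index `rank` if items were sorted."""
    if not 0 <= rank < len(items):
        raise IndexError("rank out of range")
    rng = rng or random.Random()
    candidates = list(items)  # work on a copy so the caller's list is untouched
    while True:
        # A uniformly random pivot discards a constant fraction of the
        # candidates in expectation, which gives expected O(n) total work.
        pivot = candidates[rng.randrange(len(candidates))]
        smaller = [x for x in candidates if x < pivot]
        equal_count = sum(1 for x in candidates if x == pivot)
        if rank < len(smaller):
            candidates = smaller
        elif rank < len(smaller) + equal_count:
            return pivot
        else:
            rank -= len(smaller) + equal_count
            candidates = [x for x in candidates if x > pivot]
```

Calling `quickselect(values, len(values) // 2)` returns the upper median of `values`. The worst case is still quadratic, but it requires a long run of unlucky pivots.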

Explainer

Random sampling is one of the most versatile tools in the algorithm designer's toolkit. At its simplest, drawing a random subset of an input lets you estimate global properties without examining every element. But the techniques range from the elegant (reservoir sampling for streams) to the sophisticated (importance sampling for variance reduction), and the theoretical foundations connect to concentration inequalities, approximation theory, and information-theoretic limits.

Reservoir sampling addresses a clean problem: maintain a uniform random sample of k elements from a data stream whose length is not known in advance. The algorithm initializes the reservoir with the first k elements; then, for each subsequent element at position i > k (counting from 1), it includes that element with probability k/i, replacing a uniformly random element of the reservoir. The proof of correctness is a telescoping argument: element i enters with probability k/i and survives each later round j with probability (j-1)/j, so after n elements its inclusion probability is (k/i) · (i/(i+1)) · ... · ((n-1)/n) = k/n. The first k elements start with probability 1 and telescope to the same value. The algorithm makes a single pass and uses O(k) memory regardless of stream length, making it practical for massive data streams where you cannot store or revisit the data.
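The procedure described above can be sketched as follows. This is a minimal version under stated assumptions: the name `reservoir_sample` and the optional `rng` parameter are choices made for this example, and positions are counted from 1 to match the k/i rule.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a single pass over stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # j is uniform on [0, i), so j < k happens with probability k/i.
            # Given j < k, j is also uniform over the k slots, so one draw
            # decides both whether to include the item and which slot to evict.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

If the stream holds fewer than k items, the sketch simply returns all of them. Only the reservoir is kept in memory, so `stream` can be any iterator, such as lines read from a file.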

Importance sampling solves a different problem: efficiently estimating E_p[f(x)] when sampling from p is difficult or when naive sampling has high variance. Instead of drawing from p, you sample from a proposal distribution q and weight each sampled value f(x) by p(x)/q(x). The estimator is unbiased for any q that is positive wherever f(x) * p(x) is nonzero, but its variance depends critically on how well q matches the shape of |f(x)| * p(x). The variance-minimizing proposal is proportional to |f(x)| * p(x): it concentrates samples where the integrand is large, which can dramatically reduce the number of samples needed. A poorly matched q, especially one with lighter tails than the integrand, can instead make the variance very large. This is essential in computational physics (rare event simulation), Bayesian inference (sampling from complex posteriors), and Monte Carlo integration.
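A small rare-event example makes the reweighting concrete. The sketch below estimates P(X > t) for a standard normal X by sampling from a normal proposal centered at t. The choice of proposal and the function names are assumptions made for this illustration, not a recommended default.

```python
import math
import random


def tail_prob_importance(threshold: float, n: int, rng: random.Random) -> float:
    """Estimate P(X > threshold) for X ~ N(0, 1) using the proposal q = N(threshold, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # draw from q, not from p
        if x > threshold:              # f(x) is the indicator of the rare event
            # Weight p(x)/q(x) for p = N(0, 1) and q = N(threshold, 1):
            # exp(-x^2/2) / exp(-(x - t)^2/2) = exp(-t*x + t^2/2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


def tail_prob_exact(threshold: float) -> float:
    """Closed-form normal tail probability, used only to check the estimate."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))
```

For t = 4 the true probability is about 3.2e-5, so naive sampling from N(0, 1) would need on the order of millions of draws to see even a handful of hits. Under the shifted proposal, roughly half the draws land in the region of interest and the weights correct for the change of distribution.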

The deeper significance of random sampling is that it enables sublinear-time computation. If you want to know whether a property holds for every element of a massive dataset or fails on a noticeable fraction of it, you do not need to examine every element: a uniform random sample of size O(1/epsilon) suffices to distinguish, with constant probability, "property holds everywhere" from "property fails on at least an epsilon-fraction of elements," independent of the dataset size. The reason is that when an epsilon-fraction of elements is bad, the chance that m independent samples all miss them is (1 - epsilon)^m <= e^(-epsilon * m), which is a small constant once m is a few multiples of 1/epsilon. This insight underlies property testing, streaming algorithms, and the entire field of sublinear algorithms. The price is approximation: you trade exact answers for large speed gains. But in an era of terabyte-scale data, an approximate answer in seconds often beats an exact answer in hours.
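The sample-size claim can be made concrete with a short sketch. If at least an epsilon-fraction of elements violates the property, m uniform samples all miss a violation with probability at most (1 - epsilon)^m <= e^(-epsilon * m), so m = ceil(ln(1/delta) / epsilon) samples push the failure probability below delta. The names `sample_size` and `passes_spot_check` are hypothetical, and the data is assumed to be non-empty and indexable.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def sample_size(epsilon: float, delta: float) -> int:
    """Number of samples so that an epsilon-fraction of violations is missed w.p. <= delta."""
    return math.ceil(math.log(1.0 / delta) / epsilon)


def passes_spot_check(data: Sequence[T], predicate: Callable[[T], bool],
                      epsilon: float, delta: float, rng: random.Random) -> bool:
    """Return False as soon as a sampled element violates predicate, else True.

    One-sided error: if every element satisfies predicate, the answer is
    always True. If at least an epsilon-fraction violates it, the answer is
    False with probability at least 1 - delta.
    """
    for _ in range(sample_size(epsilon, delta)):
        # Sample with replacement; assumes data is non-empty.
        if not predicate(data[rng.randrange(len(data))]):
            return False
    return True
```

The number of samples depends only on epsilon and delta, never on `len(data)`, which is what makes the check sublinear. For example, `sample_size(0.01, 0.05)` is 300 whether the dataset has a thousand elements or a billion.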

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Boolean Algebra and Fundamental Laws → Logic Gates Fundamentals → Implementing Boolean Functions with Gates → Karnaugh Map Simplification → Combinational Circuit Design → Flip-Flops and Latches → Finite State Machines (FSMs) → Deterministic Finite Automata (DFA) → Nondeterministic Finite Automata (NFA) → Two-Way Finite Automata → NFA to DFA Conversion (Subset Construction) → DFA Properties and Minimization Algorithms → Regular Languages: Definition and Characterization → Context-Free Grammars (CFGs) → Pushdown Automata (PDA) → Equivalence of CFGs and Pushdown Automata → Closure Properties of Context-Free Languages → Limitations of Context-Free Languages → Pumping Lemma for Context-Free Languages → Turing Machines → Variants of Turing Machines and Equivalence → Nondeterministic Time Complexity and NP → The P vs. NP Problem → Complexity Class P: Polynomial Time → Randomized Algorithms → Random Sampling Techniques

Longest path: 98 steps · 605 total prerequisite topics

Prerequisites (3)

Randomized Algorithms (hard) · Expected Value and Variance (hard) · Probability Density Functions (soft)

Leads To (2)

Concentration Inequalities for Algorithm Design (soft) · Sublinear Algorithms (hard)