External Memory Algorithms

Research Depth 66 in the knowledge graph I know this Set as goal
Unlocks 1 downstream topic
external-memory io-complexity disk-algorithms memory-hierarchy

Core Idea

External memory (EM) algorithms optimize for the memory hierarchy by minimizing the number of block transfers between slow storage (disk) and fast memory (RAM). In the standard model, memory holds M elements, disk blocks hold B elements, and the cost metric is the number of I/O operations (block transfers). Scanning N elements costs Theta(N/B), sorting costs Theta((N/B) * log_{M/B}(N/B)), and searching in a B-tree costs Theta(log_B N). The sorting bound reveals a fundamental separation from the RAM model: the log base is M/B (the number of blocks fitting in memory), not 2, reflecting the ability to merge M/B sorted runs simultaneously. External memory sorting is the foundational primitive — many EM algorithms reduce to sorting, making the sorting I/O bound the de facto baseline.

Explainer

The RAM model of computation assumes uniform-cost memory access — reading any memory location costs the same. Modern hardware violates this dramatically: L1 cache access takes ~1ns, RAM takes ~100ns, SSD takes ~100μs, and HDD takes ~10ms. The external memory model captures this hierarchy by distinguishing fast memory (size M) from slow storage (block size B), and counting block transfers as the cost measure. This simple two-level model, introduced by Aggarwal and Vitter (1988), yields a rich theory with practical implications for any computation on data too large for RAM.

The foundational result is the external memory sorting bound: Theta((N/B) * log_{M/B}(N/B)) I/Os. Understanding why this is optimal requires seeing what information each I/O provides. Reading a block of B elements reveals their relative order (B! possibilities) among at most M elements in memory — each I/O resolves at most log(B! * C(M,B)) ≈ B * log(M/B) bits of uncertainty about the final ordering. The total uncertainty is log(N!) ≈ N log N bits, so at least N log N / (B log(M/B)) = (N/B) * log_{M/B}(N/B) I/Os are needed. The M/B-way merge sort achieves this: create N/M sorted runs, merge M/B at a time, each pass costs 2N/B I/Os, and log_{M/B}(N/M) passes suffice.

B-trees are the EM analog of balanced binary search trees. Each node contains Theta(B) keys (filling one disk block), and each I/O during a search eliminates a Theta(B)-fraction of the remaining candidates. This gives O(log_B N) search cost instead of O(log_2 N) — with B = 1000, searching a billion keys takes about 3 I/Os instead of 30. B-trees also support efficient range queries (O(log_B N + K/B) for K results) and updates (O(log_B N) amortized). This combination of efficient point queries, range queries, and updates explains why B-trees are the universal data structure for databases and file systems.

Beyond sorting and searching, many EM algorithms follow a "reduce to sorting" paradigm. Graph algorithms (BFS, connected components, MST) achieve optimal I/O bounds by converting graph traversal into a sequence of sorting and scanning operations, avoiding the random access patterns that make naive graph algorithms I/O-inefficient. The external memory model also connects to the streaming model: a streaming algorithm with S bits of memory can be viewed as an external memory algorithm with M = S/B blocks, making one pass over the data. This connection explains why streaming lower bounds and EM lower bounds often use similar techniques.

Practice Questions 4 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsStep FunctionsComposition of FunctionsInverse FunctionsRadical Functions and GraphsRational ExponentsExponential Functions and GraphsLogarithms IntroductionBig-O Notation and Asymptotic AnalysisAlgorithm Analysis and Complexity ClassesDivide-and-Conquer and the Master TheoremDivide and ConquerMerge SortExternal Memory Algorithms

Longest path: 67 steps · 366 total prerequisite topics

Prerequisites (3)

Leads To (1)