The count-min sketch maintains a d-by-w matrix of counters. When an item x arrives with frequency update, each counter in row i is incremented at position h_i(x). Why is sketch merging (element-wise addition of two sketches) exact, and how does this enable distributed processing?
AMerging is approximate; it introduces additional errors from the two sketches
BThe count-min sketch is a linear function of the frequency vector — sketch(f1 + f2) = sketch(f1) + sketch(f2) exactly because each counter independently sums the frequencies of items hashing to its position
CMerging is exact only if the two sketches use the same random hash functions, but this is impractical for distributed systems
DMerging requires re-hashing all items, which defeats the purpose of sketching
Each counter at position (i, j) tracks the sum of frequencies f_x for all items x such that h_i(x) = j. This is a linear operation on the frequency vector. When two sketches are merged (added element-wise), counter (i, j) in the merged sketch equals the sum of the two original counters, which equals the frequency sum from both input streams combined. Distributed count-min works by: partition the stream across compute nodes, each computes a sketch, send sketches to a central coordinator (O(d * w) bits each), sum them element-wise, and query the merged sketch. The result is identical to sketching the concatenated stream, making distribution seamless.
Question 2 True / False
The Johnson-Lindenstrauss lemma states that random projections preserve all pairwise distances in a set of n points to within a (1 ± epsilon) multiplicative factor, using only O(log(n) / epsilon^2) dimensions. This enables approximate nearest-neighbor search in high-dimensional data.
TTrue
FFalse
Answer: True
Given n points in d-dimensional space (d can be huge, like 10^6 for image features), choose a random d-by-k matrix where k = O(log(n) / epsilon^2). Project each point: p' = A * p. With high probability, ||p'_i - p'_j|| ≈ (1 ± epsilon) * ||p_i - p_j|| for all pairs (i,j). Since distances are preserved, nearest neighbors in the original space remain near-neighbors in the projection, enabling fast approximate nearest-neighbor via brute force in O(k * n^2) = O(log(n) / epsilon^2 * n^2) time instead of brute force in original space. For massive-scale problems, this dimension reduction is critical.
Question 3 Short Answer
Min-hashing estimates the Jaccard similarity between two sets A and B. The method: choose k hash functions, for each set compute the minimum hash value across all elements, then estimate Jaccard similarity as the fraction of hash functions on which the two sets agree. Why does this work?
Think about your answer, then reveal below.
Model answer: For a random hash function h, min(h(A)) = min(h(B)) if and only if the minimum hash value comes from the union A ∪ B and belongs to both A and B (since if only one set contains the minimum, they cannot agree). The probability that min(h(A)) = min(h(B)) equals the probability that the element yielding the minimum hash is in both sets, which is |A ∩ B| / |A ∪ B| = Jaccard(A, B). Each of k hash functions provides an independent estimate; the fraction that agree (estimate) converges to true Jaccard similarity. With k = O(1 / epsilon^2) functions, the estimate is (1 ± epsilon)-multiplicative with high probability.
Min-hashing is elegant: a single number (minimum hash value) per set encodes membership probability information. Multiple repetitions reduce variance. This is foundational for large-scale similarity search and set operations in databases.
Question 4 True / False
Sketches are mergeable (can be added/combined) because they are linear functions of the input. This property enables one-shot distributed algorithms and streaming updates that would be impossible with non-linear summaries.
TTrue
FFalse
Answer: True
A sketch is linear if sketch(f) = A * f (matrix-vector multiplication) for some fixed matrix A. Then sketch(f1 + f2) = A * (f1 + f2) = A * f1 + A * f2 = sketch(f1) + sketch(f2). Non-linear summaries like median or percentile are not mergeable: you cannot compute the median of a combined dataset by merging medians of subsets. Linearity is why count-min and min-hashing are so practical for distributed systems: compute sketches independently at nodes, transmit O(space) bits, merge, answer queries. This would be infeasible with non-linear statistics.