Questions: Data Sharding and Partitioning Strategies
5 questions to test your understanding
Question 1 Multiple Choice
A database uses hash sharding on user_id. An analyst runs the query: 'Find all users who signed up between January and March 2024.' How does the database handle this query?
A. It routes the query to the shard responsible for the date range Jan–Mar 2024
B. It must contact every shard, because hash(user_id) scatters adjacent user_ids across all nodes, so there is no correlation between signup date and shard location
C. It contacts only the shard that happens to store the most recent signups, since hash functions preserve insertion order
D. It uses a secondary index on signup date stored on the coordinator node to route the query efficiently
Answer: B
Hash sharding places each row on shard hash(key) mod N, which scatters adjacent keys roughly uniformly across all nodes. This avoids hot spots caused by key ordering but destroys range locality: no shard is responsible for a date range. A range scan on any attribute, including the shard key itself, requires contacting every shard (a 'scatter-gather' query); only exact-match lookups on the shard key go to a single shard. This is the fundamental tradeoff: hash sharding gives near-uniform load distribution but makes range queries and analytics workloads expensive.
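The routing behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular database's router: the shard count, the choice of SHA-256 as the hash, and the function names are all assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration only


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id to a shard with hash(key) mod N, using a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_for_user_lookup(user_id: int) -> set[int]:
    """A point lookup on the shard key is routed to exactly one shard."""
    return {shard_for(user_id)}


def shards_for_signup_range(start_date: str, end_date: str) -> set[int]:
    """A signup-date range gives the router nothing to hash,
    so the query fans out to every shard (scatter-gather)."""
    return set(range(NUM_SHARDS))


if __name__ == "__main__":
    print("lookup user 42 ->", shards_for_user_lookup(42))
    print("signups Jan-Mar 2024 ->", shards_for_signup_range("2024-01-01", "2024-03-31"))
```

The date arguments are deliberately unused in the range function: with hash(user_id) as the only placement rule, no value of the date range can narrow the set of shards.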
Question 2 Multiple Choice
A social media platform is choosing a shard key. Option A: shard on user_id (high cardinality, uniformly distributed). Option B: shard on country (low cardinality, uneven distribution). Why is option A generally better?
A. user_id produces more shards, which always improves performance
B. country causes hot spots because a few large countries (US, India) would receive a disproportionate fraction of traffic, overwhelming those shards while others sit idle
C. user_id is better because alphabetical ordering makes range queries on users more efficient
D. country is worse only because it has fewer possible values, not because of traffic distribution
Answer: B
The critical failure mode of a poor shard key is the hot spot: when one shard receives significantly more traffic than the others, the overloaded shard becomes the bottleneck and adding more shards does not help. Country has low cardinality (around 200 values) and highly uneven traffic: on many platforms a few large countries such as the US and India account for a large fraction of users. Sharding on country concentrates that traffic on the few shards holding those countries. user_id has high cardinality and, with a good hash, near-uniform distribution, so every shard gets roughly equal traffic.
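A small simulation makes the difference in skew concrete. The sketch below is illustrative only: the per-country user counts are made-up numbers chosen to mimic a skewed population, and the shard count and helper names are assumptions.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # assumed cluster size

# Made-up population of 10,000 users, deliberately skewed toward two countries.
COUNTRY_COUNTS = {"US": 4000, "IN": 4000, "BR": 1000, "DE": 500, "JP": 500}


def stable_hash(value: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


def load_per_shard(keys, num_shards: int) -> list[int]:
    """Count how many rows land on each shard when sharding on the given keys."""
    counts = Counter(stable_hash(str(k)) % num_shards for k in keys)
    return [counts.get(shard, 0) for shard in range(num_shards)]


countries = [c for c, n in COUNTRY_COUNTS.items() for _ in range(n)]
user_ids = range(len(countries))

by_country = load_per_shard(countries, NUM_SHARDS)
by_user_id = load_per_shard(user_ids, NUM_SHARDS)

if __name__ == "__main__":
    print("sharded on country:", by_country)
    print("sharded on user_id:", by_user_id)
```

With only five distinct country values, at most five of the eight shards can receive any rows, and whichever shard holds a large country carries at least 4,000 of the 10,000 users. Sharding on user_id spreads the same rows close to evenly.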
Question 3 True / False
Hash sharding is strictly better than range sharding because it typically distributes load evenly and eliminates hot spots.
T. True
F. False
Answer: False
Hash sharding eliminates hot spots that come from natural key ordering, but at the cost of range query efficiency. When your workload involves range scans, such as 'find all orders placed last week' or 'list users with IDs between 10000 and 20000', hash sharding forces scatter-gather queries that contact every shard. Range sharding on the queried key handles these queries efficiently by routing them to the one shard, or the few adjacent shards, that hold the range. The right choice depends on access patterns: hash sharding suits key-value lookups and point queries; range sharding suits analytics and ordered traversals. Neither is universally better.
Question 4 True / False
With range sharding on last name, all users with last names starting with 'J' can typically be served by querying a single shard.
T. True
F. False
Answer: True
Range sharding assigns contiguous key ranges to nodes, so all keys within a range are co-located on the same shard. If the range partition assigns A–F to shard 1, G–M to shard 2, etc., then all 'J' names fall within the G–M range and live on shard 2. This is the defining advantage of range sharding: a range query on the shard key touches only the shard, or the few adjacent shards, that cover the range. The disadvantage is that uneven data or traffic distribution across the key space creates hot spots: if most users have names starting with 'S', the shard holding that range may be overloaded.
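Range routing of this kind amounts to a binary search over the split points. The sketch below assumes four shards split at G, N, and T (only the A–F and G–M ranges come from the explanation above; the other two are invented for the example) and assumes non-empty ASCII last names.

```python
from bisect import bisect_right

# Assumed split points: shard 1 = A-F, shard 2 = G-M, shard 3 = N-S, shard 4 = T-Z.
# Each entry is the first letter owned by the next shard.
SPLIT_POINTS = ["G", "N", "T"]


def shard_for_name(last_name: str) -> int:
    """Return the 1-based shard number whose range contains the name's first letter."""
    first_letter = last_name[0].upper()
    return bisect_right(SPLIT_POINTS, first_letter) + 1


def shards_for_range(low: str, high: str) -> list[int]:
    """Return the contiguous run of shards covering first letters low..high inclusive."""
    return list(range(shard_for_name(low), shard_for_name(high) + 1))


if __name__ == "__main__":
    print(shard_for_name("Johnson"))   # a 'J' name lands in the G-M shard
    print(shards_for_range("J", "J"))  # every 'J' name: one shard
    print(shards_for_range("E", "H"))  # a range that straddles a split point
```

A query for all 'J' names resolves to a single shard, while a range that crosses a split point, such as E through H, touches two adjacent shards rather than the whole cluster.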
Question 5 Short Answer
A startup is designing a sharding strategy for a social media platform. Why might sharding on user_id be better than sharding on country, and what failure mode remains even with a good shard key?
Model answer: Sharding on user_id provides high cardinality and, with a hash function, near-uniform distribution across shards, so each shard gets roughly equal traffic regardless of where users are from. Country has only ~200 possible values and highly skewed traffic (a few countries dominate), causing hot spots on those shards. Even with user_id as the shard key, the remaining failure mode is cross-shard queries: operations that need data from many users at once, like 'find all mutual friends between two users on different shards' or 'compute global leaderboards', require contacting multiple shards, increasing latency and coordination overhead.
Shard key selection is one of the most consequential design decisions in a sharded system. A poor key concentrates traffic and negates the benefits of horizontal scaling. But even a well-chosen key cannot eliminate cross-shard queries for all workloads: the fundamental tradeoff is between distribution uniformity (preventing hot spots) and data locality (keeping related data on the same shard to avoid scatter-gather). Most production systems accept some cross-shard queries for rare operations while optimizing the shard key for the most common access patterns.
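As one illustration of a cross-shard operation, a global leaderboard can be computed by scatter-gather: each shard returns its local top k, and a coordinator merges the partial results. The sketch below models shards as in-memory lists of (user, score) pairs; the data and function names are hypothetical, and a real system would issue these calls over the network.

```python
import heapq

Row = tuple[str, int]  # (user name, score)


def local_top_k(shard_rows: list[Row], k: int) -> list[Row]:
    """Work done on a single shard: its own k highest-scoring rows."""
    return heapq.nlargest(k, shard_rows, key=lambda row: row[1])


def global_top_k(shards: list[list[Row]], k: int) -> list[Row]:
    """Coordinator: query every shard, then merge the partial results.
    Any row in the global top k is also in its own shard's top k,
    so merging the local lists is sufficient."""
    partials = [row for shard in shards for row in local_top_k(shard, k)]
    return heapq.nlargest(k, partials, key=lambda row: row[1])


# Hypothetical data spread across three shards.
SHARDS = [
    [("ana", 50), ("bo", 10)],
    [("cy", 70)],
    [("di", 60), ("ed", 65)],
]

if __name__ == "__main__":
    print(global_top_k(SHARDS, 2))
```

The result is correct, but the cost is visible in the structure: every shard is contacted, and the coordinator waits for the slowest one before it can answer.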