Genome assembly reconstructs a complete genome sequence from millions of short sequencing reads. De novo assembly builds the genome without a reference, typically using overlap-layout-consensus (for long reads) or de Bruijn graph approaches (for short reads). Reference-guided assembly maps reads to an existing reference genome. Assembly quality is measured by metrics like N50 (the contig length at which half the assembly is in contigs of that length or longer), total assembly size, and completeness (e.g., BUSCO scores). Repetitive sequences are the primary obstacle, creating ambiguities that fragment the assembly.
Assemble a small bacterial genome (~5 Mb) from simulated Illumina reads using SPAdes. Examine the output: count contigs, compute N50, and identify where the assembly broke — typically at repetitive elements. Then compare to an assembly of the same genome using long reads.
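The contig count and N50 asked for above can be computed with a few lines of Python. This is a minimal sketch, not the method of any particular tool: the function names `contig_lengths` and `n50` are illustrative, and the input is assumed to be a plain FASTA file of contigs.

```python
def contig_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None  # running length of the record being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which contigs of that length or longer hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly
```

An open file handle can be passed directly, for example `contig_lengths(open("contigs.fasta"))`, where the path is a placeholder for wherever the assembler wrote its contigs. The number of contigs is then simply the length of the returned list.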
Sequencing technologies produce reads: short stretches of determined sequence, typically 150-300 bp for Illumina or 10,000-100,000+ bp for long-read platforms. A human genome is about 3.2 billion base pairs, so any single read covers only a tiny fraction of it. Assembly is the computational process of piecing millions of overlapping reads back together into the original genome sequence, like solving a jigsaw puzzle with millions of pieces, many of which look identical.
For short-read assembly, the dominant approach uses de Bruijn graphs. The algorithm breaks each read into overlapping k-mers (subsequences of length k, typically 21-127 bp), builds a graph where each k-mer is a node and an edge connects two k-mers when the last k-1 bases of one match the first k-1 bases of the other, then finds paths through the graph that represent the original sequences. The advantage over simple overlap-based methods is computational efficiency: comparing every pair of reads is prohibitively expensive when there are hundreds of millions of them, while k-mer graph construction scales roughly linearly with the total amount of sequence. Tools like SPAdes, MEGAHIT, and Velvet use this approach with various refinements.
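The k-mer graph described above can be illustrated with a toy example. This is a minimal sketch under strong assumptions: reads are error-free, only one strand is considered, and the function names (`kmers`, `build_graph`, `walk_unambiguous`) are made up for illustration. Production assemblers such as SPAdes add error correction, multiple k values, and graph simplification that are not shown here.

```python
def kmers(read, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_graph(reads, k):
    """Map each k-mer node to the k-mers whose first k-1 bases match its last k-1 bases."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    return {
        node: {node[1:] + base for base in "ACGT" if node[1:] + base in nodes}
        for node in nodes
    }


def walk_unambiguous(graph):
    """Extend a contig from the single start node while exactly one successor exists."""
    has_predecessor = {succ for succs in graph.values() for succ in succs}
    starts = [node for node in graph if node not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("this toy walker expects exactly one start node")
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph[node]) == 1:
        (node,) = graph[node]
        if node in seen:  # stop instead of looping around a cycle
            break
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig
```

For the three reads `"ATGGCGTG"`, `"GCGTGCAA"`, and `"TGCAATC"` with k = 4, every k-mer has at most one successor, so the walk recovers a single contig. A repeat longer than k-1 bases would give some node two successors, and the walk would stop there, which is the fragmentation that repeats cause in real assemblies.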
For long-read assembly, overlap-layout-consensus (OLC) methods are more natural. Because long reads can span repetitive regions, the overlap graph is less tangled, and the assembler can resolve structures that short reads cannot. Tools like Canu, hifiasm, and Flye are designed for long reads. The tradeoff is that long reads historically had higher error rates (5-15% for PacBio CLR, 5-10% for Oxford Nanopore), requiring consensus correction. Modern PacBio HiFi reads achieve about 99.9% accuracy at 15-20 kb lengths, combining the length of long reads with accuracy close to that of short reads.
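The overlap and layout steps can be sketched on toy data. The example below assumes error-free reads and exact suffix-prefix matching, and it uses a greedy merge as a stand-in for the layout step. Real long-read assemblers such as Canu, hifiasm, and Flye instead use error-tolerant alignment, build a graph of overlaps, and finish with a consensus step, none of which is shown. The function names are illustrative.

```python
from itertools import permutations


def overlap_length(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_layout(reads, min_len):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        # Overlap step: score every ordered pair of remaining sequences.
        n, a, b = max(
            ((overlap_length(a, b, min_len), a, b) for a, b in permutations(reads, 2)),
            key=lambda item: item[0],
        )
        if n == 0:
            break  # nothing overlaps any more; the rest stay separate contigs
        # Layout step: join the pair, keeping the shared bases only once.
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])
    return reads
```

Calling `greedy_layout(["ATGGCGTG", "GCGTGCAA", "TGCAATC"], min_len=3)` merges the three reads into one sequence. The all-pairs comparison is quadratic in the number of reads, which is affordable for a modest number of long reads but not for hundreds of millions of short ones.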
Assembly quality is assessed by multiple metrics. N50 measures contiguity — a higher N50 means longer unbroken sequences. BUSCO (Benchmarking Universal Single-Copy Orthologs) checks whether expected conserved genes are present and complete, measuring biological completeness rather than just contiguity. Total assembly size should approximate the expected genome size. The gap between a fragmented draft assembly (thousands of contigs) and a finished, chromosome-level assembly is enormous, and closing that gap typically requires combining multiple data types: short reads for base accuracy, long reads for contiguity, and scaffolding technologies (Hi-C, optical mapping) to order and orient contigs into chromosome-scale sequences.