Genomes are far more than linear arrays of genes. In eukaryotes, protein-coding sequences (the coding portions of exons) typically constitute a small fraction of the genome (about 1.5% in humans), with the remainder comprising introns, regulatory elements, repetitive sequences (transposable elements such as LINEs and SINEs, plus tandem repeats), and other noncoding DNA. Genome size correlates poorly with organismal complexity (the C-value paradox). Understanding genome organization, including gene density, repeat content, GC content variation, chromatin domains, and chromosome structure, is essential for interpreting genomic data and predicting gene locations.
Compare genome statistics (size, gene count, gene density, repeat fraction) across a bacterium, yeast, fruit fly, and human. Visualize a 1-Mb region of the human genome in a genome browser (UCSC or Ensembl) and annotate what fraction is coding, intronic, repetitive, and intergenic.
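As a starting point for the comparison, gene density can be computed as gene count divided by genome size. The sketch below is illustrative only: the genome sizes and protein-coding gene counts are rounded, approximate values assumed for this example (they differ between assemblies and annotation releases), and the names `GENOMES`, `gene_density`, and `density_table` are hypothetical.

```python
# Approximate, rounded values assumed for illustration; check a current
# annotation release before relying on any of these numbers.
GENOMES = {
    "Escherichia coli K-12": {"size_mb": 4.6, "protein_coding_genes": 4300},
    "Saccharomyces cerevisiae": {"size_mb": 12.1, "protein_coding_genes": 6000},
    "Drosophila melanogaster": {"size_mb": 140.0, "protein_coding_genes": 14000},
    "Homo sapiens": {"size_mb": 3200.0, "protein_coding_genes": 20000},
}


def gene_density(size_mb, gene_count):
    """Return the number of genes per megabase of genome."""
    return gene_count / size_mb


def density_table(genomes):
    """Return (name, genes_per_mb) pairs sorted from densest to sparsest."""
    rows = [
        (name, gene_density(g["size_mb"], g["protein_coding_genes"]))
        for name, g in genomes.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, density in density_table(GENOMES):
        print(f"{name:<26} {density:8.1f} genes/Mb")
```

With these assumed inputs the bacterium comes out at several hundred genes per megabase and the human genome at fewer than ten, a gap of about two orders of magnitude even though the gene counts differ only about fivefold.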
When the Human Genome Project published its draft in 2001, one of the biggest surprises was how little of the genome actually codes for proteins. Only about 1.5% of the roughly 3.2 billion base pairs, or about 48 million, encode protein. The rest is a complex landscape of introns, regulatory sequences, ancient transposable elements, and sequences whose functions (if any) are still debated. Understanding this landscape is the first step in making sense of any genomic dataset.
Eukaryotic genomes are organized at multiple scales. At the finest level, genes consist of exons (retained in the mature RNA, and including the protein-coding sequence) interspersed with introns (removed during RNA splicing). Human genes average about 27 kilobases but vary wildly: the dystrophin gene spans 2.4 megabases, while some histone genes are short and have no introns at all. Genes are surrounded by regulatory elements: promoters, enhancers, silencers, and insulators, some of them located hundreds of kilobases from the genes they control. Between genes lie intergenic regions containing repetitive elements and sequences of unknown function.
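The exon, intron, and intergenic breakdown described above can be estimated from annotation coordinates. The following is a minimal sketch, assuming half-open `(start, end)` intervals relative to the start of the region and assuming every exon lies inside an annotated gene; the function names and the toy coordinates are hypothetical, not taken from any real annotation. It measures exonic sequence, which includes untranslated regions and is therefore somewhat larger than the strictly protein-coding fraction.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double counting bases.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_length(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def region_composition(region_length, genes, exons):
    """Return exonic, intronic, and intergenic fractions of a region.

    Assumes every exon falls within some gene interval, so that
    intronic bases = genic bases - exonic bases.
    """
    exonic = covered_length(exons)
    genic = covered_length(genes)
    return {
        "exonic": exonic / region_length,
        "intronic": (genic - exonic) / region_length,
        "intergenic": (region_length - genic) / region_length,
    }


# Toy 1-Mb region with two made-up genes (coordinates are illustrative).
toy_genes = [(100_000, 130_000), (500_000, 560_000)]
toy_exons = [
    (100_000, 100_500), (110_000, 110_300), (129_000, 130_000),
    (500_000, 501_000), (530_000, 530_200), (559_000, 560_000),
]
composition = region_composition(1_000_000, toy_genes, toy_exons)
```

Merging intervals first matters in practice, because overlapping transcripts of the same gene would otherwise be counted more than once.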
Repetitive elements dominate many eukaryotic genomes. In humans, transposable elements and their remnants constitute about 45% of the genome. Long interspersed nuclear elements (LINEs, particularly LINE-1) and short interspersed nuclear elements (SINEs, particularly Alu elements) are the most abundant. These sequences are mostly inactive fossils of past transposition events, but some remain active and contribute to ongoing genomic variation. Tandem repeats (microsatellites and minisatellites) are another category, used extensively in forensic genetics and population studies because their repeat copy number varies widely between individuals.
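Both kinds of repeat discussed above can be explored with a few lines of code. The sketch below makes two assumptions: that the input sequence is soft-masked (repeats written in lowercase, a convention used in genome browser FASTA downloads), and that a simple regular expression for perfect tandem repeats is good enough for illustration. It is not a substitute for dedicated repeat annotation tools, and the function names and thresholds are hypothetical.

```python
import re


def soft_masked_fraction(sequence):
    """Fraction of A/C/G/T bases that are lowercase (soft-masked repeats).

    Ambiguous bases such as N are ignored in both numerator and denominator.
    """
    masked = sum(1 for base in sequence if base in "acgt")
    unmasked = sum(1 for base in sequence if base in "ACGT")
    total = masked + unmasked
    return masked / total if total else 0.0


def find_microsatellites(sequence, min_unit=2, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats.

    The lazy quantifier prefers the shortest repeat unit. Imperfect
    repeats are not detected, and a long homopolymer run can be
    reported as a two-base unit such as "AA".
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence: a (CA)6 microsatellite flanked by unique sequence.
toy_sequence = "GGAT" + "CA" * 6 + "TTG"
toy_hits = find_microsatellites(toy_sequence)
```

Counting copies of the repeat unit at a fixed locus is the kind of measurement that underlies the forensic and population-genetic uses mentioned above, although real genotyping relies on far more robust methods.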
The variation in genome organization across species is dramatic and informative. Bacterial genomes are compact: mostly coding, with essentially no spliceosomal introns and little repetitive DNA. Yeast genomes are intermediate: Saccharomyces cerevisiae packs roughly 6,000 genes into about 12 megabases, and only a small minority of those genes contain introns. Plant genomes are often enormous due to whole-genome duplications and transposon proliferation (maize is ~85% repetitive). This variation means that genomics tools and approaches must be tuned to the specific genome being studied: gene prediction algorithms trained on compact genomes perform poorly on repeat-rich mammalian genomes, and assembly strategies that work for bacteria fail on polyploid plants. Genome structure is not just background knowledge; it directly shapes every computational analysis performed on genomic data.