Gene prediction identifies the locations and structures of genes within a genome sequence. Ab initio methods use statistical models (often hidden Markov models) trained on known gene features — start codons, splice sites, codon usage bias, and stop codons — to predict genes from sequence alone. Evidence-based methods use experimental data (ESTs, RNA-seq, protein alignments) to confirm or refine predictions. Modern annotation pipelines combine both approaches, integrating computational predictions with transcript evidence and cross-species homology to produce high-confidence gene models.
Take a 100-kb unannotated bacterial sequence and find open reading frames using simple criteria (start codon, no internal stop, length threshold). Then try the same on a eukaryotic sequence and observe how introns make simple ORF-finding fail. Compare your manual predictions against an automated pipeline's output.
A newly assembled genome is essentially a very long string of A, T, G, and C. The immediate question is: where are the genes? Gene prediction, also called gene finding, is the process of identifying the positions, boundaries, and structures of all genes in a genome sequence. It is the structural core of genome annotation, which goes on to assign functions to the predicted genes. The approaches and difficulty vary enormously between prokaryotes and eukaryotes.
In prokaryotes, gene prediction is relatively straightforward. Genes are contiguous (no introns), tightly packed, and account for 85-95% of the genome. An open reading frame (ORF) is a stretch of DNA running from a start codon (usually ATG, less often GTG or TTG in bacteria) to the first in-frame stop codon (TAA, TAG, or TGA). An ORF that exceeds a length threshold (typically ~300 bp) is very likely a real gene. Tools such as Prodigal and Glimmer use additional signals, including ribosome binding sites (Shine-Dalgarno sequences) and codon usage statistics, to distinguish real genes from chance ORFs with high accuracy, typically predicting 95-99% of genes correctly, although the exact start codon is harder to call than the presence of the gene.
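The simple criteria above (start codon, no internal stop, length threshold) can be sketched in a few lines of Python. This is a minimal illustration, not how Prodigal or Glimmer work: it ignores ribosome binding sites and codon usage, and the names find_orfs, min_len, and start_codons are chosen here for the example. It scans all six reading frames, keeps the earliest start codon in each ORF, and reports 0-based, half-open coordinates on the forward strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=300, start_codons=("ATG",)):
    """Find ORFs of at least min_len bp (stop codon included) in all six frames.

    Returns a sorted list of (start, end, strand) tuples, with 0-based,
    half-open coordinates given on the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the earliest start codon in the open ORF
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in start_codons:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((start, end, "+"))
                        else:
                            # Map reverse-strand coordinates back to the forward strand.
                            orfs.append((n - end, n - start, "-"))
                    start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

Calling find_orfs(sequence) applies the ~300 bp threshold from the text; alternative start codons can be supplied through the start_codons argument. Run on a eukaryotic gene, the same function will typically break the coding sequence at the first in-frame stop codon inside an intron, which is why this approach does not carry over.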
Eukaryotic gene prediction is fundamentally harder. Genes are split into exons and introns, so the coding sequence is scattered across genomic DNA. A human gene might span 50 kb of DNA but produce an mRNA of only 2 kb after splicing. The predictor must identify each exon, correctly call the splice donor (GT) and acceptor (AG) sites at intron boundaries, and assemble the right combination of exons. It must do all this even though GT and AG dinucleotides occur frequently by chance: in random sequence with equal base frequencies, each is expected about once every 16 bp. Ab initio methods like Augustus and GeneMark use hidden Markov models (HMMs) that model the statistical properties of exons, introns, intergenic regions, and splice sites to find the most probable gene structure. But accuracy from sequence alone is limited, especially for short exons, long introns, and genes with non-canonical features.
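To make "find the most probable gene structure" concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is only a sketch of the general idea: real gene finders use many more states (exons in each frame, introns, splice sites, intergenic regions) and parameters trained on known genes. The two states, the GC-rich versus AT-rich emission probabilities, and all numbers below are assumptions made up for this illustration; they are not taken from Augustus or GeneMark.

```python
import math

# Toy two-state HMM. Every probability here is an illustrative assumption,
# not a parameter from any trained gene finder.
STATES = ("N", "C")  # N = non-coding-like (AT-rich), C = coding-like (GC-rich)
START = {"N": 0.9, "C": 0.1}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
EMIT = {
    "N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of 'N'/'C' labels."""
    if not seq:
        return ""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling viterbi on a sequence (upper-case A, C, G, T only) returns one label per base. The transition probabilities act as a cost for switching state, so a region is labelled "C" only when enough consecutive bases favour it. The same trade-off explains why short exons are hard to detect from sequence alone.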
Modern genome annotation therefore combines multiple lines of evidence. RNA-seq data shows which parts of the genome are transcribed in the sampled tissues and conditions and, through junction-spanning reads, how exons are spliced. Protein alignments from related species identify conserved coding regions. Known protein domains (from databases like Pfam) flag functional elements. Pipelines like MAKER and BRAKER integrate ab initio predictions with all available evidence, scoring each gene model by the amount of supporting data. The result is not a single answer but a ranked set of gene models with confidence levels, reflecting the reality that gene annotation is an ongoing refinement process that improves as more evidence accumulates.
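The idea of ranking gene models by supporting data can be sketched as follows. This is a deliberately simplified, hypothetical scoring scheme and not the method used by MAKER or BRAKER: each model is scored by the fraction of its exon bases overlapped by at least one evidence interval (for example, an RNA-seq alignment block). The function names, the interval representation (0-based, half-open tuples), and the assumption that exons within a model do not overlap are all choices made for this example.

```python
def covered_bases(exons, evidence):
    """Count exon bases overlapped by at least one evidence interval.

    Intervals are (start, end) tuples, 0-based and half-open.
    Exons within one model are assumed not to overlap each other.
    """
    # Merge overlapping or adjacent evidence intervals so no base is counted twice.
    merged = []
    for start, end in sorted(evidence):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    total = 0
    for ex_start, ex_end in exons:
        for ev_start, ev_end in merged:
            overlap = min(ex_end, ev_end) - max(ex_start, ev_start)
            if overlap > 0:
                total += overlap
    return total


def support_score(exons, evidence):
    """Fraction of exon bases supported by evidence (0.0 to 1.0)."""
    exon_len = sum(end - start for start, end in exons)
    return covered_bases(exons, evidence) / exon_len if exon_len else 0.0


def rank_models(models, evidence):
    """Return model names ordered from best to worst supported."""
    return sorted(models,
                  key=lambda name: support_score(models[name], evidence),
                  reverse=True)
```

Here models is a dictionary mapping a model name to its list of exon intervals. A real pipeline would also weigh splice-junction agreement, protein homology, and the ab initio score itself, but the output has the same shape as described above: an ordered set of candidate models, not a single answer.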