
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Gene Prediction and Annotation

Graduate level, depth 243 in the knowledge graph

36 topics build on this

1,563 prerequisites beneath it


Builds on: Genome Structure and Organization; Transcription: DNA to RNA; +3 more
Leads to: Functional Annotation; RNA-seq Analysis Pipeline

Core Idea

Gene prediction identifies the locations and structures of genes within a genome sequence. Ab initio methods use statistical models (often hidden Markov models) trained on known gene features — start codons, splice sites, codon usage bias, and stop codons — to predict genes from sequence alone. Evidence-based methods use experimental data (ESTs, RNA-seq, protein alignments) to confirm or refine predictions. Modern annotation pipelines combine both approaches, integrating computational predictions with transcript evidence and cross-species homology to produce high-confidence gene models.

How It's Best Learned

Take a 100-kb unannotated bacterial sequence and find open reading frames using simple criteria (start codon, no internal stop, length threshold). Then try the same on a eukaryotic sequence and observe how introns make simple ORF-finding fail. Compare your manual predictions against an automated pipeline's output.
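The exercise above can be started with a few lines of Python. The following is a minimal sketch, not a production gene finder: it assumes ATG is the only start codon, scans all six reading frames, and reports ORFs by the simple criteria named in the exercise. The function names and the `min_len` parameter are illustrative choices, and coordinates on the minus strand are relative to the reverse-complemented sequence.

```python
# Minimal six-frame ORF scan. Assumes the standard stop codons and ATG as the
# only start codon (an assumption; real prokaryotes also use other starts).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_len):
    """Yield (start, end, frame) for ATG..stop ORFs of at least min_len bp.

    Coordinates are 0-based, end-exclusive, relative to the given strand,
    and the stop codon is counted in the ORF length.
    """
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif codon in STOPS:
                if start is not None and i + 3 - start >= min_len:
                    yield start, i + 3, frame
                start = None  # a stop always closes the current ORF


def find_orfs(seq, min_len=300):
    """Return ORFs on both strands as (strand, start, end, frame) tuples."""
    seq = seq.upper()
    hits = [("+", *orf) for orf in orfs_in_strand(seq, min_len)]
    hits += [("-", *orf) for orf in orfs_in_strand(reverse_complement(seq), min_len)]
    return hits
```

Running this on a bacterial sequence and then on a eukaryotic one makes the contrast concrete: in the eukaryotic case the scan tends to break genes at introns or run through them into unrelated stop codons.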

Common Misconceptions

Finding an open reading frame does not mean you have found a gene — many ORFs occur by chance, especially short ones.
Gene prediction is not equally hard in all organisms: it is far easier in prokaryotes than in eukaryotes, because prokaryotic genes lack introns and have simpler regulatory structures.
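The point about chance ORFs can be made quantitative with a back-of-envelope calculation. The sketch below assumes a simplified null model in which all 64 codons are equally likely, so 3 of 64 are stops; real genomes have skewed base composition, so the numbers are only indicative. The function names are illustrative.

```python
import math


def prob_no_stop(n_codons, stop_fraction=3 / 64):
    """Probability that n random codons in a row contain no stop codon."""
    return (1 - stop_fraction) ** n_codons


def codons_for_probability(p, stop_fraction=3 / 64):
    """Smallest run length whose chance of being stop-free is at most p."""
    return math.ceil(math.log(p) / math.log(1 - stop_fraction))
```

Under this model a stop-free run of 33 codons (about 100 bp) arises by chance roughly one time in five, while a run of 100 codons (300 bp) arises less than 1% of the time, which is why length thresholds are effective at filtering out spurious short ORFs.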

Explainer

A newly assembled genome is essentially a very long string of A, T, G, and C. The immediate question is: where are the genes? Gene prediction, also called gene finding, is the process of identifying the positions, boundaries, and structures of all genes in a genome sequence. It is the structural first step of genome annotation, which goes on to assign functions to the predicted genes. The approaches and difficulty vary enormously between prokaryotes and eukaryotes.

In prokaryotes, gene prediction is relatively straightforward. Genes are contiguous (no introns), tightly packed, and typically account for 85-95% of the genome. An open reading frame (ORF), a stretch of DNA from a start codon (usually ATG, less often GTG or TTG) to an in-frame stop codon (TAA, TAG, or TGA), is very likely a real gene if it exceeds a length threshold (typically ~300 bp). Tools such as Prodigal and Glimmer use additional signals, including ribosome binding sites (Shine-Dalgarno sequences) and codon usage statistics, to distinguish real genes from chance ORFs with high accuracy, typically predicting 95-99% of genes correctly.
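To illustrate how codon usage statistics can separate real genes from chance ORFs, here is a toy log-odds score. It is a sketch under stated assumptions and not the scoring used by Prodigal or Glimmer: codon frequencies are estimated from a set of trusted coding sequences, the background is a uniform distribution over all 64 codons, and the function names and `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter


def codon_counts(seqs):
    """Count in-frame codons across a list of trusted coding sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def codon_log_odds(orf, counts, pseudocount=1.0):
    """Average per-codon log2 odds of `orf` under the trained codon model
    versus a uniform background in which all 64 codons are equally likely."""
    total = sum(counts.values()) + 64 * pseudocount
    score, n = 0.0, 0
    for i in range(0, len(orf) - 2, 3):
        # Pseudocounts keep unseen codons from producing log(0).
        p = (counts[orf[i:i + 3]] + pseudocount) / total
        score += math.log2(p * 64)
        n += 1
    return score / n if n else 0.0
```

An ORF built from codons that are common in the training genes scores above zero, while one built from codons the organism rarely uses scores below zero. A candidate ORF that passes the length threshold but scores poorly here is more likely to be a chance ORF.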

Eukaryotic gene prediction is fundamentally harder. Genes are split into exons and introns, so the coding sequence is scattered across genomic DNA. A human gene might span 50 kb of DNA but produce an mRNA of only 2 kb after splicing. The predictor must identify each exon, correctly call the splice donor (GT) and acceptor (AG) sites at intron boundaries, and assemble the right combination of exons — all while dealing with the fact that GT and AG dinucleotides occur frequently by chance. Ab initio methods like Augustus and GeneMark use hidden Markov models (HMMs) that model the statistical properties of exons, introns, intergenic regions, and splice sites to find the most probable gene structure. But accuracy from sequence alone is limited, especially for short exons, long introns, and genes with non-canonical features.
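The idea of finding the most probable structure with an HMM can be shown with a deliberately tiny model. This sketch is not how Augustus or GeneMark are built: it uses only two states and per-base emissions, with no codon, splice-site, or length modeling, and every probability below is a hypothetical value chosen so that the "coding" state prefers G and C. What it does show is the Viterbi algorithm, which picks the single best state path through the sequence.

```python
import math

STATES = ("noncoding", "coding")
# Hypothetical parameters, chosen for illustration only.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    # best[s] is the log-probability of the best path ending in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Choose the predecessor state that maximizes the path score.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these parameters, `viterbi("ATATATAT" + "GCGCGCGC" + "ATATATAT")` labels the middle eight bases as coding, but the two-base GC stretch in `"ATATGCATAT"` stays noncoding because the evidence from two bases does not outweigh the cost of switching states twice. That is a small-scale version of why short exons are hard to detect from sequence alone.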

Modern genome annotation therefore combines multiple lines of evidence. RNA-seq data shows which parts of the genome are transcribed in the sampled tissues and conditions and, through junction-spanning reads, how exons are spliced. Protein alignments from related species identify conserved coding regions. Known protein domains (from databases like Pfam) flag functional elements. Pipelines like MAKER and BRAKER integrate ab initio predictions with all available evidence, scoring each gene model by the amount of supporting data. The result is not a single answer but a ranked set of gene models with confidence levels, reflecting the reality that gene annotation is an ongoing refinement process that improves as more evidence accumulates.
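One simple way to picture evidence-based scoring is to ask what fraction of a gene model's introns are confirmed by junction-spanning RNA-seq reads. The sketch below is a toy metric, not the scoring scheme of MAKER or BRAKER. It assumes exons are given as 0-based, end-exclusive `(start, end)` intervals and that junctions have already been extracted from alignments as a set of `(intron_start, intron_end)` pairs; all names are hypothetical.

```python
def introns_of(exons):
    """Introns implied by a list of (start, end) exon intervals.

    An intron is the gap between two consecutive exons after sorting.
    """
    exons = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def junction_support(exons, junctions):
    """Fraction of a model's introns confirmed by spliced-read junctions.

    Single-exon models have no introns to confirm, so this returns None.
    """
    introns = introns_of(exons)
    if not introns:
        return None
    confirmed = sum(1 for intron in introns if intron in junctions)
    return confirmed / len(introns)


def rank_models(models, junctions):
    """Sort (name, exons) models by junction support, best first.

    Models with no introns are placed last because they cannot be scored
    by this particular kind of evidence.
    """
    def key(model):
        support = junction_support(model[1], junctions)
        return -1.0 if support is None else support
    return sorted(models, key=key, reverse=True)
```

A model whose exon boundaries all match observed junctions ranks above a competing model with one misplaced boundary. A real pipeline would combine this with other evidence such as read coverage and protein alignments, since junction support alone says nothing about single-exon genes or genes not expressed in the sampled conditions.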
