A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Variant Calling and Genome-Wide Association Studies

Research Depth 243 in the knowledge graph ☐ I know this ☆ Set as goal

6topics build on this

1,429prerequisites beneath it

DNA Sequencing Technologies Population Genetics and Hardy-Weinberg Equilibrium +2 more→→Machine Learning in Genomics Multi-Omics Integration +2 more

Core Idea

Variant calling identifies positions where an individual's genome differs from a reference sequence, detecting single nucleotide variants (SNVs), small insertions/deletions (indels), and structural variants. Tools like GATK HaplotypeCaller use Bayesian models that integrate base quality scores, mapping quality, and local realignment to distinguish true variants from sequencing errors. Genome-wide association studies (GWAS) test whether any of these variants are statistically associated with a phenotype (disease, trait) across a population, typically testing millions of SNPs and correcting for multiple testing using a genome-wide significance threshold of p < 5e-8. Associated variants identify genomic regions, not necessarily causal genes.

How It's Best Learned

Walk through the GATK Best Practices pipeline on a small dataset: align reads, mark duplicates, call variants, and filter. Then examine a published GWAS Manhattan plot and trace one significant peak to its genomic context — what genes are nearby? Is the lead SNP coding or regulatory? Is the causal variant known?

Common Misconceptions

A GWAS hit does not identify the causal variant — it identifies a region in linkage disequilibrium with the causal variant, which may be a noncoding regulatory element rather than a protein-coding change.
Variant calling accuracy depends heavily on sequencing depth; calling heterozygous variants reliably typically requires 30x coverage or more.

Explainer

Every human genome contains roughly 4-5 million positions where it differs from the reference sequence. Identifying these variants and determining which ones influence health and traits are two of the central tasks of modern genomics. Variant calling is the computational process of finding the variants; GWAS is the statistical framework for linking them to phenotypes.

Variant calling starts with aligned sequencing reads (BAM files) and asks, at each genomic position, whether the observed reads support a variant. The challenge is that not every apparent difference is a real variant — sequencing errors (1% per base for Illumina), mapping errors (reads from paralogous regions assigned to the wrong location), and PCR duplicates (identical reads from amplification rather than independent sampling) all create false variant signals. The GATK Best Practices pipeline addresses each issue: reads are aligned with BWA-MEM, duplicates are marked (Picard), and HaplotypeCaller performs local de novo assembly of the reads in active regions, then evaluates all possible haplotypes using a pair-HMM to calculate genotype likelihoods. Variant quality score recalibration (VQSR) uses known true variants (from dbSNP, HapMap) as training data to separate true variants from artifacts.

GWAS tests the association between genetic variants and a phenotype across many individuals. The typical design genotypes hundreds of thousands of SNPs (using genotyping arrays) in thousands to millions of people, imputes additional variants using reference panels, and tests each SNP for association using linear or logistic regression, including covariates for population structure (principal components), age, sex, and other confounders. Results are displayed as Manhattan plots — genomic position on the x-axis, -log10(p-value) on the y-axis — where significant peaks rise above the genome-wide threshold of 5e-8.

A GWAS peak identifies a region associated with a trait, not a causal mechanism. The lead SNP is usually in linkage disequilibrium with many other variants, any of which could be causal. Most associations (~90%) fall in noncoding regions, suggesting regulatory rather than protein-coding effects. Fine-mapping methods (FINEMAP, SuSiE) use LD structure to narrow the set of potentially causal variants. Integration with epigenomic data (which regulatory elements are active in the relevant tissue?), eQTL data (which variants affect gene expression?), and functional validation experiments is typically required to go from a statistical association to a biological mechanism. Despite these challenges, GWAS has identified thousands of robust trait-associated loci, transformed our understanding of the genetic architecture of complex diseases, and forms the foundation for polygenic risk scores used in personalized medicine.

Practice Questions 3 questions