Variant calling identifies positions where an individual's genome differs from a reference sequence, detecting single nucleotide variants (SNVs), small insertions/deletions (indels), and structural variants. Tools like GATK HaplotypeCaller use Bayesian models that integrate base quality scores, mapping quality, and local realignment to distinguish true variants from sequencing errors. Genome-wide association studies (GWAS) test whether any of these variants are statistically associated with a phenotype (disease, trait) across a population, typically testing millions of SNPs and correcting for multiple testing using a genome-wide significance threshold of p < 5e-8. Associated variants identify genomic regions, not necessarily causal genes.
Walk through the GATK Best Practices pipeline on a small dataset: align reads, mark duplicates, call variants, and filter. Then examine a published GWAS Manhattan plot and trace one significant peak to its genomic context — what genes are nearby? Is the lead SNP coding or regulatory? Is the causal variant known?
Every human genome contains roughly 4-5 million positions where it differs from the reference sequence. Identifying these variants and determining which ones influence health and traits are two of the central tasks of modern genomics. Variant calling is the computational process of finding the variants; GWAS is the statistical framework for linking them to phenotypes.
Variant calling starts with aligned sequencing reads (BAM files) and asks, at each genomic position, whether the observed reads support a variant. The challenge is that not every apparent difference is a real variant — sequencing errors (1% per base for Illumina), mapping errors (reads from paralogous regions assigned to the wrong location), and PCR duplicates (identical reads from amplification rather than independent sampling) all create false variant signals. The GATK Best Practices pipeline addresses each issue: reads are aligned with BWA-MEM, duplicates are marked (Picard), and HaplotypeCaller performs local de novo assembly of the reads in active regions, then evaluates all possible haplotypes using a pair-HMM to calculate genotype likelihoods. Variant quality score recalibration (VQSR) uses known true variants (from dbSNP, HapMap) as training data to separate true variants from artifacts.
GWAS tests the association between genetic variants and a phenotype across many individuals. The typical design genotypes hundreds of thousands of SNPs (using genotyping arrays) in thousands to millions of people, imputes additional variants using reference panels, and tests each SNP for association using linear or logistic regression, including covariates for population structure (principal components), age, sex, and other confounders. Results are displayed as Manhattan plots — genomic position on the x-axis, -log10(p-value) on the y-axis — where significant peaks rise above the genome-wide threshold of 5e-8.
A GWAS peak identifies a region associated with a trait, not a causal mechanism. The lead SNP is usually in linkage disequilibrium with many other variants, any of which could be causal. Most associations (~90%) fall in noncoding regions, suggesting regulatory rather than protein-coding effects. Fine-mapping methods (FINEMAP, SuSiE) use LD structure to narrow the set of potentially causal variants. Integration with epigenomic data (which regulatory elements are active in the relevant tissue?), eQTL data (which variants affect gene expression?), and functional validation experiments is typically required to go from a statistical association to a biological mechanism. Despite these challenges, GWAS has identified thousands of robust trait-associated loci, transformed our understanding of the genetic architecture of complex diseases, and forms the foundation for polygenic risk scores used in personalized medicine.