Variant Calling and Genome-Wide Association Studies

Research Depth 180 in the knowledge graph
Unlocks 6 downstream topics
variant-calling GWAS SNP GATK Manhattan-plot linkage-disequilibrium

Core Idea

Variant calling identifies positions where an individual's genome differs from a reference sequence, detecting single nucleotide variants (SNVs), small insertions/deletions (indels), and structural variants. Tools like GATK HaplotypeCaller use Bayesian models that integrate base quality scores, mapping quality, and local reassembly of reads to distinguish true variants from sequencing errors. Genome-wide association studies (GWAS) test whether any of these variants are statistically associated with a phenotype (disease, trait) across a population, typically testing millions of SNPs and correcting for multiple testing using a genome-wide significance threshold of p < 5e-8, which corresponds to a Bonferroni correction of 0.05 for roughly one million independent tests. Associated variants identify genomic regions, not necessarily causal genes.

How It's Best Learned

Walk through the GATK Best Practices pipeline on a small dataset: align reads, mark duplicates, call variants, and filter. Then examine a published GWAS Manhattan plot and trace one significant peak to its genomic context — what genes are nearby? Is the lead SNP coding or regulatory? Is the causal variant known?
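The walkthrough above can be sketched as a short list of shell commands. The Python helper below only builds the command strings and runs nothing. The file names, the sample name, and the single hard filter (QD < 2.0) are illustrative assumptions, not a complete Best Practices workflow. It also assumes the reference FASTA has already been indexed for BWA, samtools, and GATK.

```python
import shlex


def build_pipeline(ref, fq1, fq2, sample):
    """Return shell command strings for a minimal germline variant-calling run.

    Nothing is executed; the strings are meant to be read or pasted into a
    shell. All file names are hypothetical placeholders.
    """
    q = shlex.quote
    # Read group header; bwa expects the literal two-character sequence "\t".
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    raw_vcf = f"{sample}.vcf.gz"
    filtered_vcf = f"{sample}.filtered.vcf.gz"
    return [
        # 1. Align paired-end reads with BWA-MEM and coordinate-sort the output.
        f"bwa mem -R {q(read_group)} {q(ref)} {q(fq1)} {q(fq2)}"
        f" | samtools sort -o {q(sorted_bam)} -",
        # 2. Mark duplicate reads so they are not counted as independent evidence.
        f"gatk MarkDuplicates -I {q(sorted_bam)} -O {q(dedup_bam)}"
        f" -M {q(sample + '.dup_metrics.txt')}",
        # 3. Index the BAM so the caller can access regions at random.
        f"samtools index {q(dedup_bam)}",
        # 4. Call SNVs and indels.
        f"gatk HaplotypeCaller -R {q(ref)} -I {q(dedup_bam)} -O {q(raw_vcf)}",
        # 5. Flag low-confidence calls with one example hard filter (assumed).
        f"gatk VariantFiltration -R {q(ref)} -V {q(raw_vcf)}"
        f" --filter-expression {q('QD < 2.0')} --filter-name QD2"
        f" -O {q(filtered_vcf)}",
    ]


for command in build_pipeline("ref.fa", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1"):
    print(command)
```

Printing the commands before running them makes it easy to check each intermediate file name against the step that consumes it.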

Explainer

Every human genome contains roughly 4-5 million positions where it differs from the reference sequence. Identifying these variants and determining which ones influence health and traits are two of the central tasks of modern genomics. Variant calling is the computational process of finding the variants; GWAS is the statistical framework for linking them to phenotypes.

Variant calling starts with aligned sequencing reads (BAM files) and asks, at each genomic position, whether the observed reads support a variant. The challenge is that not every apparent difference is a real variant: sequencing errors (roughly 0.1-1% per base for Illumina), mapping errors (reads from paralogous regions assigned to the wrong location), and PCR duplicates (identical reads from amplification rather than independent sampling) all create false variant signals. The GATK Best Practices pipeline addresses each issue: reads are aligned with BWA-MEM, duplicates are marked (Picard MarkDuplicates), base quality scores are recalibrated (BQSR), and HaplotypeCaller performs local de novo assembly of the reads in active regions, then evaluates the candidate haplotypes using a pair-HMM to calculate genotype likelihoods. Variant quality score recalibration (VQSR) trains on sites from trusted resources (such as HapMap and the 1000 Genomes Project) to separate true variants from artifacts; dbSNP is typically supplied only to annotate known sites, not as training data.
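To make "genotype likelihoods" concrete, here is a minimal sketch of a per-site pileup model for a diploid, biallelic site. It is deliberately simpler than HaplotypeCaller, which assembles haplotypes and scores reads with a pair-HMM. Here each read base is treated independently, and its Phred quality is taken as the probability that the base is wrong. The function names and the uniform split of errors across the three other bases are assumptions made for illustration.

```python
import math


def phred_to_error(quality):
    """Convert a Phred score to an error probability (Q30 -> 0.001)."""
    return 10 ** (-quality / 10)


def genotype_log10_likelihoods(pileup, ref, alt):
    """Return log10 P(reads | genotype) for 0, 1, and 2 copies of alt.

    pileup is a list of (base, phred_quality) pairs covering one site.
    Qualities are assumed to be greater than zero.
    """
    log_likelihoods = []
    for alt_copies in (0, 1, 2):
        frac_alt = alt_copies / 2  # chance a read was sampled from an alt chromosome
        total = 0.0
        for base, quality in pileup:
            err = phred_to_error(quality)
            # Probability of seeing this base if the read came from each allele.
            p_if_ref = 1 - err if base == ref else err / 3
            p_if_alt = 1 - err if base == alt else err / 3
            total += math.log10(frac_alt * p_if_alt + (1 - frac_alt) * p_if_ref)
        log_likelihoods.append(total)
    return log_likelihoods


# Five reference reads and five alternate reads, all at Q30.
balanced = [("A", 30)] * 5 + [("G", 30)] * 5
print(genotype_log10_likelihoods(balanced, ref="A", alt="G"))
```

A real caller combines these likelihoods with genotype priors; without a prior, a single high-quality mismatch in a shallow pileup can already favor the heterozygous genotype, which is one reason depth and filtering matter.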

GWAS tests the association between genetic variants and a phenotype across many individuals. The typical design genotypes hundreds of thousands of SNPs (using genotyping arrays) in thousands to millions of people, imputes additional variants using reference panels, and tests each SNP for association using linear regression (quantitative traits) or logistic regression (case-control traits), including covariates for population structure (principal components), age, sex, and other confounders. Results are displayed as Manhattan plots, with genomic position on the x-axis and -log10(p-value) on the y-axis, where significant peaks rise above the genome-wide threshold of p = 5e-8 (about 7.3 on the -log10 scale).
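The per-SNP test can be sketched in a few lines. The example below regresses a quantitative phenotype on genotype dosage (0, 1, or 2 copies of the alternate allele) and converts the p-value into the height plotted on a Manhattan plot. It is a simplified illustration: it omits the covariates described above, uses a normal approximation for the p-value (reasonable only for large samples), and all function names are hypothetical.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # 0.05 divided by about one million independent tests


def snp_association(dosages, phenotype):
    """Simple linear regression of phenotype on dosage.

    Returns (beta, standard_error, p_value). Assumes the SNP is polymorphic
    in the sample and that the fit is not perfect (nonzero residuals).
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotype))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    z = beta / standard_error
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, standard_error, p_value


def manhattan_height(p_value):
    """y-axis value on a Manhattan plot."""
    return -math.log10(p_value)


# Toy data: 300 individuals, true effect of 0.5 per allele plus deterministic noise.
dosages = [i % 3 for i in range(300)]
noise = [((7 * i) % 11 - 5) / 10 for i in range(300)]
phenotype = [0.5 * x + e for x, e in zip(dosages, noise)]

beta, se, p = snp_association(dosages, phenotype)
print(beta, se, p, p < GENOME_WIDE_ALPHA)
print(manhattan_height(GENOME_WIDE_ALPHA))
```

In a real analysis the same test is repeated for every variant, usually with a dedicated tool and a regression model that includes principal components and other covariates.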

A GWAS peak identifies a region associated with a trait, not a causal mechanism. The lead SNP is usually in linkage disequilibrium (LD) with many other variants, any of which could be causal. Most associations (roughly 90%) fall in noncoding regions, suggesting regulatory rather than protein-coding effects. Fine-mapping methods (FINEMAP, SuSiE) use LD structure to narrow the set of potentially causal variants. Integration with epigenomic data (which regulatory elements are active in the relevant tissue?), eQTL data (which variants affect gene expression?), and functional validation experiments is typically required to go from a statistical association to a biological mechanism. Despite these challenges, GWAS has identified thousands of robust trait-associated loci, transformed our understanding of the genetic architecture of complex diseases, and laid the foundation for polygenic risk scores used in personalized medicine.
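Linkage disequilibrium between two SNPs is commonly summarized as r², the squared correlation between their alleles on the same haplotype. The sketch below computes it from phased haplotypes coded as 0 (reference) and 1 (alternate). The function name and the toy data are illustrative; real analyses usually work from genotype files with dedicated tools.

```python
def ld_r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased haplotypes (lists of 0/1).

    hap_a[i] and hap_b[i] are the alleles carried by haplotype i at SNP A
    and SNP B. Both SNPs must be polymorphic in the sample.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotype lists must have the same length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # alt frequency at SNP A
    p_b = sum(hap_b) / n                                  # alt frequency at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # both alt on one haplotype
    d = p_ab - p_a * p_b                                  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return d * d / denom


# Alleles that always travel together versus alleles that are independent.
print(ld_r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

When r² between the lead SNP and a neighbor is close to 1, the association statistics alone cannot distinguish the two, which is why fine-mapping and functional data are needed.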
