Systems biology studies biological processes as integrated systems rather than isolated components, using computational models to understand how genes, proteins, metabolites, and their interactions give rise to cellular behavior. Data integration combines multiple omics datasets (transcriptomics, proteomics, metabolomics, epigenomics) with pathway databases and interaction networks to build holistic models. Key analytical approaches include pathway enrichment analysis (Gene Ontology, KEGG), network-based analysis (protein-protein interaction networks, gene co-expression networks), and constraint-based metabolic modeling (flux balance analysis). The goal is to move from lists of differentially expressed genes to mechanistic understanding of biological processes.
Take a list of differentially expressed genes from an RNA-seq experiment and perform Gene Ontology enrichment analysis (using clusterProfiler or DAVID). Then map the same genes onto KEGG pathways and a protein-protein interaction network (STRING). Compare what each approach reveals and note how they complement each other.
Individual omics experiments produce lists: differentially expressed genes, altered metabolites, modified proteins. But biology operates as interconnected systems, not lists. Systems biology aims to understand how the interactions between molecular components produce the behaviors of cells, tissues, and organisms. Data integration — combining multiple types of molecular measurements with prior knowledge about pathways and interactions — is the central computational challenge.
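Pathway enrichment analysis is usually the first integration step. Given a list of differentially expressed genes, enrichment analysis asks: are any known biological pathways or functional categories disproportionately represented? Gene Ontology (GO) provides a hierarchical vocabulary of biological processes, molecular functions, and cellular components. KEGG provides curated metabolic and signaling pathway maps. Reactome provides detailed reaction-level pathway models. Over-representation analysis (ORA) tests each pathway with a hypergeometric test, comparing the overlap between the gene list and the pathway against what would be expected by chance given a background set of all measured genes. Gene Set Enrichment Analysis (GSEA) instead ranks all genes by their expression change and tests whether pathway members cluster at the top or bottom of the ranking, so it does not require a significance cutoff to define the gene list. Because many pathways are tested at once, the resulting p-values should be corrected for multiple testing (for example with the Benjamini-Hochberg procedure). These approaches reduce a gene list to a short set of candidate biological processes to interpret.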
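The ORA calculation can be sketched in a few lines of Python. This is a minimal illustration, not how clusterProfiler or DAVID implement it: the gene identifiers, the pathway names (`pathway_A`, `pathway_B`), and the function names are hypothetical, and the background is assumed to be the full set of measured genes. The Benjamini-Hochberg adjustment is included because many pathways are tested at once.

```python
from math import comb

def hypergeom_tail(k, M, n, N):
    """P(X >= k) when N genes are drawn without replacement from M genes,
    n of which belong to the pathway (one-sided hypergeometric test)."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def ora(de_genes, pathways, background):
    """Return {pathway: (overlap size, p-value)} for each pathway."""
    background = set(background)
    de = set(de_genes) & background          # only count genes that were measured
    results = {}
    for name, members in pathways.items():
        members = set(members) & background
        k = len(de & members)
        p = hypergeom_tail(k, len(background), len(members), len(de))
        results[name] = (k, p)
    return results

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy data (hypothetical): 20 measured genes, 5 of them differentially expressed.
background = [f"g{i}" for i in range(20)]
de_genes = ["g0", "g1", "g2", "g3", "g10"]
pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "pathway_B": ["g10", "g11", "g12", "g13", "g14"],
}

results = ora(de_genes, pathways, background)
names = list(results)
adjusted = benjamini_hochberg([results[name][1] for name in names])
for name, q in zip(names, adjusted):
    k, p = results[name]
    print(f"{name}: overlap={k}, p={p:.4g}, BH-adjusted p={q:.4g}")
```

In this toy case four of the five DE genes fall in `pathway_A`, which yields a small p-value, while the single overlap with `pathway_B` is what chance alone would readily produce. The choice of background matters in practice: using the whole genome instead of the genes actually measured tends to inflate significance.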
Network analysis adds another dimension. Protein-protein interaction (PPI) networks from databases like STRING and BioGRID map the physical and functional connections between proteins. Gene co-expression networks (built from expression data such as RNA-seq, for example with WGCNA) identify modules of genes whose expression varies together across samples. Overlaying differential expression data onto these networks reveals which modules are perturbed and identifies hub genes: highly connected nodes whose disruption is likely to affect many interaction partners. Network propagation algorithms spread experimental signal through the network, identifying genes that are not themselves differentially expressed but are strongly connected to genes that are, potentially revealing upstream regulators or downstream effectors.
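One common form of network propagation is a random walk with restart, sketched below on a toy network. This is an illustrative assumption rather than the method of any particular tool: the five-node network, the gene labels A to E, the seed set, and the restart probability of 0.5 are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-12, max_iter=1000):
    """Propagate signal from seed genes over an undirected network.

    adjacency: dict mapping each node to a list of its neighbors.
    seeds: set of nodes carrying the experimental signal (e.g. DE genes).
    Returns a dict of steady-state visiting probabilities.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbors = adjacency[u]
            if not neighbors:
                # Isolated node: its probability mass stays in place.
                nxt[u] += (1 - restart) * p[u]
                continue
            # Otherwise the walker moves to a uniformly chosen neighbor.
            share = (1 - restart) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy chain network (hypothetical): A - B - C - D - E
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
seeds = {"A", "C"}  # pretend these two genes are differentially expressed

scores = random_walk_with_restart(adjacency, seeds)
for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: {score:.4f}")
```

Gene B is not a seed, yet it sits between both seeds and therefore scores well above D and E. That is the behavior the text describes: a gene that is not itself differentially expressed is prioritized because of its neighborhood. A higher restart probability keeps the signal closer to the seeds; a lower one lets it diffuse further.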
Multi-omics integration is the current frontier. Combining transcriptomics, proteomics, metabolomics, and epigenomics from the same samples provides complementary views of the same biological system. Transcripts show regulatory changes, proteins show functional capacity, metabolites show biochemical output, and epigenomic marks show regulatory state. Statistical methods for integration range from simple (overlapping significant results from each layer) to sophisticated (multivariate methods like MOFA, network-based integration like iNetModules, and causal inference frameworks). The emerging view is that no single omics layer tells the full story: diseases, drug responses, and developmental processes are best understood by examining how perturbations propagate across molecular layers, from genome to phenome.
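The simplest integration strategy mentioned above, overlapping the significant results from each layer, can be sketched as follows. The gene names and layer contents are made up for illustration, and the sketch assumes every layer has already been mapped to a shared gene identifier (for metabolomics or epigenomics that mapping is itself a nontrivial step).

```python
def overlap_across_layers(significant_by_layer, min_layers=2):
    """Return genes flagged as significant in at least `min_layers` omics layers.

    significant_by_layer: dict mapping a layer name to an iterable of gene IDs.
    Returns {gene: sorted list of layers in which it was significant}.
    """
    support = {}
    for layer, genes in significant_by_layer.items():
        for gene in set(genes):  # de-duplicate within a layer
            support.setdefault(gene, []).append(layer)
    return {
        gene: sorted(layers)
        for gene, layers in support.items()
        if len(layers) >= min_layers
    }

# Hypothetical significant hits per layer, all keyed by gene symbol.
significant_by_layer = {
    "transcriptomics": ["GENE1", "GENE2", "GENE3", "GENE4"],
    "proteomics": ["GENE2", "GENE3", "GENE5"],
    "epigenomics": ["GENE3", "GENE6"],
}

shared = overlap_across_layers(significant_by_layer, min_layers=2)
for gene in sorted(shared, key=lambda g: (-len(shared[g]), g)):
    print(gene, "->", ", ".join(shared[gene]))
```

This approach is easy to interpret but conservative: a gene that narrowly misses the significance threshold in one layer is dropped entirely, which is one motivation for the multivariate methods that model all layers jointly.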