RNA-seq quantifies gene expression by sequencing the RNA molecules present in a sample. The standard analysis pipeline involves quality control of raw reads, alignment to a reference genome or transcriptome (using splice-aware aligners like STAR or HISAT2), quantification of reads per gene or transcript, and normalization to account for sequencing depth and gene length differences. Key normalization metrics include TPM (transcripts per million) for within-sample comparisons and methods like DESeq2's size factors for between-sample comparisons. The pipeline transforms raw sequencing data into a gene-by-sample expression matrix suitable for downstream analysis.
Process a small RNA-seq dataset end-to-end: run FastQC on raw reads, trim adapters with Trimmomatic, align to a reference with STAR, count reads per gene with featureCounts, and normalize. Compare raw counts to TPM values for a housekeeping gene versus a tissue-specific gene to understand why normalization matters.
RNA-seq has become the standard method for measuring gene expression genome-wide. Unlike microarrays, which measure a predetermined set of targets, RNA-seq sequences whatever RNA is present in the sample, providing a quantitative snapshot of the transcriptome that is not restricted to known probes. But going from raw sequencing reads to reliable expression estimates requires a multi-step pipeline, and each step involves decisions that affect the final results.
The pipeline begins with quality control and preprocessing. FastQC examines raw reads for adapter contamination, quality score distributions, GC content bias, and duplication levels, and MultiQC aggregates the per-sample reports into a single summary. Adapter sequences (ligated during library preparation) are trimmed, and low-quality bases are removed. This step is straightforward but essential: contaminated or low-quality reads introduce noise and waste computational resources in alignment.
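To make the idea of removing low-quality bases concrete, here is a minimal sketch of 3' quality trimming. It assumes Phred+33 quality encoding (the standard for current Illumina FASTQ files) and a simple "trim until the last base passes" rule; the function names and the threshold of 20 are illustrative choices, and real trimmers such as Trimmomatic use more elaborate sliding-window logic and also remove adapter sequence.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until the last base meets min_quality.

    Deliberately simplistic: no sliding window, no adapter removal.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIII###")
print(trimmed_seq, trimmed_qual)
```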
Alignment maps reads to their genomic origin. Because mRNA has been spliced, a read that spans an exon-exon junction must be split across the intron when it is aligned to the genome. Splice-aware aligners like STAR and HISAT2 use known splice site annotations (and can discover novel junctions) to handle these split reads correctly. An alternative approach, pseudoalignment and related lightweight mapping (kallisto, Salmon), skips genomic alignment entirely and quantifies expression by matching reads to a transcriptome reference, trading some information (genomic location) for dramatic speed improvements. The choice depends on whether downstream analyses need genomic coordinates (variant calling, splice analysis) or only gene/transcript quantification.
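In SAM/BAM output, a split read is recorded with an N operation in its CIGAR string, which marks the skipped intron. The sketch below shows one way to recover the aligned exon blocks from a start position and a CIGAR string. The function name and the example coordinates are hypothetical, and positions are treated as 0-based, half-open intervals for simplicity (SAM itself stores 1-based positions).

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(start, cigar):
    """Return reference intervals (start, end) for each contiguous aligned block.

    N (skipped region, i.e. an intron) closes the current block.
    M, =, X and D consume reference bases inside a block.
    I, S, H and P consume no reference bases and are ignored here.
    """
    blocks = []
    pos = start
    block_start = None
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "M=XD":
            if block_start is None:
                block_start = pos
            pos += length
        elif op == "N":
            if block_start is not None:
                blocks.append((block_start, pos))
                block_start = None
            pos += length
    if block_start is not None:
        blocks.append((block_start, pos))
    return blocks


# Hypothetical 100-base read spanning a 2,500-base intron.
print(aligned_blocks(1000, "60M2500N40M"))
```

For the example read, the first 60 bases align to one exon and the remaining 40 bases align 2,500 bases downstream, giving two separate blocks on the genome.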
Quantification counts how many reads map to each gene or transcript. Tools like featureCounts and HTSeq-count assign aligned reads to genomic features using gene annotation files. The output is a count matrix: rows are genes, columns are samples, and each entry is the number of reads observed for that gene in that sample. These raw counts must then be normalized to be interpretable. Within-sample normalization (TPM, FPKM) corrects for gene length and sequencing depth, enabling comparison of expression levels between genes in the same sample. Between-sample normalization (DESeq2's median-of-ratios, edgeR's TMM) adjusts for differences in library composition and size between samples, enabling differential expression analysis, which is the subject of the next topic.
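The median-of-ratios idea behind DESeq2's size factors can be sketched as follows. This is a simplified illustration in plain Python, not DESeq2's actual implementation: the function name and the toy count matrix are hypothetical, and genes with a zero count in any sample are simply excluded so that the geometric mean is defined.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Estimate per-sample size factors by the median-of-ratios approach.

    count_matrix is a list of gene rows, each a list of per-sample counts.
    """
    n_samples = len(count_matrix[0])
    # Keep only genes observed in every sample (log of zero is undefined).
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    # Log geometric mean of each gene across samples: a pseudo-reference sample.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the pseudo-reference, per gene.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        # The median is robust to a minority of strongly changed genes.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 is sequenced about twice as deeply as sample 1.
toy_counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 5],     # excluded: zero count in sample 1
    [10, 500],  # strongly changed gene, should not drive the estimate
]
factors = size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
print(factors)
```

Dividing each column by its size factor puts the samples on a common scale. Because the median is used, the one gene with a large expression change does not distort the estimate, which is the reason this approach handles differences in library composition better than scaling by total read count.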
The entire pipeline, from raw FASTQ files to a normalized expression matrix, can be run using workflow managers like Nextflow (nf-core/rnaseq) or Snakemake, which ensure reproducibility and handle the orchestration of multiple tools. Understanding each step is nonetheless essential, because parameter choices at every stage — alignment stringency, multi-mapping handling, counting mode, normalization method — affect the biological conclusions drawn from the data.