Explain why raw read counts cannot be directly compared between genes of different lengths to assess relative expression levels.
Think about your answer, then reveal below.
Model answer: A longer gene captures more sequencing reads simply because it presents a larger target for random fragmentation and sequencing, not because it is more highly expressed. A 10-kb gene will accumulate roughly 10 times more reads than a 1-kb gene at the same expression level. Without normalizing for gene length, raw counts systematically overestimate the expression of long genes relative to short ones. Length normalization (dividing counts by gene length) corrects this bias, allowing fair comparison of expression levels across genes of different sizes.
This is a fundamental sampling bias in RNA-seq. During library preparation, RNA molecules are fragmented, and to a first approximation each fragment has an equal probability of being sequenced. Longer transcripts produce more fragments, hence more reads, for the same number of original RNA molecules. This is why expression metrics meant for comparing genes within a sample (RPKM, FPKM, TPM) include a length normalization step, whereas metrics that correct only for sequencing depth, such as CPM, do not remove this bias.
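Worked example: the length bias can be checked with a few lines of Python. This is a minimal sketch, not a production pipeline. The gene names, lengths, and counts are made-up toy values, the helper functions `reads_per_kb` and `tpm` are hypothetical names introduced here, and annotated gene length is used in place of the effective transcript length that real tools typically estimate.

```python
def reads_per_kb(counts, lengths_bp):
    """Divide each gene's raw read count by its length in kilobases."""
    return {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}


def tpm(counts, lengths_bp):
    """Length-normalize first, then rescale so the values sum to one million."""
    rpk = reads_per_kb(counts, lengths_bp)
    total = sum(rpk.values())
    return {gene: value / total * 1e6 for gene, value in rpk.items()}


# Toy data (assumed values): two genes expressed at the same level,
# one ten times longer than the other.
lengths_bp = {"geneA": 10_000, "geneB": 1_000}
counts = {"geneA": 1_000, "geneB": 100}  # raw counts differ 10-fold

rpk = reads_per_kb(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print("raw counts:", counts)
print("reads per kb:", rpk)
print("TPM:", tpm_values)
```

The raw counts suggest geneA is ten times more highly expressed, but after dividing by length both genes have the same reads-per-kb value and the same TPM, which matches the premise that they are expressed at the same level.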