Question 1 Multiple Choice
In a bottom-up proteomics experiment, what is the database search step doing?
A. Aligning protein sequences to a reference genome
B. Matching observed peptide mass spectra to theoretical spectra generated from a protein sequence database
C. Searching for homologous proteins in other species
D. Identifying post-translational modifications by comparing to a known modification database
Answer: B
After LC-MS/MS, each peptide produces a fragmentation spectrum: a pattern of fragment ion masses. The database search engine (Mascot, Sequest, Andromeda) takes every protein in the database, computationally digests it into peptides, generates a theoretical fragmentation spectrum for each peptide, and compares these to the observed spectra. Each candidate is scored, and the best-scoring peptide-spectrum match (PSM) for each observed spectrum is evaluated for statistical significance. This is fundamentally a pattern-matching problem between observed data and a theoretical reference.
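The steps described above (in silico digestion, theoretical fragment generation, comparison to the observed spectrum) can be sketched in a few lines of Python. This is a toy illustration, not how Mascot, Sequest, or Andromeda are implemented: the function names, the two-protein database, the 0.02 Da tolerance, and the shared-peak-count score are all assumptions made for the example, and real engines also filter candidates by precursor mass and use far more sophisticated scoring.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this toy example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.007276
WATER = 18.010565


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def theoretical_fragments(peptide):
    """Singly charged b- and y-ion m/z values for one peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion (N-terminal)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion (C-terminal)
    return ions


def shared_peak_count(observed, theoretical, tol=0.02):
    """Toy score: observed peaks within `tol` Da of any theoretical ion."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def search(observed, database):
    """Return (score, peptide, protein) for the best-scoring candidate."""
    best = None
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            score = shared_peak_count(observed, theoretical_fragments(peptide))
            if best is None or score > best[0]:
                best = (score, peptide, name)
    return best


# Hypothetical two-protein database and a noise-free "observed" spectrum.
database = {"protA": "GAVKLSER", "protB": "TPEAKVLGR"}
observed = theoretical_fragments("VLGR")
best = search(observed, database)
print(best)
```

Here the observed spectrum is simulated from the peptide VLGR, so the search should recover that peptide from protB. Other arginine-terminated peptides share the y1 ion and therefore still score above zero, which is a small reminder of why the best match has to be tested for significance rather than accepted outright.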
Question 2 True / False
Protein abundance in a cell can be accurately predicted from mRNA expression levels alone.
T. True
F. False
Answer: False
The correlation between mRNA and protein levels is typically only 0.4-0.6, meaning transcript abundance explains less than half of the variance in protein abundance. Post-transcriptional regulation (miRNAs, RNA-binding proteins), differences in translation efficiency (codon usage, ribosome availability, mRNA structure), and differences in protein stability and degradation rates all contribute to the discrepancy. This is precisely why proteomics is necessary alongside transcriptomics — RNA-seq tells you what could be made, but proteomics tells you what is actually present.
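The step from "correlation of 0.4-0.6" to "less than half of the variance" is the usual squaring of the correlation coefficient. The short sketch below makes that arithmetic explicit, assuming the reported values are Pearson correlations and that a simple linear relationship is being considered; the function name is hypothetical.

```python
def variance_explained(r):
    """Fraction of variance explained under a simple linear model (r squared)."""
    return r * r


for r in (0.4, 0.5, 0.6):
    print(f"r = {r:.1f} -> r^2 = {variance_explained(r):.2f}")
```

Even at the upper end of the range, r = 0.6 corresponds to only about 36% of the variance in protein abundance.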
Question 3 Short Answer
Explain the concept of false discovery rate (FDR) in peptide identification and how the target-decoy approach controls it.
Model answer: In proteomics, every observed spectrum is matched to the best-scoring peptide in the database, but some matches will be incorrect, either because the spectrum came from a peptide not in the database or because noise produced a spurious match. The target-decoy approach controls FDR by searching against both the real (target) protein database and a shuffled or reversed (decoy) database of the same size. Any match to the decoy database is by definition a false positive, and because an incorrect match is assumed equally likely to land on a target or a decoy sequence, the number of decoy hits also estimates the number of false target hits. For a combined target-plus-decoy search, the FDR is therefore estimated as: (2 x decoy hits) / (total target and decoy hits above threshold). By adjusting the score threshold until the estimated FDR reaches the desired level (typically 1%), the analysis controls the proportion of false identifications in the final results.
This is analogous to the Benjamini-Hochberg correction in genomics but uses the decoy database as an empirical null distribution rather than a theoretical one. The target-decoy approach has become the standard in proteomics because it accounts for the specific characteristics of each dataset's noise rather than relying on distributional assumptions.
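The threshold selection described in the model answer can be sketched as follows. This is a minimal illustration that assumes a combined target-plus-decoy search, so "total hits" counts both target and decoy matches above the threshold; the function names and the toy score list are hypothetical and do not correspond to any particular search engine's output format.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as 2 * decoy hits / total hits at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs from a combined
    target-plus-decoy search (an assumption of this sketch).
    """
    accepted = [is_decoy for score, is_decoy in psms if score >= threshold]
    if not accepted:
        return 0.0
    return 2 * sum(accepted) / len(accepted)


def threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr."""
    best = None
    for t in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, t) <= max_fdr:
            best = t
    return best


# Toy data: 200 target hits with scores 50..249 and three low-scoring decoys.
targets = [(s, False) for s in range(50, 250)]
decoys = [(55, True), (52, True), (51, True)]
psms = targets + decoys

cutoff = threshold_for_fdr(psms, max_fdr=0.01)
print(cutoff, estimate_fdr(psms, cutoff))
```

In this toy dataset, any threshold low enough to admit the first decoy pushes the estimate just above 1%, so the selected cutoff sits immediately above the highest-scoring decoy.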