Questions: Multivariate Calibration: PLS and PCR Models
5 questions to test your understanding
Question 1 Multiple Choice
A chemist builds both a PCR model and a PLS model for predicting glucose concentration from near-infrared spectra of blood plasma. The plasma also strongly absorbs at wavelengths associated with albumin, which is unrelated to glucose. Which statement best explains why PLS typically achieves better glucose predictions with fewer components?
A. PLS normalizes the spectra first, removing albumin absorption automatically
B. PLS finds latent variables that maximize covariance with glucose concentration, so albumin-related spectral variation is deprioritized
C. PCR is mathematically invalid for overlapping spectra, making PLS the only valid choice
D. PLS uses more calibration samples than PCR, giving it an inherent accuracy advantage
Answer: B
PCR finds latent variables (principal components) that capture maximum spectral variance, but albumin's strong absorption is high-variance and will dominate the early PCs even though it is irrelevant to glucose. PLS instead finds latent variables that maximize the *covariance* between the spectra and the glucose concentration, so it prioritizes spectral patterns correlated with glucose and downweights albumin variation instead of modeling it first. Fewer components are needed because the leading components are directly relevant to the prediction task.
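The contrast between the two criteria can be checked on simulated spectra. The sketch below is illustrative only: the band shapes, the 5x albumin strength, the noise level, and the 8 x 8 factorial design are assumptions, not real NIR data. It compares the first principal component (the PCR direction) with the first PLS weight vector, measuring how well each aligns with the pure glucose and albumin bands.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical wavelength axis and pure-component band shapes (not real NIR data).
wl = np.linspace(0.0, 1.0, 100)
s_glucose = np.exp(-((wl - 0.3) / 0.05) ** 2)
s_albumin = np.exp(-((wl - 0.7) / 0.05) ** 2)

# Assumed 8 x 8 factorial design, so glucose and albumin levels are uncorrelated.
levels = np.linspace(0.0, 1.0, 8)
glucose = np.repeat(levels, 8)
albumin = np.tile(levels, 8)

# Assumption: albumin absorbs 5x more strongly than glucose; small noise is added.
X = np.outer(glucose, s_glucose) + 5.0 * np.outer(albumin, s_albumin)
X += 0.01 * rng.standard_normal(X.shape)

Xc = X - X.mean(axis=0)
yc = glucose - glucose.mean()

# PCR: the first principal component is the direction of maximum spectral variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# PLS: the first weight vector is the direction of maximum covariance with y.
w1 = Xc.T @ yc
w1 /= np.linalg.norm(w1)

def alignment(direction, pure_spectrum):
    """Absolute cosine between a latent direction and a pure-component spectrum."""
    return abs(direction @ pure_spectrum) / (
        np.linalg.norm(direction) * np.linalg.norm(pure_spectrum)
    )

pc1_glucose = alignment(pc1, s_glucose)
pc1_albumin = alignment(pc1, s_albumin)
pls_glucose = alignment(w1, s_glucose)
pls_albumin = alignment(w1, s_albumin)

print(f"PC1 alignment:    glucose={pc1_glucose:.3f}, albumin={pc1_albumin:.3f}")
print(f"PLS w1 alignment: glucose={pls_glucose:.3f}, albumin={pls_albumin:.3f}")
```

In this idealized design the first PC follows the albumin band while the first PLS weight follows the glucose band. With real calibration sets, chance correlation between analyte and interferent can blur the separation, which is one reason PLS usually still needs more than one component.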
Question 2 Multiple Choice
During cross-validation of a PLS model, the prediction error decreases as the number of latent variables increases from 1 to 6, reaches a minimum at 6 components, and then begins increasing. What is the best interpretation of this pattern?
A. The true underlying model has exactly 6 independent chemical factors contributing to the signal
B. The model overfits noise when more than 6 components are included, even though training error would continue to fall
C. Six components is the mathematical maximum for this dataset, so more cannot be added
D. The cross-validated error increasing after 6 components indicates the calibration samples are outliers
Answer: B
This is the classic bias-variance tradeoff in action. The training (calibration) error generally continues to fall as more components are added, because each new component can capture additional variance, including noise specific to the calibration set. Cross-validated error penalizes overfitting: when a component captures noise instead of real signal, it hurts prediction on left-out samples. The minimum cross-validated error at 6 components identifies the optimal model complexity: enough to capture real chemical signals without fitting idiosyncratic noise in the calibration data.
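The two error curves can be traced with a small PLS1 implementation. This is a minimal sketch, not a production routine: it uses the NIPALS algorithm on simulated data generated from three assumed chemical factors, and the helper names (`pls1_coefficients`, `predict`), the noise levels, and the 5-fold split are all illustrative choices.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Fit single-response PLS by NIPALS; return (x_mean, y_mean, beta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores along that direction
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        q_k = (yk @ t) / tt           # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - q_k * t             # deflate y
        W.append(w)
        P.append(p)
        q.append(q_k)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, beta

def predict(model, X):
    x_mean, y_mean, beta = model
    return y_mean + (X - x_mean) @ beta

# Simulated calibration set: 3 assumed factors, only the first drives y.
rng = np.random.default_rng(1)
n, p, max_components = 40, 50, 12
C = rng.uniform(size=(n, 3))
S = rng.standard_normal((3, p))
X = C @ S + 0.05 * rng.standard_normal((n, p))
y = C[:, 0] + 0.05 * rng.standard_normal(n)

folds = np.array_split(rng.permutation(n), 5)   # 5-fold cross-validation
train_rmse, cv_rmse = [], []
for k in range(1, max_components + 1):
    full_fit = pls1_coefficients(X, y, k)
    train_rmse.append(np.sqrt(np.mean((y - predict(full_fit, X)) ** 2)))
    cv_pred = np.empty(n)
    for held_out in folds:
        keep = np.setdiff1d(np.arange(n), held_out)
        fold_fit = pls1_coefficients(X[keep], y[keep], k)
        cv_pred[held_out] = predict(fold_fit, X[held_out])
    cv_rmse.append(np.sqrt(np.mean((y - cv_pred) ** 2)))

best_k = int(np.argmin(cv_rmse)) + 1
for k, (tr, cv) in enumerate(zip(train_rmse, cv_rmse), start=1):
    print(f"{k:2d} components: train RMSE={tr:.4f}  CV RMSE={cv:.4f}")
print("components chosen by cross-validation:", best_k)
```

The training RMSE never rises as components are added, while the cross-validated RMSE is free to turn upward once extra components start fitting noise. The component count at the cross-validated minimum is the one to keep.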
Question 3 True / False
PLS models for spectral data typically require fewer latent variables than PCR models to achieve the same predictive accuracy.
T. True
F. False
Answer: True
True. PCR selects components based on spectral variance, which may be dominated by interferents or instrument noise unrelated to the analyte. PLS selects components based on covariance between spectra and concentration, so the first few PLS components are specifically targeted at the analyte's contribution. This more efficient use of dimensionality means PLS typically reaches comparable prediction accuracy with fewer components — reducing the risk of overfitting and making the model more interpretable.
Question 4 True / False
Ordinary least squares regression (OLS) can be reliably applied to multivariate spectral calibration problems whenever the number of calibration samples exceeds the number of wavelengths measured.
T. True
F. False
Answer: False
False. The problem is not just having enough samples; it is collinearity. Adjacent wavelengths in a spectrum are highly correlated and carry nearly identical information, which makes the matrix that OLS must invert ill-conditioned or even singular, even when samples outnumber wavelengths. High collinearity inflates the variances of the regression coefficients enormously and makes predictions unstable. PCR and PLS address this by first compressing the correlated wavelengths into a small number of uncorrelated latent variables and then regressing on those, which sidesteps the collinearity problem.
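The effect of collinearity can be seen numerically even with four times as many samples as wavelengths. The following sketch uses simulated smooth spectra built from three assumed bands. All sizes and noise levels are illustrative assumptions, and PCR is written directly from the SVD instead of calling a library routine.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 50                     # more samples than wavelengths
wl = np.linspace(0.0, 1.0, p)

# Three broad, overlapping bands (assumed shapes), so neighboring channels co-vary.
S = np.array([np.exp(-((wl - c) / 0.1) ** 2) for c in (0.3, 0.5, 0.7)])
C = rng.uniform(size=(n, 3))
X = C @ S + 1e-4 * rng.standard_normal((n, p))
y = C[:, 0] + 0.01 * rng.standard_normal(n)

Xc, yc = X - X.mean(axis=0), y - y.mean()

# Correlation between two neighboring wavelength channels.
adjacent_corr = np.corrcoef(Xc[:, 24], Xc[:, 25])[0, 1]

# Condition number of the matrix that OLS has to invert.
condition_number = np.linalg.cond(Xc.T @ Xc)

# OLS on all 50 wavelengths.
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# PCR: regress on the scores of the first k principal components only.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
beta_pcr = Vt[:k].T @ ((U[:, :k].T @ yc) / s[:k])

print(f"correlation of adjacent wavelengths: {adjacent_corr:.5f}")
print(f"condition number of X^T X:           {condition_number:.2e}")
print(f"OLS coefficient norm:                {np.linalg.norm(beta_ols):.2f}")
print(f"PCR coefficient norm (k={k}):          {np.linalg.norm(beta_pcr):.2f}")
```

The OLS coefficient vector comes out far larger than the PCR one because OLS divides small noise projections by tiny singular values. Those inflated coefficients are the instability that the latent-variable methods avoid.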
Question 5 Short Answer
Explain why the number of latent variables (components) in a PLS or PCR model must be determined by cross-validation rather than simply choosing the number that minimizes training error.
Model answer: Training error keeps decreasing (or at best levels off) as more components are added, because additional components can fit idiosyncratic noise in the calibration data, so minimizing it would always favor the largest model. Cross-validation withholds subsets of calibration samples, tests prediction on them, and penalizes overfitting: components that fit noise in the training samples will predict poorly on left-out samples. The minimum cross-validated error identifies the optimal number of components: enough to capture real spectral-concentration relationships, but not so many that the model memorizes noise specific to the calibration set.
This is the model selection problem. A model with too few components underfits — it misses real chemical signals. A model with too many overfits — it learns the noise patterns specific to the calibration samples and fails on new samples from the same analytical system. Since the goal is accurate prediction on future samples (not perfect fit to calibration data), the selection criterion must be predictive accuracy on held-out data, which is exactly what cross-validation measures.