Questions: Chemometrics: Multivariate Calibration and Data Analysis
5 questions to test your understanding
Question 1 Multiple Choice
A chemist builds a PLS model for predicting glucose in blood plasma from near-IR spectra, using 15 latent variables. The model predicts the training set with excellent accuracy but performs poorly on new patient samples. What is the most likely explanation?
A. Near-IR spectroscopy is inherently too insensitive for glucose in complex biological matrices
B. The model has overfit the training data by including too many latent variables, learning noise rather than true chemical signal
C. PLS regression is not appropriate for biological samples with variable composition
D. The training set was too large, which reduced the model's sensitivity to individual samples
Answer: B
Excellent training-set performance combined with poor prediction on new samples is the classic overfitting signature. With 15 latent variables, the model has likely fit instrumental artifacts, sample-specific noise, and other training-set patterns that do not generalize to new samples. Cross-validation on the training set would have shown the number of components beyond which prediction error on held-out subsets stops decreasing and starts to rise, which marks the optimal model complexity.
Question 2 Multiple Choice
Why is PLS regression preferred over PCA for building quantitative concentration prediction models in chemometrics?
A. PCA is computationally too expensive for large spectral datasets
B. PCA is a supervised method that already incorporates concentration information
C. PLS finds latent variables that simultaneously capture spectral variance AND correlate with target concentration; PCA is unsupervised and may find variance directions unrelated to the analyte
D. PLS requires fewer calibration standards than PCA to build a reliable model
Answer: C
PCA is unsupervised: it finds directions of maximum spectral variance without regard to concentration. The dominant principal component might capture instrument drift or baseline variation rather than the analyte signal. PLS is supervised: it seeks latent variables that have high covariance with the target variable, balancing spectral variance against relevance to concentration. This supervision usually lets PLS reach a given prediction accuracy with fewer components, especially in complex matrices where irrelevant variance (interfering components, baseline) dominates the raw spectral variation.
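The contrast between the two methods can be sketched with simulated spectra. Everything in this example is an assumption made for illustration: the Gaussian band shapes, the noise levels, and the variable names are invented, and only the first latent variable of each method is computed. PC1 is taken from an SVD of the mean-centered spectra; the first PLS weight vector is taken as proportional to X^T y, which is the direction of maximum covariance with concentration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 200, 80
axis = np.linspace(0.0, 1.0, n_wavelengths)

# Hypothetical band shapes: the analyte and a non-overlapping interferent.
analyte_band = np.exp(-0.5 * ((axis - 0.70) / 0.05) ** 2)
interferent_band = np.exp(-0.5 * ((axis - 0.20) / 0.05) ** 2)

conc = rng.uniform(0.0, 1.0, n_samples)        # analyte concentration (the target)
interf = rng.normal(0.0, 0.6, n_samples)       # interferent level, varies more than conc
X = (np.outer(conc, analyte_band)
     + np.outer(interf, interferent_band)
     + rng.normal(0.0, 0.01, (n_samples, n_wavelengths)))

# Mean-center spectra and concentrations before extracting directions.
Xc = X - X.mean(axis=0)
yc = conc - conc.mean()

# PCA: the first loading is the top right-singular vector of the centered spectra.
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ vt[0]

# PLS, first latent variable only: unit weight vector proportional to X^T y.
w = Xc.T @ yc
w /= np.linalg.norm(w)
pls1_scores = Xc @ w

def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return float(abs(np.corrcoef(a, b)[0, 1]))

corr_pc1 = abs_corr(pc1_scores, conc)
corr_pls1 = abs_corr(pls1_scores, conc)
print(f"|corr(PC1 scores, concentration)|  = {corr_pc1:.3f}")
print(f"|corr(PLS1 scores, concentration)| = {corr_pls1:.3f}")
```

Because the interferent carries most of the spectral variance in this setup, PC1 is expected to follow it, while the first PLS scores should track concentration more closely. In real spectra with overlapping bands, several PLS components are usually needed before the analyte is isolated.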
Question 3 True / False
Adding more spectral variables (wavelengths) to a chemometric calibration model usually improves prediction accuracy because more information is generally beneficial.
T. True
F. False
Answer: False
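This is the "more is always better" fallacy. Wavelengths that carry no analyte information contribute noise, baseline variation, and interferent signal, and every added variable gives the model more room to fit chance correlations in the training data. Excluding uninformative spectral regions often improves prediction. The same logic applies to latent variables: beyond an optimal number, additional components fit noise and instrumental artifacts rather than real chemical signal, so training-set error keeps falling while prediction on independent samples deteriorates. Proper cross-validation identifies the point where additional complexity stops helping and starts hurting.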
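A small simulation can make the effect concrete. The sketch below swaps in ordinary multiple linear regression (numpy's `lstsq`) for a full chemometric model, and all numbers are assumptions for illustration: three hypothetical wavelengths that respond to the analyte, twenty that contain only noise, 30 calibration samples, and 500 independent test samples.

```python
import numpy as np

rng = np.random.default_rng(1)
SENSITIVITY = np.array([1.0, 0.8, 0.5])  # hypothetical response at 3 informative wavelengths

def simulate(n):
    """Return concentrations, the 3 informative wavelengths, and all 23 wavelengths."""
    conc = rng.uniform(0.0, 1.0, n)
    informative = np.outer(conc, SENSITIVITY) + rng.normal(0.0, 0.05, (n, 3))
    noise_only = rng.normal(0.0, 0.05, (n, 20))  # no analyte response at all
    return conc, informative, np.hstack([informative, noise_only])

def fit_mlr(X, y):
    """Least-squares fit of y on X with an intercept column."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(coef, X, y):
    """Root-mean-square prediction error of a fitted model."""
    A = np.column_stack([np.ones(len(y)), X])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

y_train, X3_train, X23_train = simulate(30)
y_test, X3_test, X23_test = simulate(500)

coef_3 = fit_mlr(X3_train, y_train)
coef_23 = fit_mlr(X23_train, y_train)

train_3, test_3 = rmse(coef_3, X3_train, y_train), rmse(coef_3, X3_test, y_test)
train_23, test_23 = rmse(coef_23, X23_train, y_train), rmse(coef_23, X23_test, y_test)

print(f" 3 wavelengths: train RMSE = {train_3:.4f}, test RMSE = {test_3:.4f}")
print(f"23 wavelengths: train RMSE = {train_23:.4f}, test RMSE = {test_23:.4f}")
```

The training error with 23 wavelengths can never exceed the training error with 3, because the smaller model is nested in the larger one. The test error is the number to watch: with this few calibration samples it typically rises once the noise-only wavelengths are added.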
Question 4 True / False
Cross-validation is essential in chemometric model building because it provides a far more realistic estimate of prediction performance on new samples than training-set error does, and it helps identify the appropriate number of latent variables.
T. True
F. False
Answer: True
Cross-validation leaves out subsets of training data, builds the model on the remainder, and tests prediction on the held-out subset — cycling through all subsets. The number of latent variables that minimizes cross-validation prediction error (not training-set error) is the optimal model complexity. Without this, a chemometrician cannot distinguish a model that has learned chemistry from one that has memorized training-set noise.
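The procedure described above can be sketched in a few lines of numpy. This is an illustrative implementation under stated assumptions, not a reference one: the spectra are simulated (three overlapping Gaussian bands with invented centers, widths, and noise levels), PLS1 is fitted with a plain NIPALS loop, and the helper names `pls1_fit`, `pls1_predict`, and `rmsecv` are hypothetical. The same fold assignment is reused for every number of latent variables so the errors are comparable.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS (PLS1) with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean           # residuals, deflated after each component
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                          # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                             # scores for this latent variable
        tt = t @ t
        p = Xr.T @ t / tt                      # spectral loadings
        q_k = (yr @ t) / tt                    # concentration loading
        Xr = Xr - np.outer(t, p)
        yr = yr - q_k * t
        W.append(w)
        P.append(p)
        q.append(q_k)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector in wavelength space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """Root-mean-square error of k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)  # every sample not in the held-out fold
        model = pls1_fit(X[train], y[train], n_components)
        sq_err += np.sum((pls1_predict(model, X[held_out]) - y[held_out]) ** 2)
    return float(np.sqrt(sq_err / len(y)))

# Simulated calibration set: analyte plus two overlapping interferents (assumed shapes).
rng = np.random.default_rng(42)
n_samples, n_wavelengths, max_lv = 40, 100, 15
axis = np.linspace(0.0, 1.0, n_wavelengths)

def band(center, width):
    return np.exp(-0.5 * ((axis - center) / width) ** 2)

pure = np.vstack([band(0.40, 0.08), band(0.50, 0.10), band(0.65, 0.12)])
conc = rng.uniform(0.0, 1.0, (n_samples, 3))
X = conc @ pure + rng.normal(0.0, 0.02, (n_samples, n_wavelengths))
y = conc[:, 0] + rng.normal(0.0, 0.02, n_samples)   # reference values with some error

rmsec_values, rmsecv_values = [], []
for k in range(1, max_lv + 1):
    fitted = pls1_predict(pls1_fit(X, y, k), X)
    rmsec_values.append(float(np.sqrt(np.mean((fitted - y) ** 2))))  # training-set error
    rmsecv_values.append(rmsecv(X, y, k))
    print(f"LVs = {k:2d}  RMSEC = {rmsec_values[-1]:.4f}  RMSECV = {rmsecv_values[-1]:.4f}")

best_k = int(np.argmin(rmsecv_values)) + 1
print("Number of latent variables with lowest RMSECV:", best_k)
```

The training-set error (RMSEC) can only go down as latent variables are added, which is why it cannot be used to choose model complexity. The cross-validated error (RMSECV) is the curve to inspect. In practice, many practitioners prefer the smallest number of components whose RMSECV is not meaningfully worse than the minimum, rather than the strict minimum used here.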
Question 5 Short Answer
What fundamental limitation of univariate calibration does multivariate calibration (e.g., PLS) overcome, and what new risk does it introduce?
Model answer: Univariate calibration fails when multiple analytes or interferents have overlapping signals — a single wavelength cannot selectively quantify one component in a complex mixture, and the calibration relationship breaks down as sample composition varies. PLS overcomes this by using the full spectral fingerprint across all wavelengths, exploiting subtle covariance patterns to separate analyte signal from interferents and predict concentration despite overlap. The new risk introduced is overfitting: with hundreds of wavelengths available, the model can learn spectral patterns specific to the training samples (noise, baseline drift, instrument-specific artifacts) that do not generalize. Rigorous cross-validation is the essential control.
The power of chemometrics is that it makes previously impossible measurements routine — simultaneously quantifying five components from a single spectrum, or classifying authentic versus adulterated foods from spectral fingerprints. The risk is that the same flexibility allows the construction of models that appear accurate but are actually wrong. The discipline of validation — cross-validation, independent test sets, and ongoing model maintenance — is what separates chemometrics done well from chemometrics that produces confidently incorrect answers.
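The univariate limitation from the model answer can be illustrated with a toy mixture. The sketch below rests on assumptions throughout: two overlapping Gaussian bands stand in for the analyte and an interferent, and a two-wavelength inverse least-squares calibration is used as a simpler stand-in for PLS. It compares a calibration at the analyte peak alone with one that also uses the absorbance at the interferent peak.

```python
import numpy as np

rng = np.random.default_rng(2)
axis = np.linspace(0.0, 1.0, 50)

# Hypothetical overlapping bands for the analyte and one interferent.
analyte_band = np.exp(-0.5 * ((axis - 0.45) / 0.10) ** 2)
interferent_band = np.exp(-0.5 * ((axis - 0.60) / 0.10) ** 2)

def simulate(n):
    """Mixture spectra in which the interferent level varies independently of the analyte."""
    c_analyte = rng.uniform(0.0, 1.0, n)
    c_interf = rng.uniform(0.0, 1.0, n)
    spectra = (np.outer(c_analyte, analyte_band)
               + np.outer(c_interf, interferent_band)
               + rng.normal(0.0, 0.01, (n, axis.size)))
    return c_analyte, spectra

def calibrate(X, y):
    """Inverse least-squares calibration with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(pred, y):
    return float(np.sqrt(np.mean((pred - y) ** 2)))

y_train, S_train = simulate(40)
y_test, S_test = simulate(200)

peak_a = int(np.argmax(analyte_band))       # wavelength index of the analyte peak
peak_i = int(np.argmax(interferent_band))   # wavelength index of the interferent peak

# Univariate: absorbance at the analyte peak only.
uni = calibrate(S_train[:, [peak_a]], y_train)
rmse_uni = rmse(predict(uni, S_test[:, [peak_a]]), y_test)

# Multivariate: add the interferent peak so the model can correct for the overlap.
multi = calibrate(S_train[:, [peak_a, peak_i]], y_train)
rmse_multi = rmse(predict(multi, S_test[:, [peak_a, peak_i]]), y_test)

print(f"Univariate test RMSE:     {rmse_uni:.4f}")
print(f"Two-wavelength test RMSE: {rmse_multi:.4f}")
```

The single-wavelength model has no way to tell analyte absorbance from interferent absorbance at the peak, so the varying interferent shows up as prediction error. The second wavelength supplies the information needed to separate them. PLS extends the same idea to the full spectrum, which is also where the overfitting risk discussed above comes in.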