Questions: Chemometrics and Multivariate Data Analysis
5 questions to test your understanding
Score: 0 / 5
Question 1 Multiple Choice
A pharmaceutical analyst wants to predict tablet potency from near-IR spectra, but ordinary multiple linear regression fails when they include all 1500 spectral variables. Why does PLS regression succeed where OLS fails here?
APLS uses a larger training dataset than OLS requires
BPLS handles collinear variables by first compressing the spectrum into a small number of latent variables that capture the relevant spectral variation, whereas OLS breaks down when predictor variables are highly correlated with each other
CPLS automatically removes irrelevant wavelengths, leaving only the peak wavelengths for regression
DPLS is more accurate than OLS for any regression problem involving more than 100 variables
Adjacent wavelengths in a near-IR spectrum are highly correlated (collinear) — they convey nearly the same information. Ordinary least squares cannot handle collinear predictors: the matrix inversion required to solve the regression becomes numerically unstable, and small noise in the data produces wildly varying coefficients. PLS avoids this by finding latent variables — linear combinations of spectral variables that maximally covary with the response (concentration) — compressing thousands of correlated wavelengths into a handful of orthogonal factors. Regression proceeds on these factors, which are non-collinear by construction.
Question 2 Multiple Choice
In a PCA of UV-Vis spectra from 80 wine samples, the first two principal components explain 92% of the total variance. What do these principal components represent chemically?
AThe two wavelengths with the highest average absorbance in the dataset
BThe two wavelengths that best distinguish wine varieties from each other
COrthogonal directions in the high-dimensional spectral space that capture the greatest sources of systematic variation across samples — likely reflecting major chemical differences such as pigment concentration or pH
DThe mean spectrum and its standard deviation across all samples
Principal components are eigenvectors of the covariance matrix — they are directions in the original variable space (wavelength space) that capture maximum variance. They are not individual wavelengths but weighted combinations of all wavelengths. The first PC captures the direction of greatest spectral variation across samples; the second captures the next greatest orthogonal direction. Chemically, these often correspond to the dominant sources of chemical variation in the dataset (e.g., anthocyanin concentration, pH-dependent chromophore shifts). The 92% variance capture means most of the chemical information in 1000+ wavelengths is compressed into just two dimensions.
Question 3 True / False
In chemometrics, including more spectral variables (wavelengths) in a calibration model typically improves its predictive accuracy on new samples.
TTrue
FFalse
Answer: False
This is a fundamental misconception about multivariate modeling. Adding more variables past a certain point leads to overfitting — the model fits noise in the training data rather than genuine chemical signal — and degrades performance on new samples. Collinearity among spectral variables also causes numerical instability. Chemometrics methods like PCA and PLS work precisely because they exploit the redundancy in spectral data: the true dimensionality of the chemical problem is far smaller than the number of measured variables. Cross-validation is used to find the optimal number of principal components or latent variables — enough to capture real signal, few enough to avoid fitting noise.
Question 4 True / False
PCA finds principal components that are eigenvectors of the covariance matrix of the data, ordered by decreasing eigenvalue, where each eigenvalue represents the variance explained by that component.
TTrue
FFalse
Answer: True
This is the direct mathematical definition of PCA. The covariance matrix encodes how all pairs of variables co-vary across samples. Its eigenvectors are the principal component directions — orthogonal axes in the original variable space along which the data is most spread out. The corresponding eigenvalues quantify the variance along each direction. Ordering by decreasing eigenvalue ensures the first few components capture the most information. In spectral chemometrics, this typically means two or three components capture 90–99% of the total variance, demonstrating the dramatic redundancy in spectral data.
Question 5 Short Answer
Explain the key insight behind applying PCA to chemical spectral data, and why two or three components often capture most of the information in spectra with thousands of variables.
Think about your answer, then reveal below.
Model answer: The key insight is that chemical spectral data is highly redundant: a spectrum with 2000 wavelength values is not 2000-dimensional in any meaningful chemical sense. Adjacent wavelengths are highly correlated (they measure the same absorbance event), and the spectrum as a whole is determined by a small number of independent chemical sources of variation (a few analyte concentrations, pH, solvent effects). PCA discovers this low-dimensional structure by finding orthogonal directions of maximum variance in the data. Because the 'true' chemical variation spans only a few independent factors, the first two or three principal components — which capture the largest variances — account for most of the information. The remaining components capture instrument noise, minor baseline shifts, and random measurement error.
A concrete analogy: if you measure 50 people's heights in both inches and centimeters, you have 2 variables but only 1 dimension of real variation. PCA would find one component explaining ~100% of variance. Spectra are the same idea but with thousands of correlated variables and perhaps 3–10 independent chemical factors. Chemometrics works because chemistry imposes structure on the data — the number of underlying chemical sources of variation is small, even when the number of measured variables is large.