A data scientist wants to reduce 500-dimensional gene expression data to 10 dimensions as input features for a supervised classifier. She runs t-SNE to produce a 10-dimensional embedding and uses those coordinates as features. What is the fundamental problem with this approach?
A. t-SNE cannot handle more than 100 dimensions, so it will fail on 500-dimensional input
B. t-SNE is non-parametric: it cannot project new data points, so the test set cannot be embedded using the training embedding
C. t-SNE preserves too much global structure, making it unsuitable for classification tasks
D. 10 dimensions is too many for t-SNE, which only works for 2D or 3D output
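Answer: B
Standard t-SNE is a non-parametric method: it produces an embedding of the training data, but there is no learned function that can map new data points into the same space. The test set therefore cannot be placed in the training embedding, which makes the approach unsuitable for feature engineering before supervised learning. UMAP in its basic form shares this limitation, although some implementations provide an approximate transform for new points. PCA, by contrast, learns a linear projection that can be applied to any new data. Option A is wrong because t-SNE has no 100-dimension input limit. The 2D/3D output restriction (option D) is a practical norm, not a hard limit, and option C has it backwards: t-SNE preserves local structure well but sacrifices global structure.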
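The "learned function" that PCA provides can be sketched in a few lines of NumPy. This is a minimal illustration, not a recommended pipeline: the random matrix is a stand-in for real expression data, and `fit_pca` and `apply_pca` are hypothetical helper names introduced here. The point is that the parameters (a mean and a set of directions) are learned from the training split only and then reused unchanged on the test split, which is exactly the step standard t-SNE has no counterpart for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for expression data: 200 samples x 500 features (assumption).
X = rng.normal(size=(200, 500))
X_train, X_test = X[:150], X[150:]

def fit_pca(X, n_components):
    """Learn a linear projection (mean + top principal directions) from X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project any data, seen or unseen, with the learned parameters."""
    return (X - mean) @ components.T

mean, components = fit_pca(X_train, n_components=10)
Z_train = apply_pca(X_train, mean, components)
Z_test = apply_pca(X_test, mean, components)  # same learned map, new points

print(Z_train.shape, Z_test.shape)  # (150, 10) (50, 10)
```

Because the test points never influence `mean` or `components`, the features fed to the classifier are computed the same way at training and prediction time.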
Question 2 Multiple Choice
You have high-dimensional data that you suspect lies on a curved, Swiss-roll-shaped manifold. You want to understand the cluster structure for a research presentation. Which method is most appropriate?
A. PCA, because it is interpretable and invertible
B. ICA, because it finds statistically independent components rather than just uncorrelated ones
C. t-SNE or UMAP, because they capture nonlinear manifold structure and reveal cluster geometry
D. An autoencoder, because it is the only parametric nonlinear method
Answer: C
PCA and ICA are linear methods: they project onto flat linear subspaces and will distort the intrinsic geometry of a curved manifold like a Swiss roll. t-SNE and UMAP are nonlinear methods specifically designed to reveal cluster structure in high-dimensional data for visualization. An autoencoder (option D) could learn the manifold, but it is not the only parametric nonlinear method, it requires substantial training, and its latent space is harder to interpret visually. For a research presentation aimed at understanding cluster structure, t-SNE or UMAP is the right tool.
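To see why straight-line geometry misleads on a rolled-up manifold, consider a simplified 2D cross-section of a Swiss roll: the spiral x = t cos t, y = t sin t (an assumption made here for illustration, not part of the question). The sketch below compares the straight-line distance between two points on adjacent layers with the distance measured along the roll.

```python
import math

def spiral_point(t):
    """Point on the 2D spiral x = t*cos(t), y = t*sin(t)."""
    return (t * math.cos(t), t * math.sin(t))

def arc_length(t):
    """Length of the spiral from 0 to t: the integral of sqrt(1 + s^2) ds."""
    return 0.5 * (t * math.sqrt(1.0 + t * t) + math.asinh(t))

t_inner = 2.0 * math.pi
t_outer = 4.0 * math.pi  # one full turn further along the roll

p, q = spiral_point(t_inner), spiral_point(t_outer)
straight_line = math.dist(p, q)
along_manifold = arc_length(t_outer) - arc_length(t_inner)

print(f"straight-line distance:  {straight_line:.2f}")
print(f"distance along the roll: {along_manifold:.2f}")
```

The two points look close in the ambient space but are far apart along the manifold. No single flat projection can unroll the spiral, which is why linear methods tend to fold distant parts of the manifold on top of each other.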
Question 3 True / False
Unlike t-SNE and UMAP, a trained autoencoder encoder network can project new, unseen data points into the latent space.
T. True
F. False
Answer: True
This is the critical distinction between parametric and non-parametric methods. In their standard form, t-SNE and UMAP are non-parametric: they produce coordinates for the training data only, with no learned function applicable to new data. An autoencoder's encoder is a trained neural network, that is, a parametric function that can accept any input of the right shape and map it to the latent representation. This makes autoencoders usable for feature engineering and downstream tasks, while t-SNE and UMAP are best treated as visualization tools.
Question 4 True / False
PCA is generally the best dimensionality reduction method for revealing complex cluster structure in high-dimensional biological data, because it is fast, interpretable, and widely used.
T. True
F. False
Answer: False
PCA can only capture linear relationships. If the meaningful structure in the data lies on a curved manifold, which is common in biological datasets such as single-cell RNA sequencing, PCA's linear projections will distort or obscure that structure. Nonlinear methods like t-SNE and UMAP often reveal tight, well-separated clusters in biological data that PCA collapses into indistinguishable blobs. Being fast, interpretable, and widely used makes PCA a good starting point for feature engineering before supervised models, but not for exploratory visualization of complex cluster geometry.
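A toy example makes the linear limitation concrete. The sketch below uses two concentric rings as made-up data (an assumption, far simpler than real biological data) and projects them onto the first principal component. It does not run t-SNE or UMAP; the distance from the origin stands in for the kind of nonlinear structure those methods can pick up.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n points evenly spaced on a circle, with small radial noise."""
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    r = radius + rng.normal(scale=0.05, size=n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

inner, outer = ring(1.0, 200), ring(3.0, 200)
X = np.vstack([inner, outer])

# PCA to one dimension: project centered data onto the top principal direction.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
pc1_inner = (inner - mean) @ Vt[0]
pc1_outer = (outer - mean) @ Vt[0]

# The inner ring's projected range sits inside the outer ring's range.
print("PC1 inner range:", pc1_inner.min(), pc1_inner.max())
print("PC1 outer range:", pc1_outer.min(), pc1_outer.max())

# A nonlinear feature (distance from the origin) separates the rings cleanly.
radius_inner = np.linalg.norm(inner, axis=1)
radius_outer = np.linalg.norm(outer, axis=1)
print("radius gap:", radius_outer.min() - radius_inner.max())
```

Whatever direction PCA picks, the two groups overlap along it, while a single nonlinear coordinate separates them with a wide margin.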
Question 5 Short Answer
Why can the axes in a t-SNE or UMAP embedding not be meaningfully interpreted or compared across runs, even when the overall cluster structure looks similar?
Model answer: t-SNE and UMAP are non-parametric optimization procedures. They search for a low-dimensional configuration that preserves local neighborhood structure, but the solution is not unique: the objective depends only on distances between embedded points, so any rotation or reflection of a solution is equally good, and the result also depends on random initialization. The axes therefore have no fixed interpretation, unlike PCA, where each axis corresponds to a direction of maximum variance in the original space. Distances between clusters may also not be comparable between runs, or between embeddings of different datasets.
This matters practically: you cannot compare the x-axis value of a point across two different t-SNE runs and conclude anything about its relationship to points in the other run. The coordinate system is arbitrary. PCA avoids this: the first principal component always points along the direction of maximum variance in the data (up to sign), giving each axis a consistent geometric meaning.
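The arbitrariness of the coordinate system can be checked directly. The t-SNE objective sees the embedded points only through their pairwise distances, so a rotated copy of an embedding scores exactly the same. The sketch below uses a handful of made-up 2D coordinates (hypothetical, not output from a real run) to show that rotation leaves every pairwise distance intact while changing every axis value that could be read off a plot.

```python
import math

def pairwise_distances(points):
    """All pairwise Euclidean distances, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def rotate(points, angle):
    """Rotate every 2D point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# A made-up 2D embedding of five points (hypothetical coordinates).
embedding = [(0.0, 0.0), (1.0, 0.2), (0.9, 1.1), (-2.0, 3.0), (-2.2, 2.7)]
rotated = rotate(embedding, math.radians(73))

same_distances = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(embedding), pairwise_distances(rotated))
)
same_x = [p[0] for p in embedding] == [p[0] for p in rotated]

print("pairwise distances unchanged:", same_distances)  # True
print("x-coordinates unchanged:", same_x)               # False
```

Both layouts are equally valid embeddings of the same neighborhoods, so a statement like "this point has a large x value" carries no information on its own.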