Question 1 Multiple Choice
A machine learning team has 200 labeled examples and 200,000 unlabeled examples. They apply a semi-supervised method and find it performs worse than a supervised model trained only on the 200 labeled examples. What is the most likely explanation?
A. 200,000 unlabeled examples is too many; semi-supervised methods work best with a 1:10 labeled-to-unlabeled ratio
B. The cluster assumption does not hold: class boundaries pass through dense regions of the feature space, so unlabeled data misleads the model
C. Semi-supervised learning requires at least 1,000 labeled examples to function properly
D. The model architecture was too simple to exploit the unlabeled data structure
Answer: B
Semi-supervised methods assume that data in the same cluster share a label, so the unlabeled data reveals cluster structure that guides the decision boundary into low-density gaps. When this assumption fails, that is, when class boundaries run through the middle of dense clusters, unlabeled data actively misleads the model and pushes decision boundaries into the wrong places. Adding more unlabeled data can then make things worse, not better. The cluster (or smoothness) assumption is a prerequisite, not a guarantee.
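The failure mode can be reproduced on a toy problem. The sketch below is only an illustration under stated assumptions: the 1D data, the boundary at 0.55, and the "widest gap" classifier are all hypothetical stand-ins (the last one is a deliberately crude proxy for low-density separation, not a real SSL method). The class boundary is placed inside a dense cluster, so a classifier that trusts the density structure does worse than a plain 1-nearest-neighbour model fitted on the four labeled points.

```python
# Toy 1D data (hypothetical): two dense clusters with an empty gap between them.
unlabeled = [i / 10 for i in range(11)] + [3 + i / 10 for i in range(11)]
labeled = [(0.2, 0), (0.4, 0), (0.7, 1), (3.5, 1)]


def true_label(x):
    # Assumed ground truth: the boundary (0.55) cuts through the left cluster.
    return int(x > 0.55)


def predict_1nn(x):
    """Supervised baseline: copy the label of the nearest labeled point."""
    return min(labeled, key=lambda p: abs(p[0] - x))[1]


def widest_gap_threshold(points):
    """Crude low-density separator: midpoint of the widest gap in the data."""
    pts = sorted(points)
    _, left, right = max((b - a, a, b) for a, b in zip(pts, pts[1:]))
    return (left + right) / 2


threshold = widest_gap_threshold(unlabeled)


def side_label(on_right):
    """Majority label among labeled points on one side of the threshold."""
    votes = [y for x, y in labeled if (x > threshold) == on_right]
    return int(sum(votes) * 2 > len(votes))


def predict_gap(x):
    """Density-guided classifier: one label per side of the widest gap."""
    return side_label(x > threshold)


def accuracy(predict):
    return sum(predict(x) == true_label(x) for x in unlabeled) / len(unlabeled)


supervised_acc = accuracy(predict_1nn)
gap_acc = accuracy(predict_gap)
print(f"1-NN: {supervised_acc:.2f}, gap-based: {gap_acc:.2f}")
```

On this data the gap-based classifier mislabels every point between 0.6 and 1.0, because the density structure tells it to treat the whole left cluster as one class. Adding more unlabeled points to the same two clusters would not move its threshold.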
Question 2 Multiple Choice
In self-training (pseudo-labeling), a model assigns confident predictions to unlabeled examples and adds them to the training set. What is the primary risk of this approach?
A. The model will label too few examples, failing to benefit from the unlabeled data
B. Confident but incorrect pseudo-labels compound through subsequent retraining iterations, amplifying early errors
C. The approach violates the i.i.d. assumption because pseudo-labels are correlated with the original predictions
D. The model will overfit the labeled data because pseudo-labels lack the diversity of real annotations
Answer: B
Self-training's fundamental risk is error propagation. If the initial model makes a confident but wrong prediction, that pseudo-label enters the training set and reinforces the mistake in the next iteration. The next model becomes more confidently wrong on those examples, labels more similar examples incorrectly, and the error compounds. Confidence thresholds mitigate this but do not eliminate it: the initial model must be reasonably accurate, and the threshold must be high enough to filter out most mistakes.
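The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the 1D nearest-centroid classifier, the softmax-over-distances confidence score, the function names, and the 0.9 threshold are all assumptions chosen to keep the example self-contained. The point to notice is that accepted pseudo-labels are appended to the training set and are never checked again.

```python
import math


def fit_centroids(data):
    """Return the mean of each class (0 and 1) from (x, y) pairs."""
    means = {}
    for c in (0, 1):
        xs = [x for x, y in data if y == c]
        means[c] = sum(xs) / len(xs)
    return means


def predict_proba(means, x):
    """Softmax over negative distances to the two class centroids."""
    scores = [math.exp(-abs(x - means[c])) for c in (0, 1)]
    total = sum(scores)
    return [s / total for s in scores]


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Repeatedly pseudo-label confident points and retrain on them."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        means = fit_centroids(train)
        confident, remaining = [], []
        for x in pool:
            probs = predict_proba(means, x)
            if max(probs) >= threshold:
                # Pseudo-label: accepted on confidence alone, never verified.
                confident.append((x, probs.index(max(probs))))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing passed the threshold, so stop early
        train.extend(confident)
        pool = remaining
    return fit_centroids(train), train, pool


labeled = [(0.0, 0), (4.0, 1)]
unlabeled = [-1.0, 0.5, 1.0, 1.8, 2.2, 3.0, 3.5, 5.0]
means, train, pool = self_train(labeled, unlabeled)
print(len(train), pool)
```

In this toy run the four points nearest the centroids are pseudo-labeled in the first round and the four ambiguous points in the middle stay in the pool. Lowering `threshold` admits more points per round, which is exactly where a wrong early label would start to shift the centroids.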
Question 3 True / False
Semi-supervised methods like FixMatch rely on the principle that a model's prediction should be consistent across different augmented views of the same unlabeled example, which pushes decision boundaries away from dense data regions.
True
False
Answer: True
This is consistency regularization, the key principle behind methods like MixMatch, UDA, and FixMatch. By penalizing prediction differences between weakly and strongly augmented versions of the same input (the FixMatch formulation), the model is forced to place its decision boundary where small perturbations do not flip the prediction, which tends to be in low-density gaps between clusters. Because the constraint asks for consistent predictions rather than correct ones, it leans less heavily on the initial model's accuracy than raw pseudo-labeling does, although FixMatch still derives its training targets from confident predictions on the weakly augmented view.
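A FixMatch-style unlabeled loss can be sketched as follows. This is a simplified illustration under assumptions: it operates on plain lists of logits rather than a real model and augmentation pipeline, the function names are hypothetical, and the 0.95 threshold is an assumed default. For each unlabeled example, the prediction on the weak view becomes a hard pseudo-label if it is confident enough, and the strong view is penalized with cross-entropy against that label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """Masked cross-entropy between weak-view pseudo-labels and strong-view predictions."""
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        weak_probs = softmax(weak)
        confidence = max(weak_probs)
        if confidence < threshold:
            continue  # low-confidence examples contribute zero loss
        pseudo_label = weak_probs.index(confidence)
        strong_probs = softmax(strong)
        total += -math.log(strong_probs[pseudo_label])
    # Average over the whole batch, including masked-out examples.
    return total / len(weak_logits)


# The two views agree on class 0: small loss.
agree = consistency_loss([[4.0, 0.0]], [[4.0, 0.0]])
# The strong view flips to class 1: large loss.
disagree = consistency_loss([[4.0, 0.0]], [[0.0, 4.0]])
# The weak view is not confident, so the example is ignored.
masked = consistency_loss([[0.2, 0.0]], [[0.0, 3.0]])
print(agree, disagree, masked)
```

The loss is small when the two views agree, large when the strong augmentation flips the prediction, and exactly zero when the weak-view prediction falls below the threshold.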
Question 4 True / False
Adding more unlabeled data to a semi-supervised learning system will typically improve or at least not harm model performance compared to supervised learning on the labeled set alone.
True
False
Answer: False
This is a common and dangerous misconception. When the cluster assumption fails, unlabeled data actively degrades performance by steering the decision boundary in the wrong direction. Semi-supervised methods can legitimately underperform a purely supervised baseline when class boundaries are not aligned with density structure, and such degradation has been reported in empirical evaluations. The decision to use semi-supervised learning (SSL) should depend on whether the data distribution satisfies the assumption, not on the availability of unlabeled data.
Question 5 Short Answer
What is the cluster assumption in semi-supervised learning, and why does whether it holds determine whether SSL helps or hurts?
Model answer: The cluster assumption states that data points in the same cluster in feature space tend to share the same class label — equivalently, that decision boundaries should pass through low-density regions between clusters, not through dense regions. When this holds, unlabeled data reveals cluster structure (which clusters exist and where they are), and even a few labeled points per cluster are enough to assign labels to the whole cluster. When it fails, the cluster structure is irrelevant to or contradicts the class boundaries, so unlabeled data misleads the model about where to place those boundaries.
The cluster assumption is the core precondition that makes SSL useful. It connects unsupervised structure (density, clusters) to supervised signal (labels). Without it, unlabeled data provides no useful information about the decision boundary and may actively corrupt the model by pulling boundaries toward low-density gaps that do not separate the classes.
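One rough way to probe whether cluster structure lines up with labels is to cluster the data and measure purity on the labeled subset. The sketch below is an assumed diagnostic, not something prescribed by the material above: `cluster_purity` is a hypothetical helper, the cluster ids are taken as given (any clustering method could supply them), and with only a handful of labeled points the estimate will be noisy.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Fraction of labeled points whose label matches their cluster's majority label."""
    by_cluster = defaultdict(list)
    for cid, y in zip(cluster_ids, labels):
        by_cluster[cid].append(y)
    # For each cluster, count how many points carry its most common label.
    majority_hits = sum(
        Counter(ys).most_common(1)[0][1] for ys in by_cluster.values()
    )
    return majority_hits / len(labels)


# Labels follow the clusters: the assumption looks plausible.
aligned = cluster_purity([0, 0, 0, 1, 1, 1], ["a", "a", "a", "b", "b", "b"])
# Labels are mixed within each cluster: the assumption looks doubtful.
misaligned = cluster_purity([0, 0, 0, 1, 1, 1], ["a", "b", "a", "b", "a", "b"])
print(aligned, misaligned)
```

A purity close to 1.0 suggests clusters and classes coincide. A purity close to the majority-class rate suggests the cluster structure carries little label information, in which case comparing against a purely supervised baseline before committing to SSL is the safer choice.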