In SimCLR, the composition of multiple data augmentations (e.g., random cropping plus color jitter) matters far more than any single augmentation applied alone. What is the best explanation for this?
A. Multiple augmentations increase the number of positive pairs per batch, directly improving optimization speed
B. Composing augmentations creates views that differ on many dimensions simultaneously, forcing the model to learn invariances to all of them at once
C. Individual augmentations don't change pixel statistics enough for the contrastive loss to compute meaningful gradients
D. Multiple augmentations reduce data leakage between positive and negative pairs in the batch
Each augmentation destroys different information: color jitter makes the network ignore exact color; random cropping forces it to ignore absolute position and scale. When composed, the two augmented views differ along all these dimensions simultaneously, so the only way the network can make them similar is by encoding the semantic content that survives all augmentations — typically object identity, shape, and texture. A single augmentation teaches only one invariance; composing augmentations teaches a richer, more transferable set.
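The composition described above can be sketched with a toy example. This is a minimal illustration, not SimCLR's actual pipeline: the image is a nested list of RGB tuples in [0, 1], the function names (`random_crop`, `color_jitter`, `make_view`) and the default crop size and jitter strength are hypothetical, and the resize step that normally follows a crop is omitted.

```python
import random

def random_crop(img, size, rng):
    """Cut a size x size window at a random position (discards absolute position)."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def color_jitter(img, strength, rng):
    """Scale each RGB channel by a random gain (discards exact color)."""
    gains = [1.0 + rng.uniform(-strength, strength) for _ in range(3)]
    return [[tuple(min(1.0, max(0.0, c * g)) for c, g in zip(px, gains))
             for px in row] for row in img]

def make_view(img, rng, crop_size=4, strength=0.4):
    """Compose both augmentations so the view differs in position AND color."""
    return color_jitter(random_crop(img, crop_size, rng), strength, rng)

rng = random.Random(0)
# Toy 8x8 image: red varies with x, green with y, blue is constant.
image = [[(x / 7, y / 7, 0.5) for x in range(8)] for y in range(8)]
view_a = make_view(image, rng)   # the two views form one positive pair
view_b = make_view(image, rng)
```

Because each call draws its own crop position and channel gains, `view_a` and `view_b` generally disagree on both position and color, so neither cue alone is enough to match them.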
Question 2 Multiple Choice
A researcher trains SimCLR with a batch size of 64 instead of the original 4096. They observe much worse downstream performance. What is the most direct cause?
A. Smaller batches cause gradient instability in the projection head, corrupting the representation
B. With only 64 images per batch, each anchor has just 126 negatives, a weak discrimination signal compared to the 8190 negatives available at batch size 4096
C. Small batches make augmentation composition less effective because fewer augmentation combinations are sampled
D. The InfoNCE loss is undefined when batch size falls below 128
The number of negatives per anchor in SimCLR is 2(N-1), where N is the batch size: each image yields two views, and every view other than the anchor and its positive serves as a negative. At batch size 64, each anchor has only 126 negatives, so picking out the positive is relatively easy. At batch size 4096, each anchor must pick its positive from among 8190 negatives, a much harder task that requires genuinely discriminative features. A larger pool of negatives is also more likely to contain hard ones, which gives a stronger training signal and forces the network to learn more informative representations. MoCo was specifically designed to decouple the negative count from the batch size by keeping a queue of keys produced by a momentum-updated encoder.
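The arithmetic above can be checked in a few lines. The second function is an added illustration rather than something from the quiz: if a model scored the positive and every negative equally, the InfoNCE loss would be the log of the number of candidates, which gives a rough sense of how much harder the task becomes as negatives are added. Both function names are hypothetical.

```python
import math

def negatives_per_anchor(batch_size):
    """Each of N images yields 2 views; exclude the anchor and its positive."""
    return 2 * (batch_size - 1)

def chance_level_loss(batch_size):
    """InfoNCE loss when all candidates (1 positive + negatives) score equally."""
    return math.log(1 + negatives_per_anchor(batch_size))

for n in (64, 4096):
    print(n, negatives_per_anchor(n), round(chance_level_loss(n), 3))
```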
Question 3 True / False
Contrastive learning trains the model to map two augmented views of the same image to nearby points in representation space, and this implicitly teaches the network which features are semantically invariant.
T. True
F. False
Answer: True
The augmentations destroy information that is irrelevant to semantic content (exact position, color balance, scale) while preserving information that defines it (object identity, shape, texture). By forcing the model to map two very different-looking views of the same image to nearby points, the contrastive loss teaches the network to encode only the invariant semantic signal. The augmentation design is therefore not arbitrary — it is the mechanism by which the self-supervised learning signal is constructed.
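To make "map two views to nearby points" concrete, here is a minimal pure-Python sketch of an NT-Xent style loss (the normalized, temperature-scaled cross-entropy used in SimCLR). It assumes embeddings are plain lists of floats, and the temperature of 0.5 is an arbitrary illustrative value; the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(views_a, views_b, temperature=0.5):
    """Mean loss over all 2N anchors; views_a[i] and views_b[i] are a positive pair."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarities to every other embedding: one positive plus 2(N-1) negatives.
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)

first = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent(first, [[0.9, 0.1], [0.1, 0.9]])     # positives are close
mismatched = nt_xent(first, [[0.1, 0.9], [0.9, 0.1]])  # positives are far apart
print(aligned < mismatched)
```

The loss is lower when each embedding sits near its own positive than when it sits near another image's view, which is the pressure that pushes the encoder toward augmentation-invariant features.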
Question 4 True / False
BYOL and SimSiam demonstrate that explicit negative pairs are essential to prevent representational collapse in contrastive learning.
T. True
F. False
Answer: False
BYOL and SimSiam achieve competitive performance using only positive pairs, with no negatives at all. They prevent collapse through other mechanisms: an asymmetric architecture (a predictor head on one branch), a stop-gradient operation, and, in BYOL, a momentum encoder (SimSiam omits it). These results showed that the core mechanism is learning augmentation-invariant representations, not contrastive discrimination per se: negatives are one way to prevent collapse, but not the only way.
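Of the mechanisms listed, the momentum encoder is the easiest to show without an autodiff library. In BYOL the target network's weights are an exponential moving average of the online network's weights. The sketch below assumes weights are a flat list of floats; the function name and the value of `tau` are illustrative, not taken from any particular implementation.

```python
def ema_update(target, online, tau=0.99):
    """Move each target weight a fraction (1 - tau) toward the online weight."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]

online = [1.0, -2.0]   # weights trained by gradient descent
target = [0.0, 0.0]    # weights updated only through the moving average
for _ in range(3):
    target = ema_update(target, online)
```

The target drifts slowly toward the online weights without ever receiving gradients itself, so the two branches stay asymmetric.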
Question 5 Short Answer
Why does the choice of data augmentation strategy define what contrastive learning 'means' semantically, rather than being a mere implementation detail?
Model answer: The augmentations define which properties of the data are treated as irrelevant noise versus meaningful signal. Whatever information is consistently destroyed by the augmentations will be discarded from the representation; whatever survives all augmentations will be preserved. Since the model has no labels, the augmentations are the only mechanism that defines 'semantic similarity' — two images are treated as semantically identical if they are augmentations of each other.
This is why the same contrastive framework produces very different representations depending on the augmentation suite. SimCLR with standard image augmentations learns visual features useful for object recognition. Apply contrastive learning to audio spectrograms with time-masking augmentations and you learn speech features. The math is identical; the semantics are entirely determined by what the augmentations preserve.