Question 1 Multiple Choice
What is the core principle of contrastive learning for representation learning?
A. Maximize the loss on all training examples equally
B. Learn representations where similar examples are close in embedding space and dissimilar examples are far apart
C. Minimize all distances between examples to encourage clustering
D. Use only positive pairs and ignore negative pairs to focus on similarity
Answer: B
Contrastive learning pulls similar examples (positive pairs) together in embedding space while pushing dissimilar examples (negative pairs) apart, so that semantic similarity is captured geometrically. Positives are typically defined by augmentation (two augmented views of the same image form a positive pair) or by labels (same class = positive). Balancing the pull on positives against the push on negatives is the core learning objective.
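The pull/push idea can be sketched with the classic margin-based pairwise loss. This is an assumed stand-in for illustration, since the explanation above does not name a specific loss; the function names and the `margin` parameter are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_contrastive_loss(u, v, is_positive, margin=1.0):
    """Margin-based pairwise loss (illustrative, not a specific library's API).

    Positive pairs are penalized by their squared distance (pulled together).
    Negative pairs are penalized only when closer than `margin` (pushed apart).
    """
    d = euclidean(u, v)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A positive pair that is far apart incurs a large loss ...
far_positive = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=True)
# ... while a negative pair at the same distance incurs none.
far_negative = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=False)
```

Under this loss, option C (shrinking all distances) would drive every negative pair inside the margin, which is exactly what the second branch penalizes.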
Question 2 Short Answer
Contrastive loss functions (e.g., NT-Xent loss used in SimCLR) relate to which information-theoretic quantity?
Model answer: Contrastive loss relates to mutual information. Minimizing the NT-Xent (normalized temperature-scaled cross-entropy) loss maximizes a lower bound on the mutual information I(z_i; z_j) between the representations of two augmented views of the same instance (the positive pair). The bound comes from noise-contrastive estimation: the model learns to pick the true positive out of a set of negatives. Tightening this bound encourages representations that capture the structure shared across views while discarding view-specific details.
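For reference, the loss and the bound can be written out as follows. This is a sketch in standard notation; the symbols \tau (temperature), N (number of instances per batch, giving 2N augmented views), and K (number of candidates scored against each anchor) are assumptions introduced here, not taken from the question.

```latex
% NT-Xent loss for a positive pair (i, j) among 2N augmented views
\ell_{i,j} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}

% InfoNCE-style lower bound with K candidates (one positive, K - 1 negatives)
I(z_i; z_j) \;\ge\; \log K - \mathcal{L}_{\mathrm{InfoNCE}}
```

Because the loss is non-negative, the bound can never certify more than \log K nats of mutual information, so the number of candidates limits how tight it can be.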
This information-theoretic grounding gives contrastive learning a theoretical justification: learning representations shared across views corresponds to maximizing (a bound on) the mutual information between them. It also connects contrastive learning to information bottleneck theory and other principled approaches to representation learning.
Question 3 Multiple Choice
Why do contrastive learning methods benefit from large batch sizes, even though larger batches typically provide less gradient noise regularization?
A. Larger batches have no special advantage; batch size is irrelevant for contrastive learning
B. Larger batches provide more negative examples, increasing the diversity of contrasts the model learns from
C. Larger batches reduce variance, leading to cleaner representations
D. Batch size only matters for the optimizer, not the contrastive objective itself
Answer: B
Contrastive learning benefits from large batches because the other examples in the batch serve as negatives for each anchor. Larger batches provide more negatives, increasing the diversity of contrasts and improving the quality of the learned representation. A batch of 256 examples gives each anchor 255 other examples to contrast against (510 negative views in a SimCLR-style setup, where every example contributes two augmented views); a batch of 4096 gives 4095 (8190 negative views). This abundance of in-batch negatives is why these methods scale well with batch size and compute.
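How many negatives a batch yields depends on the counting convention. The sketch below assumes a SimCLR-style setup in which each of the N examples contributes two augmented views, so every anchor is contrasted against 2(N - 1) negative views. It is a plain-Python illustration of NT-Xent, not SimCLR's actual implementation; the function names and the default `temperature` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def negatives_per_anchor(batch_size):
    """Negative views per anchor when each example has two views."""
    return 2 * (batch_size - 1)

def nt_xent(views_a, views_b, temperature=0.5):
    """Mean NT-Xent loss over all 2N anchors.

    views_a[k] and views_b[k] are embeddings of two views of example k.
    """
    n = len(views_a)
    z = list(views_a) + list(views_b)  # all 2N embeddings
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of anchor i's positive view
        # Scores against every other view: one positive, 2(N - 1) negatives.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / (2 * n)

# Matching views give a lower loss than mismatched views.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Every extra example in the batch adds two more terms to each anchor's denominator, which is where the benefit of larger batches enters the objective.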
Question 4 True / False
The BYOL (Bootstrap Your Own Latent) algorithm achieves good performance using only positive pairs, with no explicit negative pairs. True or false: this contradicts contrastive learning theory.
T. True
F. False
Answer: False
BYOL achieves competitive performance without explicit negative pairs, which challenges the classical contrastive view that negatives are essential but does not contradict the underlying principle. Without negatives, the main risk is representation collapse, where every input maps to the same embedding. BYOL avoids this through an asymmetric design: an online network with an extra predictor head is trained to match a target network whose weights are a slowly moving average of the online weights, and a stop-gradient keeps the loss from updating the target directly. Some accounts argue that an implicit form of contrast remains, arising from the diversity of the training data and the interaction between the online and target networks. BYOL operates within a broader framework than classical contrastive learning, and its success shows that the specific form of negative pairs (explicit vs. implicit) is less critical than the overall principle of learning discriminative representations through comparison.
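The two BYOL ingredients mentioned above can be sketched in plain Python: a loss that compares only positive views, and a target network updated as a moving average of the online network. This is a simplified illustration on flat lists of numbers, not the reference implementation; the function names and the default `tau` are assumptions, and the stop-gradient appears only as a comment because no automatic differentiation is involved here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def byol_loss(online_prediction, target_projection):
    """Squared distance between L2-normalized vectors: 2 - 2 * cosine.

    Only a positive pair is compared; no negatives appear anywhere.
    In training, target_projection is treated as a constant (stop-gradient),
    so only the online network receives gradients from this loss.
    """
    return 2.0 - 2.0 * cosine(online_prediction, target_projection)

def ema_update(target_params, online_params, tau=0.99):
    """Move target weights slowly toward the online weights.

    target <- tau * target + (1 - tau) * online
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

# Views pointing the same way give zero loss; opposite views give the maximum.
same_direction = byol_loss([1.0, 0.0], [2.0, 0.0])
opposite = byol_loss([1.0, 0.0], [-1.0, 0.0])
# The target moves only a small step toward the online weights.
new_target = ema_update([0.0, 0.0], [1.0, 1.0], tau=0.9)
```

Note that the loss alone would be minimized by a constant embedding; it is the predictor asymmetry, the stop-gradient, and the slow target update that keep training away from that trivial solution.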