Contrastive learning is a self-supervised framework that learns representations by bringing similar examples close in embedding space while pushing dissimilar examples apart. The theory is grounded in mutual information maximization: the contrastive objective maximizes a lower bound on the mutual information between the embeddings of similar examples (positive pairs), using dissimilar examples (negative pairs) as the contrast set. Methods built on this idea include SimCLR, MoCo, and BYOL, which achieve strong performance on downstream tasks by learning from unlabeled data. Contrastive learning theory provides a principled approach to representation learning without labels, with connections to information theory, metric learning, and noise-contrastive estimation.
The core idea is to define similarity through augmentation: two augmentations of the same image are similar, while augmentations of different images are dissimilar. A model is then trained to embed similar examples close together while pushing dissimilar ones apart.
Theoretically, contrastive learning is grounded in noise-contrastive estimation (NCE) and mutual information maximization. NCE, introduced by Gutmann and Hyvärinen, casts density estimation as a classification problem: distinguish data samples from noise samples. The InfoNCE objective of van den Oord et al. adapts this idea to representation learning and shows that minimizing a contrastive loss (distinguishing positive from negative examples) maximizes a lower bound on mutual information. Specifically, for a positive pair (x, x+) formed from the same example with different augmentations, maximizing I(z; z+), the mutual information between their embeddings, discourages the representation from discarding information shared by the two views, which is assumed to be the task-relevant part.
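For reference, the bound usually cited for this claim is the InfoNCE bound. The statement below is a sketch in notation that is assumed here rather than taken from the text: N is the number of candidates scored per anchor (one positive plus N - 1 negatives), and f is the similarity score used inside the loss.

```latex
I(z; z^{+}) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}},
\qquad
\mathcal{L}_{\mathrm{InfoNCE}}
= -\,\mathbb{E}\left[
    \log \frac{\exp\big(f(z, z^{+})\big)}
              {\sum_{k=1}^{N} \exp\big(f(z, z_k)\big)}
  \right]
```

Because the loss is a cross-entropy and therefore non-negative, the right-hand side can never exceed log N, however good the representation is.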
The typical contrastive loss is NT-Xent (normalized temperature-scaled cross-entropy):
L = -log( exp(sim(z_i, z_j+) / tau) / sum_k exp(sim(z_i, z_k) / tau) )
where z_i and z_j+ are embeddings of a positive pair, z_k ranges over all other embeddings in the batch (the positive z_j+ together with the negatives), sim is cosine similarity, and tau is the temperature. The loss can be read as a multiclass classification task: given the anchor i, pick out its positive j+ from among the negatives. Minimizing it pulls positive pairs together (raising the numerator) while pushing negatives apart (shrinking their terms in the denominator).
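The loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the batch layout (rows i and i + N are the two views of example i), the function name `nt_xent`, and the default temperature are all assumptions made for the example.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N embeddings; rows i and i + N are assumed to be a positive pair."""
    two_n = z.shape[0]
    n = two_n // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors, so dot product = cosine
    logits = z @ z.T / tau                             # sim(z_i, z_k) / tau for all pairs
    np.fill_diagonal(logits, -np.inf)                  # an embedding is never its own candidate
    pos = (np.arange(two_n) + n) % two_n               # index of each row's positive
    # Numerically stable log of the denominator (log-sum-exp over k != i).
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # -log(numerator / denominator), averaged over all 2N anchors.
    return float(np.mean(log_denom - logits[np.arange(two_n), pos]))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
aligned = np.concatenate([a, a + 0.01 * rng.normal(size=a.shape)])   # views nearly identical
unrelated = np.concatenate([a, rng.normal(size=a.shape)])            # "positives" are random
print(nt_xent(aligned), nt_xent(unrelated))
```

The first value should come out lower than the second, since the loss rewards batches whose paired rows point in the same direction.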
The information-theoretic interpretation is critical: by maximizing I(z_i; z_j+), the representation z is encouraged to retain information that is invariant across augmentations (true shared structure) and is free to discard information that is specific to one augmentation (noise). This is what you want in a representation: shared, generalizable structure. The mutual information view also connects to information bottleneck theory: the representation should be maximally informative about the invariant structure while being minimally informative about augmentation-specific details.
Practical algorithms exploit this theory. SimCLR (a Simple framework for Contrastive Learning of visual Representations) learns from unlabeled images by: (1) applying two independent augmentations to each image, (2) encoding both augmentations with a CNN such as a ResNet, (3) passing the resulting representations through a small MLP projection head, typically into a lower-dimensional space, (4) minimizing the NT-Xent loss between the two projections, treating them as a positive pair. After training, the projection head is discarded and the encoder's representations are used as features or as initialization for downstream tasks, where they achieve performance competitive with supervised learning.
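The four steps can be traced in a toy forward pass. Everything below is a hypothetical stand-in chosen to keep the sketch self-contained: feature vectors play the role of images, `augment` is random masking plus noise, `encode` is a single ReLU layer instead of a CNN, and the layer widths (64, 128, 32) are arbitrary. NumPy has no automatic differentiation, so only the forward computation is shown; actual training would run the same steps in an autodiff framework.

```python
import numpy as np

def augment(x, rng):
    """Toy stand-in for image augmentation: random feature masking plus Gaussian noise."""
    keep = rng.random(x.shape) > 0.2
    return x * keep + 0.05 * rng.normal(size=x.shape)

def encode(x, w_enc):
    """Stand-in for the CNN encoder: one linear layer with ReLU."""
    return np.maximum(x @ w_enc, 0.0)

def project(h, w1, w2):
    """Two-layer MLP projection head; the loss is computed on its output."""
    return np.maximum(h @ w1, 0.0) @ w2

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N rows, where rows i and i + N form a positive pair."""
    two_n = z.shape[0]
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)            # an embedding is never its own candidate
    pos = (np.arange(two_n) + two_n // 2) % two_n
    log_denom = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(log_denom - logits[np.arange(two_n), pos]))

def simclr_forward(x, params, rng, tau=0.5):
    """One forward pass of the four steps; returns the loss and the encoder outputs."""
    v1, v2 = augment(x, rng), augment(x, rng)            # (1) two independent augmentations
    h = encode(np.concatenate([v1, v2]), params["enc"])  # (2) shared encoder
    z = project(h, params["p1"], params["p2"])           # (3) projection head
    return nt_xent(z, tau), h                            # (4) NT-Xent on the projections

rng = np.random.default_rng(0)
params = {
    "enc": rng.normal(size=(64, 128)) / 8.0,
    "p1": rng.normal(size=(128, 128)) / 11.0,
    "p2": rng.normal(size=(128, 32)) / 11.0,
}
x = rng.normal(size=(16, 64))        # 16 toy "images" with 64 features each
loss, h = simclr_forward(x, params, rng)
print(loss, h.shape)
```

Note that the function returns `h`, the encoder output, alongside the loss: it is `h`, not the projection `z`, that would be carried over to a downstream task.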
The role of negatives is crucial in classical contrastive theory. The denominator of NT-Xent includes all negative pairs (different images in the batch). Larger batches provide more negatives, improving the quality of the contrastive gradient. This explains why contrastive methods benefit from large batch sizes: more negatives give a sharper contrast and, in turn, better representations. It also explains why maintaining a queue of past embeddings, produced by a slowly updated momentum encoder as in MoCo, improves performance: it enlarges the pool of available negatives without increasing batch size.
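A rough sketch of how a pool of past embeddings decouples the number of negatives from the batch size, in the spirit of MoCo. The class and function names (`NegativeQueue`, `momentum_update`, `queue_contrastive_loss`) are hypothetical, and the momentum and temperature defaults are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def l2_normalize(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

class NegativeQueue:
    """Fixed-size FIFO buffer of past key embeddings, reused as negatives."""

    def __init__(self, size, dim):
        self.keys = np.zeros((size, dim))
        self.count = 0   # number of slots filled so far
        self.ptr = 0     # next slot to overwrite (the oldest entry once full)

    def enqueue(self, new_keys):
        for k in new_keys:
            self.keys[self.ptr] = k
            self.ptr = (self.ptr + 1) % len(self.keys)
            self.count = min(self.count + 1, len(self.keys))

    def negatives(self):
        return self.keys[:self.count]

def momentum_update(key_params, query_params, m=0.999):
    """Key encoder weights trail the query encoder as an exponential moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def queue_contrastive_loss(q, k_pos, queue_keys, tau=0.07):
    """Contrastive loss with one positive key per query and negatives from the queue."""
    q, k_pos, queue_keys = l2_normalize(q), l2_normalize(k_pos), l2_normalize(queue_keys)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)        # (B, 1) positive logits
    l_neg = q @ queue_keys.T                                # (B, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau   # positive sits in column 0
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(log_denom - logits[:, 0]))

rng = np.random.default_rng(0)
queue = NegativeQueue(size=256, dim=8)
for _ in range(10):                       # ten past batches of 32 keys each
    queue.enqueue(rng.normal(size=(32, 8)))
q = rng.normal(size=(4, 8))               # current batch of only 4 queries
k_pos = q + 0.01 * rng.normal(size=q.shape)
print(queue.negatives().shape, queue_contrastive_loss(q, k_pos, queue.negatives()))
```

In the usage lines, a batch of 4 queries is contrasted against 256 stored keys, which is the point of the design: the negative pool is sized by the queue, not by the batch.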
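Scaling properties: The benefit of additional negatives grows slowly. The mutual-information lower bound given by the contrastive loss cannot exceed log N, where N is the number of candidates scored per anchor, so each doubling of the negative pool raises that ceiling by only log 2. Gains from more negatives are therefore real but diminishing, and how many are needed for a given embedding dimensionality or task difficulty is an empirical question rather than a settled scaling law.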
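Variants and refinements extend the theory. SwAV replaces instance discrimination with online clustering, predicting the cluster assignment of one view from another. BYOL omits explicit negatives, relying instead on a predictor network, a momentum (exponential moving average) target network, and stop-gradient operations to avoid collapse. SimSiam simplifies this further: it drops the momentum encoder as well as negatives and large batches, and relies on the predictor and stop-gradient alone. Redundancy reduction, by contrast, is the principle behind Barlow Twins. These variants all maintain the core principle of learning representations by comparing views of the data, with negatives either explicit or replaced by mechanisms that prevent collapse.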
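To make the negative-free variants concrete, here is a minimal sketch of the two ingredients usually associated with BYOL: a target network updated as a moving average of the online network, and a loss that only compares a positive pair. The function names and the momentum value are assumptions for illustration. NumPy computes no gradients, so the stop-gradient appears only as a comment.

```python
import numpy as np

def ema_update(target, online, m=0.99):
    """Target weights trail the online weights: target <- m * target + (1 - m) * online."""
    return {name: m * target[name] + (1.0 - m) * online[name] for name in target}

def byol_style_loss(pred_online, proj_target):
    """Mean of 2 - 2 * cosine similarity between matched rows; no negatives involved.

    In an autodiff framework, proj_target would be wrapped in a stop-gradient so that
    only the online branch receives gradients.
    """
    p = pred_online / (np.linalg.norm(pred_online, axis=1, keepdims=True) + 1e-12)
    t = proj_target / (np.linalg.norm(proj_target, axis=1, keepdims=True) + 1e-12)
    return float(np.mean(2.0 - 2.0 * np.sum(p * t, axis=1)))

rng = np.random.default_rng(0)
online = {"w": rng.normal(size=(4, 4))}
target = {"w": np.zeros((4, 4))}
target = ema_update(target, online)        # target moves 1% of the way toward online
a = rng.normal(size=(5, 4))
print(byol_style_loss(a, a), byol_style_loss(a, -a))
```

The loss ranges from 0 for perfectly aligned pairs to 4 for opposite ones. Nothing in it pushes different examples apart, which is why the asymmetry between the two branches is needed to keep all embeddings from collapsing to a single point.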
Limitations: Contrastive learning requires careful hyperparameter tuning (batch size, temperature, projection dimension, augmentation strength). The method is sensitive to how positives are defined, that is, to the choice of augmentations. Additionally, contrastive learning may learn invariances that hurt downstream tasks: two images of the same object in different poses are treated as a positive pair, even if the downstream task depends on pose. Finally, the method requires substantial compute (large batches, long training) to match supervised baseline performance, offsetting some of the efficiency gained by avoiding labels.
Contrastive learning's success in vision and emerging applications in language demonstrate that self-supervised learning at scale is feasible, with implications for leveraging unlabeled data and learning general-purpose representations.