Supervised learning fits a mapping from inputs to outputs using labeled data. Classification predicts discrete labels; regression predicts continuous values. Loss functions quantify prediction errors. The goal is to minimize training error while still generalizing to unseen data, which requires balancing bias against variance.
If you have studied probability and basic algorithms, supervised learning is the natural next step toward building systems that learn from data. The core idea is simple: given many examples of (input, correct output) pairs, train a model to predict outputs for inputs it has never seen before. A spam filter trained on emails labeled "spam" or "not spam" and a model that predicts house prices from square footage and neighborhood are both examples of supervised learning.
The two main tasks are classification and regression. In classification, the output is a discrete category (spam/not spam, dog/cat/bird). In regression, the output is a continuous number (price, temperature, risk score). Despite different output types, the learning process is the same: choose a model family, define a loss function that measures prediction error, and adjust the model's parameters to minimize that loss on training examples.
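The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the model family (straight lines), the toy data, the function names, and the learning rate and step count are all made up for the example, and gradient descent is just one common way to do the "adjust the parameters" step.

```python
# Step 1: choose a model family. Assumed here: straight lines y = w * x + b.
def predict(w, b, x):
    return w * x + b

# Step 2: define a loss. Assumed here: mean squared error on the training pairs.
def mse(w, b, xs, ys):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Step 3: adjust the parameters to reduce the loss (plain gradient descent).
def fit(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up, noise-free training pairs generated from y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit(train_x, train_y)
print(f"w={w:.3f}, b={b:.3f}, training loss={mse(w, b, train_x, train_y):.6f}")
print("prediction for unseen input x=10:", round(predict(w, b, 10.0), 3))
```

Swapping the task from regression to classification would change the model's output and the loss, but the same three-step structure would remain.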
The loss function is the mathematical heart of supervised learning. It converts the question "how wrong is my model?" into a single number the algorithm can minimize. For regression, mean squared error (the average of the squared differences between prediction and truth) is the standard choice. For classification, cross-entropy loss is common; it penalizes confident wrong predictions severely, because the penalty grows without bound as the probability assigned to the correct class approaches zero. The optimizer (typically gradient descent or a variant) iteratively nudges the model's parameters in the direction that reduces the loss.
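To make the two losses concrete, here is a small sketch that computes each by hand. The function names and the example numbers are illustrative assumptions; the cross-entropy shown is the binary (two-class) form, and the clipping constant `eps` is only there to avoid taking the logarithm of zero.

```python
import math

def mean_squared_error(y_true, y_pred):
    # Average of squared differences between truth and prediction.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true holds labels in {0, 1}; p_pred holds predicted probabilities of class 1.
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Regression: errors of 0.5 and 1.0 give (0.25 + 1.0) / 2.
print(mean_squared_error([3.0, 5.0], [2.5, 6.0]))  # 0.625

# Classification: the true label is 1; watch the loss as the model's
# predicted probability for the correct class shrinks.
for p in [0.9, 0.6, 0.1, 0.01]:
    print(f"p(correct)={p}  loss={binary_cross_entropy([1], [p]):.3f}")
```

The loss for a mildly wrong prediction (0.6) is small, while a confident wrong prediction (0.01 on the correct class) costs many times more, which is the "severe penalty" described above.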
Here is the central tension every supervised learning practitioner must manage: a model that fits its training data perfectly often fails on new data. This is overfitting: the model has learned the quirks and noise of its training examples rather than the underlying pattern. Conversely, a model too simple to capture the real pattern underfits and performs poorly on training data and new data alike. In bias-variance terms, underfitting is high bias (the model's assumptions are too rigid) and overfitting is high variance (the fitted model changes too much from one training sample to another). This bias-variance tradeoff is not a problem to solve once and forget; it governs every modeling decision, from choosing model complexity to deciding how much training data to collect.
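A small experiment can make underfitting and overfitting visible. The sketch below is illustrative only: the synthetic data (a noisy line), the seed, and the three model choices are assumptions made for the example. It compares a constant predictor (too simple), a least-squares line (matched to the data), and a memorizer that returns the label of the nearest training point (fits the training set exactly).

```python
import random

random.seed(0)

def make_data(n):
    # Hypothetical ground truth: y = 3x + 2 plus Gaussian noise.
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 + random.gauss(0.0, 2.0) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    # Too simple: always predicts the mean training label.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Closed-form least-squares line.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    # Too flexible: returns the label of the nearest training input,
    # so it reproduces every training label, noise included.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

train_x, train_y = make_data(30)
test_x, test_y = make_data(200)

fitters = {"constant": fit_constant, "line": fit_line, "memorizer": fit_memorizer}
results = {}
for name, fit in fitters.items():
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:<10} train MSE={results[name][0]:7.2f}  test MSE={results[name][1]:7.2f}")
```

The memorizer's training error is exactly zero, yet its error on fresh data is not, which is the signature of overfitting. The constant model is poor on both sets, which is the signature of underfitting. On data like this, one would typically expect the line to have the lowest test error of the three.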
The standard discipline for managing this is to split the data into training, validation, and test sets. You train on the training set, tune choices such as model complexity and regularization strength on the validation set, and report final performance on the test set, which you touch only once. Because the test set played no part in training or tuning, its performance is your honest estimate of how the model will behave in the real world, provided real-world data resembles the data you collected.
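The workflow might look like the following sketch. Everything specific here is an assumption for illustration: the 60/20/20 split, the synthetic data, the helper names, and the use of k-nearest-neighbor regression with k as the tuned complexity setting.

```python
import random

def split(pairs, train_frac=0.6, val_frac=0.2, seed=0):
    # Shuffle a copy, then cut it into three disjoint pieces.
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    # Average the labels of the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Hypothetical dataset: a noisy line.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 2.0)))

train, val, test = split(data)

# Tune the complexity setting k using the validation set only.
candidate_ks = [1, 3, 5, 9, 15]
val_errors = {k: mse(train, val, k) for k in candidate_ks}
best_k = min(val_errors, key=val_errors.get)

# Touch the test set exactly once, after every choice has been made.
test_error = mse(train, test, best_k)
print("validation MSE by k:", {k: round(e, 2) for k, e in val_errors.items()})
print(f"chosen k={best_k}, test MSE={test_error:.2f}")
```

The key design point is the order of operations: the test set is not consulted while choosing k. If you went back and changed k after seeing the test error, that number would stop being an honest estimate.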