A convex optimization problem minimizes a convex function over a convex set. Convexity guarantees that every local minimum is a global minimum — there are no suboptimal traps. This structural property makes convex problems fundamentally tractable: gradient descent and its variants are guaranteed to find the global optimum, and strong duality often holds, providing both an alternative solution method and optimality certificates. Most classical ML loss functions (linear regression, logistic regression, SVMs) are convex, and understanding convexity is essential for knowing when optimization is "easy" and when the non-convexity of deep learning is a genuine theoretical challenge.
Convex optimization occupies a privileged position in machine learning: it is the largest class of optimization problems for which we have complete, efficient solutions. Understanding convexity explains why some ML problems (linear regression, SVMs, logistic regression) come with strong theoretical guarantees while others (deep learning) remain theoretically mysterious.
A set S is convex if the line segment between any two points in S lies entirely within S. A function f is convex if its epigraph (the set of points above its graph) is a convex set, equivalently if f(lambda*x + (1-lambda)*y) <= lambda*f(x) + (1-lambda)*f(y). The fundamental consequence is that any local minimum of a convex function over a convex set is a global minimum. There are no ridges, valleys, or saddle points that could trap a descent algorithm — every downhill direction leads toward the global optimum. This geometric simplicity translates directly into algorithmic guarantees.
Gradient descent on a smooth convex function converges at rate O(1/T). Nesterov's accelerated gradient descent achieves O(1/T^2) — provably the fastest rate achievable by first-order methods (methods that use only gradient information). For strongly convex functions, gradient descent converges exponentially: O(exp(-T * mu/L)), where mu is the strong convexity parameter and L is the smoothness parameter. These are not empirical observations but proven theorems, with matching lower bounds showing no first-order method can do better. The duality theory adds another dimension: every convex optimization problem has a dual problem whose optimal value provides a lower bound on the primal optimal value, and under mild conditions (Slater's constraint qualification), the two values are equal. This strong duality enables algorithms that solve the dual (often simpler) problem instead.
For machine learning, convexity is the boundary between well-understood and frontier. Regularized empirical risk minimization with convex losses (squared loss, logistic loss, hinge loss) and convex regularizers (L1, L2) is a convex problem — global convergence is guaranteed, and the theoretical analysis of these methods is essentially complete. Deep learning uses non-convex losses (the composition of nonlinear activation functions creates a non-convex landscape), and the theory cannot guarantee finding global optima. The ongoing effort to understand why SGD succeeds on non-convex deep learning landscapes — through concepts like loss landscape flatness, implicit regularization, and over-parameterization — represents one of the most active areas in ML theory, and convex optimization theory provides both the tools and the benchmark against which progress is measured.