A family of distributions {f(x|θ)} belongs to the exponential family if it has the form f(x|θ) = h(x) exp{Σⱼ ηⱼ(θ)Tⱼ(x) − A(θ)}, where A(θ) is the log-partition function. Examples include the normal, binomial (with the number of trials fixed), Poisson, and exponential distributions. The exponential family is mathematically convenient: sufficient statistics are easy to identify, conjugate priors exist, and maximum likelihood estimators often have closed forms.
You've studied distributions such as the normal, binomial, Poisson, and exponential, and each seemed to come with its own density formula, its own moment calculations, and its own estimation methods. The exponential family is the observation that all of these distributions, along with many others, share one underlying mathematical structure. That shared structure, not a coincidence of convenient formulas, is what makes them analytically tractable.
A distribution belongs to the exponential family if its density (or probability mass function) can be written as f(x|θ) = h(x) exp{η(θ)·T(x) − A(θ)}, where η(θ)·T(x) is an inner product when there is more than one parameter. The function T(x) is the sufficient statistic: it captures everything the data can tell you about θ. The function η(θ) is the natural parameter; in canonical form the family is indexed by η itself and the log-partition function is written A(η). The term h(x) depends only on the data (not on θ), and A(θ) is the log-partition function, a normalizing term that ensures the density integrates (or the mass function sums) to 1. As an example, the Poisson distribution has h(x) = 1/x!, η(λ) = log(λ), T(x) = x, and A(λ) = λ, because λˣ e^(−λ)/x! = (1/x!) exp{x log(λ) − λ}.
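The Poisson decomposition above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the value λ = 3.5 are illustrative choices, not part of the original text. It assembles the mass function from h, η, T, and A and compares it with the textbook formula.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson pmf: lam^x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expfam_pmf(x, lam):
    """The same pmf assembled from the exponential-family pieces."""
    h = 1.0 / math.factorial(x)   # base measure h(x) = 1/x!
    eta = math.log(lam)           # natural parameter eta(lam) = log(lam)
    T = x                         # sufficient statistic T(x) = x
    A = lam                       # log-partition function A(lam) = lam
    return h * math.exp(eta * T - A)

lam = 3.5  # arbitrary rate chosen for illustration

# Largest pointwise disagreement between the two formulas on x = 0..19.
max_gap = max(abs(poisson_pmf(x, lam) - expfam_pmf(x, lam)) for x in range(20))

# A(lam) normalizes the mass function, so the probabilities should sum to ~1.
total = sum(expfam_pmf(x, lam) for x in range(60))

print(max_gap, total)
```

The two formulas should agree up to floating-point rounding, and the total should be very close to 1, which is the role A plays as a normalizer.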
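The log-partition function A is where the power of the framework becomes concrete. As you know from maximum likelihood estimation, fitting parameters requires computing expectations and derivatives of the log-likelihood. For exponential family members in canonical form, these calculations reduce to derivatives of A with respect to the natural parameter: E[T(X)] = A'(η) and Var[T(X)] = A''(η). (For a vector-valued η, read A' as the gradient and A'' as the Hessian, which gives the covariance matrix of T.) For the Poisson, A(η) = e^η with η = log(λ), so both derivatives equal e^η = λ, which recovers the familiar mean and variance. The mean and variance of the sufficient statistic can therefore be read off one function, without performing separate integrals for each distribution. The same identity gives the likelihood equation: setting the derivative of the log-likelihood to zero yields A'(η) = (1/n) ΣT(xᵢ), so the MLE matches the model mean of T to its sample mean. The Gaussian, Bernoulli, Poisson, and Gamma all yield their moments through this single formula with different A's.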
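As a rough numerical illustration of the moment identities, the sketch below differentiates A by finite differences for two cases, taking A as a function of the natural parameter: the Poisson, with A(η) = e^η, and the Bernoulli, with A(η) = log(1 + e^η). The helper names, step sizes, and parameter values are assumptions made for this example only.

```python
import math

def d1(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def A_poisson(eta):
    """Poisson log-partition function in the natural parameter eta = log(lam)."""
    return math.exp(eta)

def A_bernoulli(eta):
    """Bernoulli log-partition function in eta = log(p / (1 - p))."""
    return math.log1p(math.exp(eta))

lam = 3.5  # illustrative Poisson rate
eta_pois = math.log(lam)
poisson_mean = d1(A_poisson, eta_pois)  # should be close to lam
poisson_var = d2(A_poisson, eta_pois)   # should also be close to lam

p = 0.3  # illustrative Bernoulli success probability
eta_bern = math.log(p / (1 - p))
bern_mean = d1(A_bernoulli, eta_bern)   # should be close to p
bern_var = d2(A_bernoulli, eta_bern)    # should be close to p * (1 - p)

print(poisson_mean, poisson_var, bern_mean, bern_var)
```

Only the function A changes between the two cases; the differentiation code is shared, which is the point of the identity. Finite differences are used here purely for transparency, and an automatic differentiation library would serve equally well.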
This structure also explains the existence of conjugate priors in Bayesian inference. Suppose the prior on η has the form π(η) ∝ exp{χ·η − ν·A(η)}. After observing n independent data points with sufficient statistics T(x₁), …, T(xₙ), the likelihood is proportional to exp{η·ΣT(xᵢ) − n·A(η)}, so multiplying it by the prior gives a posterior of the same functional form with updated hyperparameters χ + ΣT(xᵢ) and ν + n. Bayesian updating reduces to adding the observed sufficient statistics to χ and the sample size to ν; no integration is needed to identify the posterior. This conjugate structure is not a lucky coincidence; it is a direct consequence of the exponential family form, and it is a major reason distributions from this family appear so frequently in probabilistic models and Bayesian statistics.
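The update rule can be sketched for the Poisson case, where T(x) = x and A(η) = e^η. The code below is an illustration under stated assumptions: the data, the prior hyperparameters, and all function names are made up for the example. It applies the hyperparameter update and then checks, on a grid of η values, that prior times likelihood differs from the updated prior form only by a constant that does not depend on η.

```python
import math

def A(eta):
    """Poisson log-partition function in the natural parameter."""
    return math.exp(eta)

def log_conjugate(eta, chi, nu):
    """Unnormalized log density of the form exp{chi*eta - nu*A(eta)}."""
    return chi * eta - nu * A(eta)

def log_likelihood(eta, data):
    """Poisson log-likelihood: sum of log h(x) + eta*T(x) - A(eta), with T(x) = x."""
    return sum(-math.lgamma(x + 1) + eta * x - A(eta) for x in data)

def conjugate_update(chi, nu, data):
    """Add the sufficient statistics to chi and the sample size to nu."""
    return chi + sum(data), nu + len(data)

data = [2, 4, 3, 5, 1]  # made-up counts
chi0, nu0 = 2.0, 1.0    # arbitrary prior hyperparameters
chi1, nu1 = conjugate_update(chi0, nu0, data)

# If the posterior has the claimed form, this gap is the same at every eta
# (it equals the sum of log h(x_i), which does not involve eta).
etas = [-1.0 + 0.25 * k for k in range(13)]
gaps = [
    log_conjugate(e, chi0, nu0) + log_likelihood(e, data) - log_conjugate(e, chi1, nu1)
    for e in etas
]
spread = max(gaps) - min(gaps)

print(chi1, nu1, spread)
```

A spread near zero confirms that the posterior is the same family with hyperparameters χ + Σxᵢ and ν + n. Changing variables to λ = e^η turns this prior into a Gamma density with shape χ and rate ν, which is the familiar Gamma-Poisson conjugate pair.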