Differential entropy h(X) = -integral f(x) log f(x) dx extends Shannon entropy to continuous random variables by replacing sums with integrals and probabilities with densities. Unlike discrete entropy, differential entropy can be negative (a sufficiently narrow Gaussian has h(X) < 0). It is NOT the limit of discrete entropy as the quantization becomes finer; that limit diverges. Despite this, differences of differential entropies are well-behaved and match the corresponding discrete quantities: mutual information I(X;Y) = h(X) - h(X|Y) is always non-negative, although it can be infinite when one variable determines the other exactly. Differential entropy is essential for analyzing continuous channels, Gaussian sources, and rate-distortion theory.
Shannon entropy works perfectly for discrete random variables, but continuous variables require care. The natural first attempt is to substitute integrals for sums and densities for probabilities in the entropy formula, and that is exactly how differential entropy is defined: h(X) = -integral f(x) log f(x) dx, where f(x) is the probability density function and the integral runs over the support of f. With log base 2 the result is in bits; with the natural log it is in nats. This quantity is useful but has important differences from its discrete counterpart.
The most striking difference is that differential entropy can be negative. A Uniform(0, 1/2) random variable has density 2 on its support, so h(X) = log2(1/2) = -1 bit. A very narrow Gaussian has density values well above 1 near its mean, making -f(x) log f(x) negative in the region that carries most of the probability mass. Specifically, a Gaussian with variance sigma^2 has h(X) = (1/2) log(2 pi e sigma^2), which is negative whenever sigma^2 < 1/(2 pi e). This seems paradoxical until you realize what happened: densities can exceed 1 (unlike probabilities), so log f(x) can be positive, flipping the sign. The negativity reflects extreme concentration, not any pathology.
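The two examples above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the value sigma = 0.05, and the midpoint-rule integrator are illustrative assumptions, not part of the original discussion.

```python
import math

def uniform_entropy_bits(a, b):
    """Closed form for Uniform(a, b): h = log2(b - a)."""
    return math.log2(b - a)

def gaussian_entropy_bits(sigma):
    """Closed form for a Gaussian with standard deviation sigma:
    h = 0.5 * log2(2 * pi * e * sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def numeric_entropy_bits(pdf, lo, hi, n=20_000):
    """Midpoint-rule estimate of -integral f(x) log2 f(x) dx over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0.0:  # treat 0 * log 0 as 0
            total -= fx * math.log2(fx) * dx
    return total

sigma = 0.05  # illustrative value, well below 1/sqrt(2*pi*e) ~ 0.242
print(uniform_entropy_bits(0.0, 0.5))                 # -1.0
print(gaussian_entropy_bits(sigma))                   # negative
print(numeric_entropy_bits(lambda x: gaussian_pdf(x, sigma),
                           -10 * sigma, 10 * sigma))  # close to the closed form
```

The numerical estimate integrates over plus or minus ten standard deviations, which is an arbitrary but generous truncation; the tails beyond that contribute a negligible amount.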
The deeper issue is that differential entropy is NOT the true continuous analog of discrete entropy. If you quantize X into bins of width delta, the discrete entropy of the quantized variable is approximately h(X) + log(1/delta) for small delta. As delta shrinks, the discrete entropy grows without bound, because it takes infinitely many bits to specify a continuous value exactly. Differential entropy is what remains after subtracting this diverging offset. Consequently, h(X) depends on the coordinate system: scaling X by a nonzero constant a gives h(aX) = h(X) + log|a|, whereas discrete entropy is invariant under any one-to-one relabeling of the alphabet.
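Both effects in this paragraph can be seen in a small experiment. The sketch below assumes a standard normal X, quantizes it on a uniform grid, and compares the resulting discrete entropy with h(X) + log2(1/delta); it then checks the scaling shift for a Gaussian. The bin widths, the truncation at plus or minus 8, and the scale factor 1000 are illustrative choices.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantized_entropy_bits(delta, span=8.0):
    """Discrete entropy (bits) of N(0, 1) quantized into bins of width delta.
    Bins tile [-span, span]; the mass outside is negligible for span = 8."""
    n_bins = int(round(2 * span / delta))
    total = 0.0
    for k in range(n_bins):
        left = -span + k * delta
        p = normal_cdf(left + delta) - normal_cdf(left)
        if p > 0.0:  # skip empty bins (0 * log 0 = 0)
            total -= p * math.log2(p)
    return total

def gaussian_entropy_bits(sigma):
    """h = 0.5 * log2(2 * pi * e * sigma^2) for standard deviation sigma."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

h = gaussian_entropy_bits(1.0)  # h(X) for N(0, 1), about 2.05 bits

for delta in (0.5, 0.1, 0.01):
    H = quantized_entropy_bits(delta)
    # H keeps growing as delta shrinks; H - log2(1/delta) settles near h.
    print(delta, H, H - math.log2(1 / delta), h)

# Scaling: h(aX) - h(X) should equal log2|a|.
a = 1000.0  # e.g. re-expressing metres as millimetres
print(gaussian_entropy_bits(a * 1.0) - gaussian_entropy_bits(1.0), math.log2(a))
```

The second column of the loop output grows by roughly log2 of the refinement factor each time the bins shrink, while the third column stays close to h(X).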
Despite these subtleties, differential entropy is extremely useful because differences of differential entropies are well-behaved. Mutual information I(X;Y) = h(X) - h(X|Y) is always non-negative, is invariant under invertible changes of coordinates (the log|a| shifts cancel in the difference), and has the same operational interpretation as in the discrete case. The capacity of the Gaussian channel with signal power P and noise variance N, C = (1/2) log(1 + P/N), is derived using differential entropy. Rate-distortion functions for continuous sources use differential entropy. The maximum-entropy property of the Gaussian (h_Gauss >= h_other for fixed variance) is proved using differential entropy. The rule of thumb: use differential entropy freely in calculations, but only trust differences of differential entropies for operational conclusions.
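As an illustration of the rule of thumb, the sketch below computes the mutual information of a Gaussian channel Y = X + Z as a difference of differential entropies and shows that rescaling Y shifts each entropy but not their difference. It uses the symmetric form I(X;Y) = h(Y) - h(Y|X), assumes X ~ N(0, P) and independent noise Z ~ N(0, N), and picks P = 3, N = 1, and a scale factor of 1000 purely for illustration.

```python
import math

def gaussian_entropy_bits(variance):
    """Differential entropy (bits) of a Gaussian with the given variance."""
    return 0.5 * math.log2(2 * math.pi * math.e * variance)

def awgn_mutual_information_bits(P, N, scale=1.0):
    """I(X; Y) for Y = scale * (X + Z), X ~ N(0, P), Z ~ N(0, N) independent,
    computed as h(Y) - h(Y | X)."""
    h_y = gaussian_entropy_bits(scale ** 2 * (P + N))
    # Given X, the only remaining randomness in Y is the scaled noise.
    h_y_given_x = gaussian_entropy_bits(scale ** 2 * N)
    return h_y - h_y_given_x

def awgn_capacity_bits(P, N):
    """C = (1/2) log2(1 + P/N) in bits per channel use."""
    return 0.5 * math.log2(1 + P / N)

P, N = 3.0, 1.0  # illustrative signal power and noise variance
print(awgn_capacity_bits(P, N))                          # 1.0
print(awgn_mutual_information_bits(P, N))                # matches within rounding
print(awgn_mutual_information_bits(P, N, scale=1000.0))  # unchanged by rescaling Y
print(gaussian_entropy_bits(P + N))                      # h(Y) in the original units
print(gaussian_entropy_bits(1000.0 ** 2 * (P + N)))      # h(Y) shifts by log2(1000)
```

The last two lines differ by about 10 bits even though nothing about the channel has changed, which is why only the difference carries operational meaning.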