Joint entropy H(X,Y) measures the total uncertainty in a pair of random variables considered together. Conditional entropy H(Y|X) measures the uncertainty that remains in Y after observing X, that is, how much new information Y provides beyond what X already told you. The chain rule H(X,Y) = H(X) + H(Y|X) splits the joint uncertainty into the uncertainty in X plus the uncertainty that remains in Y once X is known. Conditioning never increases entropy on average: H(Y|X) <= H(Y), with equality if and only if X and Y are independent. These quantities form the algebraic backbone of information theory.
Shannon entropy measures the uncertainty in a single random variable. When you have two variables X and Y, you often want to know: how much total uncertainty is there, and how does knowing one reduce your uncertainty about the other? Joint and conditional entropy answer these questions precisely.
Joint entropy H(X,Y) = -sum over all (x,y) of p(x,y) log p(x,y) is simply Shannon entropy applied to the pair (X,Y) treated as a single random variable over the product space. Terms with p(x,y) = 0 contribute nothing, by the convention 0 log 0 = 0. With logarithms taken base 2, it measures the average number of bits needed to describe both variables together. If X and Y are independent, H(X,Y) = H(X) + H(Y): the total uncertainty is the sum of the individual uncertainties. If they are dependent, H(X,Y) < H(X) + H(Y) because some information is shared.
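As a rough illustration of the definition above, the sketch below computes joint entropy in bits for a distribution stored as a dictionary mapping (x, y) pairs to probabilities. The helper names (entropy, marginals, joint_entropy) and the two example distributions are assumptions made for this example, not part of any standard library.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def marginals(joint):
    """Return the marginal distributions p(x) and p(y) of a joint table."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def joint_entropy(joint):
    """H(X,Y): entropy of the pair (X, Y) treated as one variable."""
    return entropy(joint.values())

# Hypothetical example 1: two independent fair coins.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

# Hypothetical example 2: Y agrees with a fair coin X 90% of the time.
dependent = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

for name, joint in (("independent", independent), ("dependent", dependent)):
    px, py = marginals(joint)
    h_sum = entropy(px.values()) + entropy(py.values())
    print(f"{name}: H(X,Y) = {joint_entropy(joint):.4f}, "
          f"H(X) + H(Y) = {h_sum:.4f}")
```

In the independent case the two printed values agree; in the dependent case the joint entropy falls below the sum of the marginal entropies, reflecting the shared information.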
Conditional entropy H(Y|X) = sum over x of p(x) * H(Y|X=x) is the average remaining uncertainty in Y after learning X. For each specific value x, H(Y|X=x) is the entropy of Y's conditional distribution given X=x; the conditional entropy averages this over all values of X, weighting each by its probability p(x). If X completely determines Y (as a student's exam answers determine their score), then H(Y|X) = 0. If X tells you nothing about Y (independence), then H(Y|X) = H(Y).
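The averaging in this definition can be sketched directly in code. The function below is a minimal illustration, assuming the joint distribution is given as a dictionary from (x, y) pairs to probabilities; the name conditional_entropy and the two test distributions are hypothetical choices for this example.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) in bits for a joint table {(x, y): p(x, y)}."""
    # Collect the joint probabilities that share the same x.
    rows = {}
    for (x, _y), p in joint.items():
        rows.setdefault(x, []).append(p)

    total = 0.0
    for ps in rows.values():
        px = sum(ps)  # marginal probability p(x)
        if px > 0:
            # Normalizing the row gives p(y | x); weight its entropy by p(x).
            total += px * entropy(p / px for p in ps)
    return total

# X uniform on {0, 1, 2, 3}, Y = X mod 2: X fully determines Y.
deterministic = {(x, x % 2): 0.25 for x in range(4)}

# Two independent fair coins: X says nothing about Y.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(conditional_entropy(deterministic))  # no uncertainty left in Y
print(conditional_entropy(independent))    # equals H(Y), one bit for a fair coin
```

The two cases correspond to the extremes described above: zero remaining uncertainty when X determines Y, and the full H(Y) when the variables are independent.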
The chain rule connects these: H(X,Y) = H(X) + H(Y|X). The total uncertainty in (X,Y) equals the uncertainty in X plus whatever uncertainty remains in Y after X is known. By symmetry, H(X,Y) = H(Y) + H(X|Y) as well. The rule extends to more variables: H(X,Y,Z) = H(X) + H(Y|X) + H(Z|X,Y). A fundamental inequality, often summarized as "information never hurts," states that H(Y|X) <= H(Y): on average, knowing more cannot increase your uncertainty. The statement holds only on average; for a particular outcome x, H(Y|X=x) can exceed H(Y). The gap H(Y) - H(Y|X) is the mutual information I(X;Y), which measures how much X tells you about Y. Joint entropy and conditional entropy, linked by the chain rule, form the algebraic foundation on which the rest of information theory is built.
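The identities in this paragraph can be checked numerically on a small example. The sketch below uses a hypothetical three-outcome joint distribution, chosen so that one particular value of X leaves Y more uncertain than it was beforehand, while the average H(Y|X) still stays below H(Y). All variable names are illustrative assumptions.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y); omitted pairs have probability 0.
joint = {(0, 0): 0.75, (1, 0): 0.125, (1, 1): 0.125}

# Marginal distributions p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

h_xy = entropy(joint.values())
h_x = entropy(px.values())
h_y = entropy(py.values())

# H(Y|X=x) for each individual value x, then the p(x)-weighted average.
h_y_given_each_x = {
    x: entropy(p / px[x] for (xx, _y), p in joint.items() if xx == x)
    for x in px
}
h_y_given_x = sum(px[x] * h for x, h in h_y_given_each_x.items())

# Mutual information as the gap H(Y) - H(Y|X).
mutual_information = h_y - h_y_given_x

print("chain rule holds:", isclose(h_xy, h_x + h_y_given_x))
print("H(Y|X) <= H(Y):", h_y_given_x <= h_y)
print("H(Y|X=1) > H(Y):", h_y_given_each_x[1] > h_y)
print(f"I(X;Y) = {mutual_information:.4f} bits")
```

Here observing X=1 makes Y a fair coin flip, which is more uncertain than Y was before any observation, yet the weighted average over both values of X is still smaller than H(Y).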