Derive the chain rule for entropy H(X,Y) = H(X) + H(Y|X) from the definition of joint and conditional entropy, and explain why the decomposition is asymmetric.
Think about your answer, then reveal below.
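Model answer: Start from H(X,Y) = -sum_{x,y} p(x,y) log p(x,y) and factor p(x,y) = p(x)*p(y|x). Then log p(x,y) = log p(x) + log p(y|x), so H(X,Y) = -sum_{x,y} p(x,y) log p(x) - sum_{x,y} p(x,y) log p(y|x). In the first sum, log p(x) does not depend on y, so summing out y gives sum_y p(x,y) = p(x) and the term becomes -sum_x p(x) log p(x) = H(X). The second sum is H(Y|X) by definition; equivalently, it is sum_x p(x) H(Y|X=x).
The decomposition is asymmetric: H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y), but H(Y|X) != H(X|Y) in general. Subtracting the two expansions gives H(Y|X) - H(X|Y) = H(Y) - H(X), so the conditional entropies differ exactly when the marginal entropies differ. The reduction in uncertainty is the same in both directions, H(Y) - H(Y|X) = H(X) - H(X|Y) (the mutual information); what differs is how much uncertainty remains about Y after observing X versus about X after observing Y.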
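A quick numerical check can make the derivation concrete. The sketch below uses a made-up joint distribution (the table `joint` and the helper names are illustrative assumptions, not part of the card) and computes each conditional entropy directly from its definition, then compares both chain-rule expansions against the joint entropy.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 -> p(x), axis 1 -> p(y))."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def conditional_entropy(joint, given_axis):
    """H(other | given) = -sum_{x,y} p(x,y) log2( p(x,y) / p(given) )."""
    marg = marginal(joint, given_axis)
    return -sum(
        p * math.log2(p / marg[key[given_axis]])
        for key, p in joint.items()
        if p > 0
    )

# Hypothetical joint distribution p(x, y), chosen only for illustration.
joint = {
    (0, 0): 0.25, (0, 1): 0.25, (0, 2): 0.0,
    (1, 0): 0.125, (1, 1): 0.125, (1, 2): 0.25,
}

h_xy = entropy(joint.values())
h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
h_y_given_x = conditional_entropy(joint, 0)  # condition on X
h_x_given_y = conditional_entropy(joint, 1)  # condition on Y

print("H(X,Y)        =", h_xy)
print("H(X) + H(Y|X) =", h_x + h_y_given_x)
print("H(Y) + H(X|Y) =", h_y + h_x_given_y)
print("H(Y|X), H(X|Y) =", h_y_given_x, h_x_given_y)
```

For this table both expansions agree with the joint entropy while the two conditional entropies differ, which is the asymmetry the question asks about.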
Both orderings are valid chain rules and give the same joint entropy. The asymmetry reflects a real phenomenon: in a teacher-student pair, knowing the teacher's rubric might almost determine the student's grade (low H(student|teacher)), but knowing only the student's grade may leave substantial uncertainty about which rubric the teacher used (high H(teacher|student)).
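The teacher-student picture can be turned into a toy calculation. Everything in the sketch below is an assumption made for illustration: four equally likely rubrics, a binary grade, and a 10% chance (`FLIP`) that the grade deviates from what the rubric usually produces. It obtains each conditional entropy from the chain rule and checks that their gap equals the gap between the marginal entropies.

```python
import math
from collections import defaultdict

# Hypothetical toy model: T is the teacher's rubric, S is the student's grade.
FLIP = 0.1  # assumed probability that the grade deviates from the rubric's usual outcome
usual_grade = {"a": "pass", "b": "pass", "c": "fail", "d": "fail"}

# Build the joint table p(t, s) with the four rubrics equally likely.
joint = {}
for t, usual in usual_grade.items():
    other = "fail" if usual == "pass" else "pass"
    joint[(t, usual)] = 0.25 * (1 - FLIP)
    joint[(t, other)] = 0.25 * FLIP

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal over one coordinate of the (t, s) keys."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[key[axis]] += p
    return out

h_ts = entropy(joint.values())
h_t = entropy(marginal(joint, 0).values())
h_s = entropy(marginal(joint, 1).values())

# Chain rule in each ordering: H(T,S) = H(T) + H(S|T) = H(S) + H(T|S).
h_s_given_t = h_ts - h_t
h_t_given_s = h_ts - h_s

print("H(S|T) =", h_s_given_t)  # uncertainty left about the grade given the rubric
print("H(T|S) =", h_t_given_s)  # uncertainty left about the rubric given the grade
print("H(T|S) - H(S|T) =", h_t_given_s - h_s_given_t, " H(T) - H(S) =", h_t - h_s)
```

Under these assumptions the rubric carries two bits and the grade one, so the grade cannot pin down the rubric even though the rubric nearly pins down the grade.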