Shannon entropy H(X) = -sum p(x) log2 p(x) quantifies the average uncertainty, or "surprise", in a random variable X. With the logarithm taken base 2, it is measured in bits and gives the minimum average number of bits needed to encode outcomes drawn from the distribution. A fair coin has entropy 1 bit; a biased coin has less. For a fixed finite set of outcomes, entropy is maximized when all outcomes are equally likely, and it equals zero only when the outcome is certain. It is the foundational quantity of information theory, from which nearly all other measures are derived.
You know from probability that a random variable X has a distribution assigning probabilities to outcomes. Shannon's insight was to ask: how much "information" does observing an outcome of X provide? If an event has probability p, the surprise (or self-information) of seeing it is -log2(p) bits. A coin landing heads with probability 1/2 gives -log2(1/2) = 1 bit of surprise. An event with probability 1 gives zero surprise — you already knew it would happen. An event with probability 1/1024 gives 10 bits — it was deeply unexpected.
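The surprise values quoted above can be reproduced with a minimal Python sketch. The function name self_information is chosen here purely for illustration and is not taken from any library.

```python
import math


def self_information(p: float) -> float:
    """Surprise, in bits, of observing an event that has probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    # log2(1/p) equals -log2(p); this form avoids returning -0.0 when p == 1.
    return math.log2(1.0 / p)


# The three events discussed in the text: a fair coin flip, a certain event,
# and an event with probability 1/1024.
for p in (1 / 2, 1.0, 1 / 1024):
    print(f"p = {p}: {self_information(p)} bits")
```

Rarer events yield larger values, and a certain event yields zero, which matches the intuition that information is gained only when something could have turned out otherwise.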
Shannon entropy is the expected surprise: H(X) = -sum over all x of p(x) * log2(p(x)). It averages the surprise across all possible outcomes, weighted by the probability of each. For a fair coin, H = 1 bit. For a fair six-sided die, H = log2(6) ≈ 2.58 bits. For a degenerate distribution (one outcome certain), H = 0. Outcomes with probability zero are handled by the convention 0 * log2(0) = 0, which is justified because p * log2(p) approaches 0 as p approaches 0 from above.
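As a rough sketch, the entropy formula translates into a few lines of Python. The helper name entropy_bits and its input validation are assumptions made for this example; the zero-probability convention is implemented by skipping those terms.

```python
import math
from typing import Iterable


def entropy_bits(probs: Iterable[float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    probs = list(probs)
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped, which implements the convention 0 * log2(0) = 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
fair_die = [1 / 6] * 6
certain = [1.0, 0.0]
biased_coin = [0.9, 0.1]

for name, dist in [("fair coin", fair_coin), ("fair die", fair_die),
                   ("certain outcome", certain), ("biased coin", biased_coin)]:
    print(f"{name}: {entropy_bits(dist):.4f} bits")
```

The biased coin comes out below 1 bit, consistent with the claim that any departure from equal probabilities reduces entropy.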
The operational meaning of entropy is precise: it is the minimum average number of bits per symbol needed to losslessly encode a long sequence of independent draws from the distribution. If a source has entropy 2 bits per symbol, no lossless encoding scheme can compress its output to fewer than 2 bits per symbol on average, and there exist schemes, such as arithmetic coding or Huffman coding applied to long blocks of symbols, that get arbitrarily close to that limit. This is Shannon's source coding theorem, which gives entropy its concrete engineering significance.
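To make the coding bound concrete, the following sketch builds a symbol-by-symbol Huffman code and compares its average codeword length L with the entropy H. The function names and the two example sources are illustrative assumptions. For a code built one symbol at a time, the standard guarantee is H <= L < H + 1; approaching H itself generally requires coding over longer blocks.

```python
import heapq
import math
from typing import Dict


def entropy_bits(probs: Dict[str, float]) -> float:
    """Shannon entropy, in bits, of a symbol -> probability mapping."""
    return sum(p * math.log2(1.0 / p) for p in probs.values() if p > 0)


def huffman_code_lengths(probs: Dict[str, float]) -> Dict[str, int]:
    """Codeword length, in bits, that a Huffman code assigns to each symbol."""
    if len(probs) == 1:
        # A single-symbol source still needs one bit per symbol to be written down.
        return {sym: 1 for sym in probs}
    # Heap entries are (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside them
        # moves one level deeper in the code tree.
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, next_id, syms1 + syms2))
        next_id += 1
    return lengths


def average_length(probs: Dict[str, float], lengths: Dict[str, int]) -> float:
    """Expected number of bits per symbol under the given codeword lengths."""
    return sum(probs[sym] * lengths[sym] for sym in probs)


dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.6, "b": 0.3, "c": 0.1}

for source in (dyadic, skewed):
    lengths = huffman_code_lengths(source)
    print(lengths, average_length(source, lengths), entropy_bits(source))
```

For the first source, whose probabilities are all powers of 1/2, the average length equals the entropy of 1.75 bits. For the second, the Huffman code spends somewhat more than the entropy but stays within one bit of it.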
Entropy has several important properties. It is non-negative for discrete distributions. Over a fixed finite set of outcomes, it is maximized by the uniform distribution (maximum ignorance). It is concave: a mixture of distributions has at least as much entropy as the correspondingly weighted average of the individual entropies. And it is additive for independent random variables: H(X, Y) = H(X) + H(Y) when X and Y are independent. These properties make entropy a natural measure of uncertainty, and the other information-theoretic quantities (joint entropy, conditional entropy, mutual information, KL divergence) are either defined in terms of it or closely tied to it.
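These properties can be spot-checked numerically. The sketch below uses arbitrarily chosen example distributions and a mixing weight of 0.3, all of which are assumptions for illustration; passing checks on a few examples illustrate the properties but do not prove them.

```python
import math
from itertools import product
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, with zero-probability terms skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# Additivity: for independent X and Y, the joint distribution is the product
# of the marginals, and H(X, Y) should equal H(X) + H(Y).
p_x = [0.5, 0.25, 0.25]
p_y = [0.9, 0.1]
joint = [px * py for px, py in product(p_x, p_y)]
additivity_gap = entropy_bits(joint) - (entropy_bits(p_x) + entropy_bits(p_y))

# Concavity: the entropy of a mixture is at least the weighted average
# of the entropies of its components.
p = [0.9, 0.1]
q = [0.2, 0.8]
lam = 0.3
mixture = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
concavity_gap = entropy_bits(mixture) - (
    lam * entropy_bits(p) + (1 - lam) * entropy_bits(q)
)

# Maximum: no distribution on n outcomes exceeds log2(n) bits,
# and the uniform distribution attains that value.
n = len(p_x)
uniform = [1 / n] * n

print("additivity gap:", additivity_gap)
print("concavity gap:", concavity_gap)
print("H(p_x), H(uniform), log2(n):",
      entropy_bits(p_x), entropy_bits(uniform), math.log2(n))
```

The additivity gap should be zero up to floating-point rounding, and the concavity gap should be non-negative.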