The optimal learning rate for Hedge is eta = sqrt(ln(N)/T), which minimizes the regret bound. Plugging in: eta = sqrt(4.6/10000) ≈ 0.021. The resulting regret is 2 * sqrt(T * ln(N)) = 2 * sqrt(10000 * 4.6) ≈ 2 * 214 ≈ 428. The O() hides constants, but the scaling is sqrt(T * ln N) ≈ 215 (times a constant). This means the learner's average loss per round exceeds the best expert's by only about 0.04 — essentially matching the best expert's performance over 10,000 rounds, despite not knowing which expert is best in advance.
Question 2 True / False
The Hedge algorithm requires knowing the time horizon T in advance to set the optimal learning rate. Without this knowledge, the algorithm cannot achieve sublinear regret.
TTrue
FFalse
Answer: False
While the optimal eta = sqrt(ln(N)/T) does require knowing T, the 'doubling trick' eliminates this requirement. The idea: run Hedge with eta optimized for T = 1, then restart with eta optimized for T = 2, then T = 4, 8, 16, and so on, doubling the horizon each time. The total regret across all epochs is at most a constant factor worse than knowing T in advance, still achieving O(sqrt(T * ln N)). Alternatively, a time-varying learning rate eta_t = sqrt(ln(N)/t) (decreasing with t) achieves the same bound without restarts. Sublinear regret is achievable without foreknowledge of the horizon.
Question 3 True / False
If all N experts have identical cumulative loss after T rounds, the Hedge algorithm's regret is zero.
TTrue
FFalse
Answer: True
Regret is defined as the learner's cumulative loss minus the best expert's cumulative loss. If all experts have the same cumulative loss L, the best expert also has loss L. The Hedge learner, who randomizes over experts, has expected cumulative loss equal to a weighted average of expert losses. Since all expert losses are equal, the learner's expected loss each round equals each expert's loss, giving cumulative loss L. Regret = L - L = 0. This makes intuitive sense: when all experts are equally good, there is nothing to learn, and any strategy — including uniform randomization — achieves zero regret.
Question 4 Short Answer
Explain why the regret bound for the expert problem scales with ln(N) rather than N, and what this means for practical applicability with large expert pools.
Think about your answer, then reveal below.
Model answer: The logarithmic dependence on N comes from the multiplicative weight structure. Each expert starts with weight 1/N, so the initial 'cost' of including an expert is ln(N) in the potential function analysis (the log of the total weight). As the algorithm runs, it only needs to distinguish the best expert from the rest — it does not need to individually estimate each expert's quality. The multiplicative update naturally concentrates weight on good experts exponentially fast, and the regret analysis tracks the ratio of the total weight to the best expert's weight, which involves ln(N) rather than N. Practically, this means Hedge scales extraordinarily well: with N = 1,000,000 experts, the regret only increases by a factor of sqrt(ln(10^6)/ln(10)) ≈ 2.4 compared to N = 10. You can include a massive pool of candidate strategies with minimal cost.
This logarithmic scaling is why the expert problem framework is practical for model selection: you can combine thousands of base models with Hedge-like algorithms, and the regret penalty for including a bad model is negligible. The cost of having more options is essentially free.