A loop body contains four independent array reads followed by arithmetic on each. After unrolling by a factor of 4, what benefit does the compiler gain beyond simply reducing branch count?
A. The compiler can eliminate three of the four reads through common subexpression elimination
B. The larger basic block lets the compiler schedule multiple independent operations across functional units simultaneously, exploiting instruction-level parallelism
C. The loop will execute in exactly one-fourth the wall-clock time due to branch elimination alone
D. Register pressure is reduced because fewer variables are live at any point
Answer: B
Branch reduction is real but often not the main benefit. The more important effect is that unrolling creates a larger basic block: a longer straight-line sequence with no branches. The compiler's instruction scheduler can now see more independent operations at once, so it can interleave them, issue loads early to hide memory latency, and keep multiple execution units busy simultaneously. On modern processors with deep pipelines and several execution units, this exposure of instruction-level parallelism (ILP) is often where the real speedup comes from.
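To make the larger basic block concrete, here is a minimal hand-written sketch in C of roughly what unrolling by 4 produces. The function names, the `* 3 + 1` arithmetic, and the use of `restrict` (to state that the two arrays do not overlap) are assumptions chosen for illustration; a real compiler performs this transformation on its internal representation, not on source code.

```c
#include <stddef.h>

/* Rolled form: one load, one multiply-add, and one loop branch per element. */
void scale_add_rolled(const int *restrict a, int *restrict out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * 3 + 1;
    }
}

/* Hand-unrolled by 4 (illustrative only).
   Assumes n is a multiple of 4; remainder handling is omitted here. */
void scale_add_unrolled4(const int *restrict a, int *restrict out, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        /* Four independent loads, written first so they can be issued early. */
        int x0 = a[i];
        int x1 = a[i + 1];
        int x2 = a[i + 2];
        int x3 = a[i + 3];

        /* Four independent multiply-adds: none depends on another's result,
           so a scheduler is free to overlap them. */
        out[i]     = x0 * 3 + 1;
        out[i + 1] = x1 * 3 + 1;
        out[i + 2] = x2 * 3 + 1;
        out[i + 3] = x3 * 3 + 1;
    }
}
```

Both functions compute the same results. The unrolled body simply gives the scheduler twelve operations to arrange between branches instead of three.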
Question 2 Multiple Choice
A compiler unrolls a loop of 998 iterations by a factor of 4. Besides the main unrolled loop body, what else must the compiler generate?
A. Nothing extra: the compiler rounds to 996 iterations and discards the remainder silently
B. A remainder loop (epilogue) of 2 iterations to handle the 2 leftover elements that don't fit into groups of 4
C. An additional runtime branch inside the unrolled body to detect when the iteration count has been reached
D. A prologue that aligns the loop to a multiple of 4 before entering the main unrolled body
998 = 4 × 249 + 2, so the main unrolled loop covers 996 iterations and 2 remain. The compiler must generate a remainder loop (also called an epilogue) that runs these 2 leftover iterations one at a time. Silently discarding them would be a correctness bug. The compiler handles this bookkeeping automatically, but it adds complexity to the generated code and can partially offset the benefits of unrolling when the trip count is small.
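The main-loop-plus-remainder structure can be sketched by hand in C. This is an illustrative example under assumed names (`sum_unrolled4`, `main_end`), using a simple array sum as the loop body; it is not the output of any particular compiler.

```c
#include <stddef.h>

/* Sums n ints using a body unrolled by 4 plus a remainder loop. */
long sum_unrolled4(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* 996 when n == 998 */

    /* Main unrolled loop: 249 passes of 4 elements when n == 998. */
    for (; i < main_end; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop (epilogue): the 2 leftover elements when n == 998. */
    for (; i < n; i++) {
        s += a[i];
    }
    return s;
}
```

Without the second loop, the last `n % 4` elements would never be added, which is the correctness bug described above.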
Question 3 True / False
Loop unrolling can sometimes decrease performance despite reducing branch count, because duplicating the loop body increases code size and may cause instruction cache pressure.
True
False
Answer: True
The instruction cache is finite. If the unrolled loop body is too large, it may evict other frequently used code from the cache, causing cache misses that cost more than the branch overhead that was eliminated. This is why compilers use heuristics to limit the unrolling factor — typically 2, 4, or 8 for tight loops — and avoid unrolling large loop bodies. The profitability of unrolling depends on the interaction between loop body size, the target machine's cache hierarchy, and how much ILP is actually exposed.
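As a rough illustration of such a heuristic, the C sketch below picks the largest factor from 8, 4, 2, 1 whose unrolled body stays within a size budget. The function `pick_unroll_factor`, its parameters, and the budget idea are hypothetical and deliberately simplified. Real compilers use their own, more detailed cost models.

```c
/* Hypothetical size-budget heuristic, not any real compiler's cost model.
   body_insns:   estimated instruction count of one loop body
   budget_insns: maximum size allowed for the unrolled body */
unsigned pick_unroll_factor(unsigned body_insns, unsigned budget_insns) {
    unsigned factor = 8;  /* start from the largest factor considered */
    while (factor > 1 && body_insns * factor > budget_insns) {
        factor /= 2;      /* try 4, then 2, then give up at 1 (no unrolling) */
    }
    return factor;
}
```

With a budget of 64 instructions, a 5-instruction body would be unrolled by 8, a 10-instruction body by 4, and a 100-instruction body would not be unrolled at all.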
Question 4 True / False
Loop unrolling always improves performance for any loop, regardless of loop body size or trip count, because eliminating branches always saves more time than it costs.
True
False
Answer: False
Unrolling has real costs: increased code size, potential instruction cache pressure, and the overhead of a remainder loop. For loops with large bodies, unrolling may push the code out of the instruction cache, causing fetch penalties worse than the eliminated branches. For loops with very short trip counts, most or all iterations may run in the remainder loop, so the unrolled body adds code size while doing little or none of the work. Compilers apply heuristics precisely because profitability is context-dependent. There is no free lunch.
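The short-trip-count case comes down to simple arithmetic, sketched here in C. The type `unroll_split` and the function `split_trip_count` are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>

typedef struct {
    size_t main_passes;      /* trips through the unrolled body */
    size_t remainder_iters;  /* iterations left for the remainder loop */
} unroll_split;

/* Splits a trip count n between the unrolled body and the remainder loop.
   Assumes factor > 0. */
unroll_split split_trip_count(size_t n, size_t factor) {
    unroll_split s;
    s.main_passes = n / factor;
    s.remainder_iters = n % factor;
    return s;
}
```

For a trip count of 3 and a factor of 4, the unrolled body never runs and all 3 iterations fall to the remainder loop, so the duplicated code is pure overhead. For 998 iterations the split is 249 passes and 2 leftover iterations, and the remainder cost is negligible.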
Question 5 Short Answer
Explain why loop unrolling can sometimes decrease performance rather than improve it, even though it reduces the number of branch instructions executed.
Model answer: Loop unrolling increases code size by duplicating the loop body multiple times. If the unrolled body no longer fits in the instruction cache, the processor must fetch instructions from slower levels of the memory hierarchy, incurring cache miss penalties. These fetch costs can outweigh the savings from fewer branch instructions. Additionally, aggressive unrolling can increase register pressure, potentially causing the compiler to spill variables to memory. The benefit of unrolling depends on the ratio of branch overhead to loop body work, and on whether the enlarged code fits in the instruction cache.
Understanding the tradeoff between branch reduction and cache pressure is what separates mechanical application of 'unrolling is good' from genuine understanding of when to apply it. The compiler must balance the instruction-scheduling benefit of larger basic blocks against the cache cost of larger code, which is why unrolling factors rarely exceed 8 and why compilers rely on code-size heuristics, sometimes guided by profile data, before unrolling.