Intel ARCHITECTURE IA-32 User Manual

IA-32 Intel® Architecture Optimization

Mix Software Prefetch with Computation Instructions

It may seem convenient to cluster all of the prefetch instructions at the
beginning of a loop body or before a loop, but this can lead to severe
performance degradation. To achieve the best possible performance,
prefetch instructions must be interspersed with other computational
instructions in the instruction sequence rather than clustered together. If
possible, they should also be placed apart from loads. Interspersing
improves instruction-level parallelism and reduces potential instruction
resource stalls. It also reduces the pressure on memory access resources,
which in turn reduces the likelihood of a prefetch retiring without
fetching data.

Example 6-6 illustrates distributing prefetch instructions. A simple and
useful heuristic for spreading prefetches on a Pentium 4 processor is to
insert a prefetch instruction every 20 to 25 clocks. Rearranging prefetch
instructions can yield a noticeable speedup for code that stresses cache
resources.
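
The idea of spreading prefetches through the computation can also be
sketched in C. The following is a minimal illustration, not the manual's
Example 6-6. It assumes a GCC or Clang compiler (`__builtin_prefetch` is a
compiler builtin, not something this manual defines), 64-byte cache lines,
and a look-ahead distance chosen arbitrarily for the sketch. The function
name `dot_spread` and the helper `prefetch_in_range` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants only; both are assumptions to be tuned by measurement. */
#define LINE_FLOATS    16                  /* 64-byte line / 4-byte float */
#define PREFETCH_AHEAD (8 * LINE_FLOATS)   /* look-ahead distance in elements */

/* Issue a prefetch hint only when the target element lies inside the array,
   so no out-of-range pointer is ever formed. */
static inline void prefetch_in_range(const float *base, size_t idx, size_t n)
{
    if (idx < n)
        __builtin_prefetch(base + idx);
}

/* Dot product of a and b. Each outer iteration consumes one cache line
   from each array. The two prefetches are separated by half a line of
   multiply-add work instead of being issued back to back at the top of
   the loop body. */
float dot_spread(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float sum = 0.0f;
    size_t i = 0;

    for (; i + LINE_FLOATS <= n; i += LINE_FLOATS) {
        /* First prefetch, followed by computation. */
        prefetch_in_range(a, i + PREFETCH_AHEAD, n);
        for (size_t k = 0; k < LINE_FLOATS / 2; k++)
            sum += a[i + k] * b[i + k];

        /* Second prefetch, issued only after some work has been done. */
        prefetch_in_range(b, i + PREFETCH_AHEAD, n);
        for (size_t k = LINE_FLOATS / 2; k < LINE_FLOATS; k++)
            sum += a[i + k] * b[i + k];
    }

    /* Remainder that does not fill a whole cache line. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The amount of work placed between the two prefetches here is only a
stand-in for the 20 to 25 clock spacing. The actual spacing depends on the
generated code and the target processor, so it should be checked by
measurement rather than assumed.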
