Intel Architecture IA-32 User Manual: Minimize Number of Software Prefetches


Chapter 6: Optimizing Cache Usage

Minimize Number of Software Prefetches

Prefetch instructions are not completely free in terms of bus cycles,
machine cycles, and resources, even though they require minimal clock
cycles and memory bandwidth.

Excessive prefetching may lead to performance penalties because of issue
penalties in the front end of the machine and/or resource contention in
the memory subsystem. This effect may be severe when the target loops
are small and/or issue-bound.

One way to address excessive prefetching is to unroll and/or
software-pipeline the loops to reduce the number of prefetches required.
Figure 6-4 presents a code example that implements prefetching and
unrolls the loop to remove the redundant prefetch instructions, that is,
those whose addresses fall in cache lines already requested by
previously issued prefetches. In this particular example, unrolling the
original loop once saves six prefetch instructions and nine instructions
for conditional jumps in every other iteration.
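The savings quoted above can be checked with a little arithmetic. The
sketch below is only an illustration: it assumes the original loop
carries two prefetches and three loop-control instructions (add, cmp,
jl) per iteration, as in the listing of Figure 6-4, and that the
unrolled loop keeps a single copy of each. The name instructions_saved
and the unroll factor passed to it are hypothetical, not taken from the
manual.

```c
#include <assert.h>

/* Per-iteration counts read off the original loop in Figure 6-4
   (assumption: "instructions for conditional jumps" means add/cmp/jl). */
enum { PREFETCHES_PER_ITER = 2, LOOP_CONTROL_PER_ITER = 3 };

/* Instructions removed when `factor` original iterations are merged into
   one unrolled iteration that keeps only one copy of the per-iteration
   overhead. Hypothetical helper, for illustration only. */
int instructions_saved(int per_iter, int factor)
{
    assert(factor >= 1);
    return per_iter * (factor - 1);
}
```

Under these assumptions, a factor of 4 gives
instructions_saved(PREFETCHES_PER_ITER, 4) == 6 and
instructions_saved(LOOP_CONTROL_PER_ITER, 4) == 9, which matches the
numbers in the text. The listing in Figure 6-4 advances esi by 128
rather than 16, which would correspond to a factor of 8 and, with the
same counting, 14 and 21.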

Figure 6-4 Prefetch and Loop Unrolling

Original loop:

    top_loop:
        prefetchnta [edx+esi+32]        ; prefetch 32 bytes ahead, stream 1
        prefetchnta [edx*4+esi+32]      ; prefetch 32 bytes ahead, stream 2
        . . . . .
        movaps xmm1, [edx+esi]
        movaps xmm2, [edx*4+esi]
        . . . . .
        add esi, 16                     ; advance by one 16-byte element
        cmp esi, ecx
        jl top_loop

Unrolled loop (one unrolled iteration):

    top_loop:
        prefetchnta [edx+esi+128]       ; one prefetch per stream per pass
        prefetchnta [edx*4+esi+128]
        . . . . .
        movaps xmm1, [edx+esi]
        movaps xmm2, [edx*4+esi]
        . . . . .
        movaps xmm1, [edx+esi+16]
        movaps xmm2, [edx*4+esi+16]
        . . . . .
        movaps xmm1, [edx+esi+96]
        movaps xmm2, [edx*4+esi+96]
        . . . . .
        . . . . .
        add esi, 128                    ; advance by 128 bytes per pass
        cmp esi, ecx
        jl top_loop
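The same idea can be sketched in C. This is not the manual's code: it is
a minimal illustration that assumes a GCC- or Clang-compatible compiler
(for __builtin_prefetch, whose locality argument 0 requests a
non-temporal prefetch), a 64-byte cache line, 4-byte floats, and a
prefetch distance of two lines. The names dot_naive, dot_unrolled,
prefetch_nta, and prefetch_count are hypothetical, and the counter
exists only so that the two variants can be compared.

```c
#include <assert.h>
#include <stddef.h>

enum { LINE_FLOATS = 16 };          /* assumption: 64-byte line, 4-byte float */
enum { AHEAD = 2 * LINE_FLOATS };   /* assumption: prefetch two lines ahead   */

size_t prefetch_count;              /* instrumentation for this sketch only   */

static void prefetch_nta(const float *p)
{
    __builtin_prefetch(p, 0, 0);    /* read access, no temporal locality */
    prefetch_count++;
}

/* Baseline: a prefetch pair on every element. Most of these target a
   cache line that an earlier iteration has already requested. */
float dot_naive(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {        /* stay inside the arrays */
            prefetch_nta(a + i + AHEAD);
            prefetch_nta(b + i + AHEAD);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

/* Unrolled by one cache line: a single prefetch pair per block of
   LINE_FLOATS elements, followed by the work for the whole block. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
    for (; i + LINE_FLOATS <= n; i += LINE_FLOATS) {
        if (i + AHEAD < n) {
            prefetch_nta(a + i + AHEAD);
            prefetch_nta(b + i + AHEAD);
        }
        for (size_t k = 0; k < LINE_FLOATS; k++)  /* fixed trip count */
            sum += a[i + k] * b[i + k];
    }
    for (; i < n; i++)              /* remainder, no prefetch needed */
        sum += a[i] * b[i];
    return sum;
}
```

For two 64-element arrays, the baseline issues 64 prefetches while the
unrolled version issues 4, and both return the same sum. The right
block size and prefetch distance depend on the target processor, so the
constants here should be treated as placeholders to be tuned.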
