Example e-1 – Intel ARCHITECTURE IA-32 User Manual



Mathematics of Prefetch Scheduling Distance


data transfer latency which is equal to number of lines
per iteration * line burst latency

Note that the potential effects of µop reordering are not factored into the
estimations discussed.

Examine Example E-1, which uses the prefetchnta instruction with a
prefetch scheduling distance of 3, that is, psd = 3. The data prefetched in
iteration i will actually be used in iteration i+3. Tc represents the
cycles needed to execute top_loop, assuming all memory accesses hit L1,
while il (iteration latency) represents the cycles needed to execute this
loop with the actual run-time memory footprint. Tc can be determined by
computing the critical path latency of the code dependency graph. This
work is quite arduous without help from special performance
characterization tools or compilers. A simple heuristic for estimating the
Tc value is to count the number of instructions in the critical path and
multiply that number by an artificial CPI. A reasonable CPI value
would be somewhere between 1.0 and 1.5, depending on the quality of
code scheduling.
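The heuristic above can be sketched in C as follows. This is only an
illustration under stated assumptions: the function name estimate_tc is
hypothetical, and the instruction count used in the note after the code is
an assumed value, not one measured from Example E-1.

```c
#include <assert.h>

/* Heuristic estimate of the computation latency in cycles:
   number of instructions on the critical path multiplied by an
   assumed (artificial) CPI. The function name is hypothetical. */
double estimate_tc(unsigned critical_path_instructions, double assumed_cpi)
{
    assert(assumed_cpi > 0.0);  /* a CPI must be positive */
    return (double)critical_path_instructions * assumed_cpi;
}
```

For an assumed critical path of 8 instructions, calling estimate_tc with
CPI values of 1.0 and 1.5 brackets the estimate between 8 and 12 cycles.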

Example E-1 Calculating Insertion for Scheduling Distance of 3

top_loop:
    prefetchnta [edx+esi+32*3]      ; prefetch data used 3 iterations ahead
    prefetchnta [edx*4+esi+32*3]
    . . . . .
    movaps xmm1, [edx+esi]          ; load data for the current iteration
    movaps xmm2, [edx*4+esi]
    movaps xmm3, [edx+esi+16]
    movaps xmm4, [edx*4+esi+16]
    . . . . .
    . . .
    add esi, 32                     ; advance 32 bytes per iteration
    cmp esi, ecx
    jl top_loop
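The address arithmetic in Example E-1 can be sketched in C as follows. The
loop advances esi by 32 bytes per iteration, and each prefetch adds 32*3
bytes, so a prefetch issued in iteration i targets the data loaded in
iteration i+3. The helper names prefetch_offset and consuming_iteration are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset added to the current address so that the prefetch issued
   in iteration i targets the data loaded in iteration i + psd.
   Hypothetical helper, not part of the example itself. */
size_t prefetch_offset(size_t bytes_per_iteration, unsigned psd)
{
    assert(bytes_per_iteration > 0);  /* the loop must advance each iteration */
    return bytes_per_iteration * (size_t)psd;
}

/* Iteration in which data prefetched during iteration i is consumed. */
unsigned consuming_iteration(unsigned i, unsigned psd)
{
    return i + psd;
}
```

With the values from the example, prefetch_offset(32, 3) yields 96 bytes,
which matches the 32*3 displacement in the prefetchnta operands.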
