IA-32 Intel® Architecture Optimization


lines of data per iteration. The PSD would need to be increased or decreased if more or fewer than two cache lines are used per iteration.
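The relationship between PSD and cache-line consumption can be sketched in C. This is an illustrative sketch, not the manual's code: the function name, the PSD value of 3, and the 64-byte cache-line size are assumptions; the point is only that a loop consuming two cache lines per iteration must also prefetch two lines, PSD iterations ahead.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define PSD        3    /* assumed prefetch scheduling distance, in iterations */

/* Each iteration consumes two cache lines (128 bytes) of float data,
   so each iteration also prefetches two lines, PSD iterations ahead. */
float sum_with_prefetch(const float *data, size_t n_floats)
{
    float acc = 0.0f;
    size_t per_iter = 2 * CACHE_LINE / sizeof(float);  /* 32 floats */
    for (size_t i = 0; i < n_floats; i += per_iter) {
        /* prefetch both cache lines of the iteration PSD steps ahead;
           prefetches past the end of the array are harmless (non-faulting) */
        _mm_prefetch((const char *)&data[i] + PSD * 2 * CACHE_LINE,
                     _MM_HINT_NTA);
        _mm_prefetch((const char *)&data[i] + PSD * 2 * CACHE_LINE + CACHE_LINE,
                     _MM_HINT_NTA);
        for (size_t j = 0; j < per_iter && i + j < n_floats; j++)
            acc += data[i + j];
    }
    return acc;
}
```

If an iteration were reworked to touch three cache lines instead of two, both the prefetch stride and the number of prefetches per iteration would change accordingly.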

Software Prefetch Concatenation

Maximum performance is achieved when the execution pipeline runs at maximum throughput, without incurring memory latency penalties. This can be accomplished by prefetching, within each loop iteration, the data to be used in successive iterations. De-pipelining memory generates bubbles in the execution pipeline. To illustrate this performance issue, consider a 3D geometry pipeline that processes 3D vertices in strip format. A strip contains a list of vertices whose predefined vertex order forms contiguous triangles. With an ineffective prefetch arrangement, the memory pipe is de-pipelined at each strip boundary: the execution pipeline stalls for the first two iterations of each strip, and as a result the average latency for completing an iteration rises to 165 clocks. (See Appendix E, "Mathematics of Prefetch Scheduling Distance," for a detailed description of the memory pipeline.)
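The concatenation idea can be sketched in C: when the prefetch address for the current strip would run past the strip's end (where it does no useful work), the tail iterations instead prefetch the first cache lines of the next strip, so the memory pipe stays full across the boundary. The strip layout, function name, and PSD value below are illustrative assumptions, not the manual's code; strip lengths are assumed to be whole cache lines.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */

#define LINE 64              /* assumed cache-line size in bytes */
#define PSD_BYTES (3 * LINE) /* assumed prefetch distance in bytes */

float process_strips(const float *base, const size_t *strip_off,
                     const size_t *strip_len, size_t n_strips)
{
    float acc = 0.0f;
    for (size_t s = 0; s < n_strips; s++) {
        const float *strip = base + strip_off[s];
        size_t bytes = strip_len[s] * sizeof(float);
        for (size_t off = 0; off < bytes; off += LINE) {
            size_t pf = off + PSD_BYTES;
            if (pf < bytes) {
                /* prefetch target still inside the current strip */
                _mm_prefetch((const char *)strip + pf, _MM_HINT_NTA);
            } else if (s + 1 < n_strips) {
                /* concatenate: warm the start of the NEXT strip instead
                   of issuing a useless past-the-end prefetch */
                _mm_prefetch((const char *)(base + strip_off[s + 1])
                                 + (pf - bytes), _MM_HINT_NTA);
            }
            for (size_t j = 0; j < LINE / sizeof(float); j++)
                acc += strip[off / sizeof(float) + j];
        }
    }
    return acc;
}
```

Without the `else` branch, the first PSD_BYTES of every strip would arrive un-prefetched, which is exactly the two-iteration stall described above.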

Example 6-3  Prefetch Scheduling Distance

top_loop:
    prefetchnta [edx + esi + 128*3]
    prefetchnta [edx*4 + esi + 128*3]
    . . . . .
    movaps xmm1, [edx + esi]
    movaps xmm2, [edx*4 + esi]
    movaps xmm3, [edx + esi + 16]
    movaps xmm4, [edx*4 + esi + 16]
    . . . . .
    . . . . .
    add esi, 128
    cmp esi, ecx
    jl top_loop
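A C rendering of Example 6-3 may clarify the addressing pattern: two streams (reached through `edx + esi` and `edx*4 + esi`) are consumed 128 bytes per iteration while each stream is prefetched three iterations (128*3 bytes) ahead. This sketch is an interpretation, not the manual's code; the two base pointers, element type, and function name are assumptions.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */

/* Sum floats from two streams, mirroring the two address streams
   [edx + esi] and [edx*4 + esi] of Example 6-3 as separate pointers. */
float process_streams(const float *a, const float *b, size_t bytes)
{
    float acc = 0.0f;
    for (size_t off = 0; off < bytes; off += 128) {
        /* prefetch 128*3 bytes (three iterations) ahead, per stream;
           prefetches past the end of a buffer are non-faulting */
        _mm_prefetch((const char *)a + off + 128 * 3, _MM_HINT_NTA);
        _mm_prefetch((const char *)b + off + 128 * 3, _MM_HINT_NTA);
        for (size_t i = 0; i < 128 / sizeof(float); i++) {
            acc += a[off / sizeof(float) + i];
            acc += b[off / sizeof(float) + i];
        }
    }
    return acc;
}
```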
