Intel ARCHITECTURE IA-32 User Manual


Optimizing Cache Usage


This memory de-pipelining creates inefficiency in both the memory
pipeline and execution pipeline. This de-pipelining effect can be
removed by applying a technique called prefetch concatenation. With
this technique, the memory access and execution can be fully pipelined
and fully utilized.

For nested loops, memory de-pipelining can occur in the interval
between the last iteration of an inner loop and the next iteration of its
associated outer loop. Without careful prefetch insertion, the loads
from the first iteration of the inner loop can miss the cache and stall
the execution pipeline while waiting for data to return, degrading
performance.

In the code of Example 6-4, the cache line containing a[ii][0] is never
prefetched and always misses the cache (assuming that no part of the
array a[][] already resides in the cache). The penalty of the memory
de-pipelining stalls can be amortized across the inner loop iterations.
However, the penalty can become severe when the inner loop is short.
In addition, the prefetches issued during the last PSD iterations of the
inner loop are wasted and consume machine resources. Prefetch
concatenation is introduced here to eliminate this memory
de-pipelining performance issue.
