Avoid excessive software prefetches, Improve effective latency of cache misses – Intel ARCHITECTURE IA-32 User Manual


IA-32 Intel® Architecture Optimization

Avoid Excessive Software Prefetches

Pentium 4 and Intel Xeon Processors have an automatic hardware
prefetcher. It can bring data and instructions into the unified
second-level cache based on prior reference patterns. In most situations,
the hardware prefetcher is likely to reduce system memory latency
without explicit intervention from software prefetches. It is also
preferable to adjust data access patterns in the code to take advantage of
the characteristics of the automatic hardware prefetcher to improve
locality or mask memory latency. Using software prefetch instructions
excessively or indiscriminately inevitably causes performance penalties,
because the unnecessary prefetches waste the command and data bandwidth
of the system bus.
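
As an illustration of adjusting a data access pattern to suit the hardware prefetcher, the following minimal sketch sums the same row-major matrix in two loop orders. The function names and the flat-array layout are assumptions made for this example and do not come from the manual. The row-first version touches consecutive addresses, which is the kind of regular pattern a hardware prefetcher is expected to follow; the column-first version jumps by a full row between consecutive accesses. No timing figures are claimed here; the actual difference depends on the processor and the matrix size.

```c
#include <stddef.h>

/* Hypothetical helper: walks a row-major matrix column by column.
 * Consecutive accesses are cols * sizeof(int) bytes apart, so each
 * access may land in a different cache line. */
long sum_column_first(const int *m, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}

/* Hypothetical helper: walks the same matrix row by row.
 * Consecutive accesses are adjacent in memory (unit stride). */
long sum_row_first(const int *m, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}
```

Both functions return the same result; only the order of the memory references differs.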

Software prefetches also delay the hardware prefetcher from starting to
fetch data needed by the processor core. They consume critical
execution resources and can result in stalled execution. The guidelines
for using software prefetch instructions are described in Chapter 2.
Techniques for using the automatic hardware prefetcher are discussed in
Chapter 6.

User/Source Coding Rule 28. (M impact, L generality) Avoid excessive use
of software prefetch instructions and allow the automatic hardware
prefetcher to work. Excessive software prefetching can significantly and
unnecessarily increase bus utilization.
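
The rule can be pictured with a minimal sketch that contrasts two ways of summing a sequentially accessed array. Everything named here is an assumption for illustration: the `PREFETCH` macro wraps the GCC/Clang builtin `__builtin_prefetch` (it becomes a no-op on other compilers), and `PREFETCH_AHEAD` is an arbitrary, untuned distance. The first function issues one software prefetch per element, so several prefetches target each cache line of a stream that is already sequential. The second simply leaves the unit-stride stream to the hardware prefetcher.

```c
#include <stddef.h>

/* Assumption: GCC/Clang builtin; other compilers get a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical prefetch distance in elements; not a tuned value. */
#define PREFETCH_AHEAD 64

/* Excessive: one software prefetch per element of a sequential
 * stream, so the same cache line is requested many times. */
long sum_with_excess_prefetch(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            PREFETCH(&a[i + PREFETCH_AHEAD]);
        sum += a[i];
    }
    return sum;
}

/* In line with the rule: a plain unit-stride loop with no software
 * prefetches, leaving the stream to the hardware prefetcher. */
long sum_sequential(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

The two functions compute the same sum; the difference is only in how many prefetch instructions they issue.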

Improve Effective Latency of Cache Misses

System memory access latency due to cache misses is affected by bus
traffic. This is because bus read requests must be arbitrated along with
other requests for bus transactions. Reducing the number of outstanding
bus transactions helps improve effective memory access latency.

One technique to improve effective latency of memory read transactions
is to use multiple overlapping bus reads to reduce the latency of sparse
reads. In situations where there is little locality of data or when memory
reads need to be arbitrated with other bus transactions, the effective
