Hardware prefetching and cache blocking techniques – Intel ARCHITECTURE IA-32 User Manual


Optimizing Cache Usage


Table 6-1 summarizes the steps of the basic usage model that
incorporates only software prefetch with strip-mining. The steps are:

• Do strip-mining: partition loops so that the dataset fits into
second-level cache.

• Use prefetchnta if the data is only used once or the dataset fits
into 32K (one way of second-level cache). Use prefetcht0 if the
dataset exceeds 32K.
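
The two steps above can be sketched in C as follows. This is a minimal illustration, not the manual's own listing: the names scale_and_sum, prefetch_line, STRIP_ELEMS, LINE_ELEMS and AHEAD_ELEMS are hypothetical, and the numeric values are assumptions to be tuned. The sketch uses the GCC/Clang builtin __builtin_prefetch rather than the instructions directly; on x86 these compilers typically emit prefetchnta for locality 0 and prefetcht0 for locality 3, but that mapping should be checked in the generated code.

```c
#include <stddef.h>

#define STRIP_ELEMS   4096           /* assumed strip length in floats (16 KB) */
#define LINE_ELEMS    16             /* assumed 64-byte line / 4-byte float */
#define AHEAD_ELEMS   64             /* assumed prefetch distance, in elements */
#define ONE_WAY_BYTES (32u * 1024u)  /* the 32K threshold from the text */

/* Hypothetical helper: issue one prefetch with the hint the steps suggest. */
static void prefetch_line(const float *p, int use_nta)
{
    if (use_nta)
        __builtin_prefetch(p, 0, 0);  /* read, no temporal locality */
    else
        __builtin_prefetch(p, 0, 3);  /* read, high temporal locality */
}

/* Scale every element, then sum it: two passes over each strip. */
double scale_and_sum(float *data, size_t n, float scale)
{
    /* Step 2: non-temporal hint if the dataset fits into 32K, else T0-style. */
    int use_nta = (n * sizeof(float) <= ONE_WAY_BYTES);
    double sum = 0.0;

    /* Step 1: strip-mining, so both passes touch a cache-sized piece. */
    for (size_t base = 0; base < n; base += STRIP_ELEMS) {
        size_t end = (base + STRIP_ELEMS < n) ? base + STRIP_ELEMS : n;

        /* Pass 1: first touch of the strip; prefetch once per assumed line. */
        for (size_t i = base; i < end; i++) {
            if (i % LINE_ELEMS == 0 && i + AHEAD_ELEMS < n)
                prefetch_line(&data[i + AHEAD_ELEMS], use_nta);
            data[i] *= scale;
        }

        /* Pass 2: re-reads the strip that pass 1 just brought in. */
        for (size_t i = base; i < end; i++)
            sum += data[i];
    }
    return sum;
}
```

The prefetch only changes timing, never results, so the function returns the same value with the prefetch_line call removed; that makes it easy to measure whether the hint helps on a given platform.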

The above steps are platform-specific and provide an implementation
example. The variables NUM_STRIP and MAX_NUM_VX_PER_STRIP can be
heuristically determined for peak performance for a specific
application on a specific platform.
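
One possible starting point for that tuning is to derive both variables from a cache budget and then measure around the result. The sketch below is an assumption, not a rule from the manual: the helper names max_vx_per_strip and num_strip, and the idea of a fixed byte budget per strip, are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: largest vertex count whose strip fits in the budget.
   bytes_per_vx must be nonzero. */
size_t max_vx_per_strip(size_t cache_budget_bytes, size_t bytes_per_vx)
{
    size_t n = cache_budget_bytes / bytes_per_vx;
    return n > 0 ? n : 1;  /* always process at least one vertex per strip */
}

/* Hypothetical helper: number of strips needed to cover all vertices,
   rounding up so a final partial strip is counted. vx_per_strip must be
   nonzero. */
size_t num_strip(size_t total_vx, size_t vx_per_strip)
{
    return (total_vx + vx_per_strip - 1) / vx_per_strip;
}
```

For example, a 256 KB budget with 32-byte vertices gives 8192 vertices per strip, and 20000 vertices then need 3 strips. Treat these values as initial guesses and time several strip sizes around them on the target platform.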

Hardware Prefetching and Cache Blocking Techniques

Tuning data access patterns for the automatic hardware prefetch
mechanism can minimize the memory access cost of the first pass over
read-multiple-times references and of some read-once references.
A matrix or image transpose illustrates read-once memory references:
it reads in column-first orientation and writes in row-first
orientation, or vice versa.
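
The transpose access pattern described above can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the manual's own listing: the function names and the tile edge BLOCK are hypothetical, and the blocked variant shows one common form of cache blocking rather than a tuned implementation.

```c
#include <stddef.h>

#define BLOCK 8  /* assumed tile edge in elements; tune per platform */

/* Naive transpose of a rows x cols matrix into a cols x rows matrix.
   Reads of src are sequential; writes to dst are `rows` elements apart. */
void transpose_naive(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* Blocked transpose: same result, but each BLOCK x BLOCK tile is finished
   before moving on, so the strided writes stay within a small region. */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t rb = 0; rb < rows; rb += BLOCK) {
        size_t rend = (rb + BLOCK < rows) ? rb + BLOCK : rows;
        for (size_t cb = 0; cb < cols; cb += BLOCK) {
            size_t cend = (cb + BLOCK < cols) ? cb + BLOCK : cols;
            for (size_t r = rb; r < rend; r++)
                for (size_t c = cb; c < cend; c++)
                    dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

Both functions produce identical output; they differ only in the order in which memory is touched, which is the property that blocking and the hardware prefetcher depend on.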

Example 6-9 shows a nested loop of data movement that represents a
typical matrix/image transpose problem. If the dimensions of the array
are large, not only will the footprint of the dataset exceed the
last-level cache, but cache misses will also occur at large strides.
If the dimensions

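Table 6-1  Software Prefetching Considerations into Strip-mining Code

+------------------------+-----------------------------------------------------------------+
| Read-Once Array        | Read-Multiple-Times Array References                            |
| References             +--------------------------------+--------------------------------+
|                        | Adjacent Passes                | Non-Adjacent Passes            |
+------------------------+--------------------------------+--------------------------------+
| Prefetchnta            | Prefetch0, SM1                 | (2nd Level Pollution)          |
+------------------------+--------------------------------+--------------------------------+
| Evict one way;         | Pay memory access cost for the | Pay memory access cost for the |
| Minimize pollution     | first pass of each array;      | first pass of every strip;     |
|                        | Amortize the first pass with   | Amortize the first pass with   |
|                        | subsequent passes              | subsequent passes              |
+------------------------+--------------------------------+--------------------------------+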
