Optimizing memory copy routines, Example 6-10, Basic algorithm of a simple memory copy -46 – Intel ARCHITECTURE IA-32 User Manual

IA-32 Intel® Architecture Optimization

Later, the processor re-reads the data using prefetchnta, which ensures
maximum bandwidth yet minimizes disturbance of other cached temporal
data by using the non-temporal (NTA) version of prefetch.

Conclusions from Video Encoder and Decoder Implementation

These two examples indicate that, by using an appropriate combination
of non-temporal prefetches and non-temporal stores, an application can
be designed to lessen the overhead of memory transactions: it prevents
second-level cache pollution, keeps useful data in the second-level
cache, and reduces costly write-back transactions. Even if an
application does not gain significant performance from having data
ready from prefetches, it can benefit from more efficient use of the
second-level cache and memory. Such a design reduces the encoder’s
demand for critical resources such as the memory bus. This makes the
system more balanced, resulting in higher performance.

Optimizing Memory Copy Routines

Creating memory copy routines for large amounts of data is a common
task in software optimization.

Example 6-10 presents a basic algorithm for a simple memory copy.
This task can be optimized using various coding techniques. One
technique uses software prefetch and streaming store instructions. It is
discussed in the following paragraph, and a code example is shown in
Example 6-11.

Example 6-10 Basic Algorithm of a Simple Memory Copy

#define N 512000

double a[N], b[N];

for (i = 0; i < N; i++) {
    b[i] = a[i];
}
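
As a rough illustration of the software prefetch and streaming store technique mentioned above, the following sketch rewrites the basic loop with SSE2 intrinsics. It is not the manual's Example 6-11. The function name copy_streaming, the 64-byte cache line size, and the prefetch distance are assumptions chosen for illustration and would need tuning for a particular processor. When SSE2 is unavailable or the buffers are not 16-byte aligned, the sketch falls back to the plain loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_load_pd, _mm_stream_pd, _mm_prefetch, _mm_sfence */
#endif

#define LINE_DOUBLES  8   /* assumption: 64-byte cache line = 8 doubles */
#define AHEAD_DOUBLES 64  /* assumption: prefetch 8 lines ahead; tune per platform */

/* Hypothetical helper: copies n doubles from src to dst (buffers must not overlap). */
void copy_streaming(double *dst, const double *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Aligned loads and streaming stores require 16-byte aligned addresses. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) == 0) {
        for (; i + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
            /* Non-temporal prefetch of a later source line, kept in bounds. */
            if (i + AHEAD_DOUBLES < n)
                _mm_prefetch((const char *)(src + i + AHEAD_DOUBLES), _MM_HINT_NTA);
            /* Streaming stores write the line without allocating it in cache. */
            for (size_t j = 0; j < LINE_DOUBLES; j += 2)
                _mm_stream_pd(dst + i + j, _mm_load_pd(src + i + j));
        }
        _mm_sfence();  /* order the streaming stores before later stores */
    }
#endif
    /* Remaining elements, or the whole copy on the fallback path. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

With the arrays from Example 6-10, the call would be copy_streaming(b, a, N). Whether this beats the plain loop depends on the buffer size relative to the caches and on the processor, so it should be measured, not assumed.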
