Example 6-11, A Memory Copy Routine Using Software Prefetch – Intel ARCHITECTURE IA-32 User Manual


IA-32 Intel® Architecture Optimization


Using the 8-byte Streaming Stores and Software Prefetch

Example 6-11 presents a copy algorithm that uses the second-level cache.
The algorithm performs the following steps:

1. Uses a blocking technique to transfer 8-byte data from memory into
   the second-level cache using the _mm_prefetch intrinsic, 128 bytes
   at a time, to fill a block. The size of a block should be less than
   one half of the size of the second-level cache, but large enough to
   amortize the cost of the loop.

2. Loads the data into an xmm register using the _mm_load_ps intrinsic.

3. Transfers the 8-byte data to a different memory location via the
   _mm_stream intrinsics, bypassing the cache. For this operation, it
   is important to ensure that the page table entry prefetched for the
   memory is preloaded in the TLB.
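
The three steps can be sketched as one self-contained C function. This is a minimal illustration under stated assumptions, not the manual's listing: the name copy_prefetch_stream, the constants BLOCK_DOUBLES and LINE_DOUBLES, the requirement that both pointers are 16-byte aligned, and the requirement that n is a multiple of the block size are all hypothetical choices made for the sketch. The TLB priming is simplified to a single read of the first element of each source block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define BLOCK_DOUBLES 512  /* assumed block size: one 4 KB page of doubles */
#define LINE_DOUBLES   16  /* 16 doubles = 128 bytes per prefetch (step 1) */

/* Hypothetical helper: copies n doubles from src to dst.
   Assumes 16-byte-aligned pointers and n a multiple of BLOCK_DOUBLES. */
static void copy_prefetch_stream(double *dst, const double *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(n % BLOCK_DOUBLES == 0);

    for (size_t kk = 0; kk < n; kk += BLOCK_DOUBLES) {
        /* Simplified TLB priming: touch the block once so its page
           translation is loaded before the prefetches are issued. */
        (void)*(volatile const double *)&src[kk];

        /* Step 1: prefetch the whole block, 128 bytes per iteration. */
        for (size_t j = kk; j < kk + BLOCK_DOUBLES; j += LINE_DOUBLES)
            _mm_prefetch((const char *)&src[j], _MM_HINT_NTA);

        /* Steps 2 and 3: load 16 bytes (two doubles) into an xmm
           register, then write them with a non-temporal store. */
        for (size_t j = kk; j < kk + BLOCK_DOUBLES; j += 2) {
            __m128 v = _mm_load_ps((const float *)&src[j]);
            _mm_stream_ps((float *)&dst[j], v);
        }
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

A caller would allocate both arrays with 16-byte (or stricter) alignment, for example with _mm_malloc, and pass a length that is a whole number of blocks. Whether this layout beats a plain memcpy depends on the processor and on the block size chosen, so it is best treated as a starting point for measurement.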

Example 6-11 A Memory Copy Routine Using Software Prefetch

#define PAGESIZE 4096
#define NUMPERPAGE 512    // # of elements to fit a page

double a[N], b[N], temp;

for (kk=0; kk<N; kk+=NUMPERPAGE) {
    // TLB priming (note: on the last block this index reaches N,
    // so guard the read or pad a[] in real code)
    temp = a[kk+NUMPERPAGE];

    // use block size = page size,
    // prefetch entire block, one cache line per loop
    for (j=kk+16; j<kk+NUMPERPAGE; j+=16) {
        _mm_prefetch((char*)&a[j], _MM_HINT_NTA);
    }

continued
