Example 4-29 – Intel ARCHITECTURE IA-32 User Manual


IA-32 Intel® Architecture Optimization


SSE3 provides the LDDQU instruction for loading from memory
addresses that are not 16-byte aligned. LDDQU is a special 128-bit
unaligned load designed to avoid cache line splits. If the address of the
load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes
requested. If the address of the load is not aligned on a 16-byte
boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned
address immediately below the address of the load request, and then
provides the requested 16 bytes. If the address is aligned on a 16-byte
boundary, the effective number of memory requests is implementation
dependent (one or more).
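
The address arithmetic described above can be sketched in portable C. This is only an illustrative model of which bytes are touched, not a description of the hardware implementation or of its cache behavior. The names `lddqu_model`, `mem`, and `addr` are hypothetical, and `mem` stands in for a simulated memory in which the array index plays the role of the address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: copy the 16 bytes at simulated address `addr`
 * the way the text describes LDDQU doing it. `mem` is a simulated
 * memory; its index is treated as the address. For an unaligned
 * `addr`, `mem` must hold at least 32 bytes from the aligned base. */
static void lddqu_model(uint8_t out[16], const uint8_t *mem, size_t addr)
{
    assert(out != NULL && mem != NULL);

    /* 16-byte aligned address at or immediately below addr */
    size_t base = addr & ~(size_t)15;
    size_t offset = addr - base;

    if (offset == 0) {
        /* Aligned case: only the 16 requested bytes are loaded. */
        memcpy(out, mem + base, 16);
    } else {
        /* Unaligned case: load a 32-byte block from the aligned base,
         * then hand back the requested 16 bytes from inside it. */
        uint8_t block[32];
        memcpy(block, mem + base, 32);
        memcpy(out, block + offset, 16);
    }
}
```

For example, a request at simulated address 5 reads the block covering addresses 0 through 31 and returns bytes 5 through 20.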

LDDQU is intended for loading data from memory when modified data is
not stored back to the same address. Its use should therefore be
restricted to situations where no store-to-load forwarding is expected.
Where store-to-load forwarding is expected, use regular store/load pairs
(aligned or unaligned, according to the alignment of the data accessed).

Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits

// Average half-pels horizontally (on the "x" axis),
// from one reference frame only.
nextLinesLoop:
    lddqu  xmm0, XMMWORD PTR [edx]          // may not be 16B aligned
    lddqu  xmm1, XMMWORD PTR [edx+1]        // same line, one pixel to the right
    lddqu  xmm2, XMMWORD PTR [edx+eax]      // next line (eax = line stride)
    lddqu  xmm3, XMMWORD PTR [edx+eax+1]
    pavgb  xmm0, xmm1                       // rounded byte average, first line
    pavgb  xmm2, xmm3                       // rounded byte average, next line
    movdqa XMMWORD PTR [ecx], xmm0          // results stored elsewhere
    movdqa XMMWORD PTR [ecx+eax], xmm2
    // (repeat ...)
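
The computation in Example 4-29 can also be sketched in C. This is a minimal sketch under stated assumptions, not the manual's own code: the function names `avg_halfpel_16` and `avg_halfpel_two_rows` and the parameter `stride` are hypothetical, `src`, `stride`, and `dst` stand in for `edx`, `eax`, and `ecx`, and the destination is written with an unaligned store so that the sketch does not depend on the 16-byte alignment that `movdqa` requires. When the compiler targets SSE3, the sketch uses the `_mm_lddqu_si128` and `_mm_avg_epu8` intrinsics; otherwise it falls back to a scalar loop that computes the same rounded average, `(a + b + 1) >> 1`, that PAVGB produces for each byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 (also pulls in SSE2) */
#endif

/* Hypothetical helper: average 16 pixels with their right-hand
 * neighbors. Reads src[0..16], writes dst[0..15]. */
static void avg_halfpel_16(uint8_t *dst, const uint8_t *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE3__)
    /* Both loads may be unaligned, which is the case LDDQU targets. */
    __m128i left  = _mm_lddqu_si128((const __m128i *)src);
    __m128i right = _mm_lddqu_si128((const __m128i *)(src + 1));
    /* Assumption: unaligned store, so dst need not be 16-byte aligned. */
    _mm_storeu_si128((__m128i *)dst, _mm_avg_epu8(left, right));
#else
    /* Scalar fallback: same rounded unsigned byte average as PAVGB. */
    for (size_t i = 0; i < 16; i++)
        dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1);
#endif
}

/* One iteration of the loop body: two consecutive lines, `stride`
 * bytes apart in both the source and the destination. */
static void avg_halfpel_two_rows(uint8_t *dst, const uint8_t *src,
                                 size_t stride)
{
    avg_halfpel_16(dst, src);
    avg_halfpel_16(dst + stride, src + stride);
}
```

A caller would invoke `avg_halfpel_two_rows(dst, src, stride)` once per pair of lines, advancing both pointers by `2 * stride` between calls. The results go to a separate destination buffer, which matches the guidance that LDDQU suits code paths where no store-to-load forwarding is expected.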
