SIMD Optimizations and Microarchitectures – Intel ARCHITECTURE IA-32 User Manual


Optimizing for SIMD Integer Applications

aligned versions; this can reduce the performance gains when using
the 128-bit SIMD integer extensions. The general guidelines on the
alignment of memory operands are:

— The greatest performance gains can be achieved when all memory streams are 16-byte aligned.

— Reasonable performance gains are possible if roughly half of all memory streams are 16-byte aligned, and the other half are not.

— Little or no performance gain may result if none of the memory streams is 16-byte aligned; in this case, use of the 64-bit SIMD integer instructions may be preferable.
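
The alignment guidelines above can be illustrated with a minimal sketch in C using SSE2 intrinsics. The helper names `is_aligned16` and `add_words`, and the restriction that the word count is a multiple of 8, are assumptions made for this illustration and do not come from the manual. The sketch takes the aligned path (movdqa) only when every stream is 16-byte aligned and otherwise falls back to unaligned loads and stores (movdqu).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Adds n 16-bit words of a and b into dst (wrapping), eight words per step.
   Hypothetical restriction for brevity: n must be a multiple of 8. */
static void add_words(int16_t *dst, const int16_t *a, const int16_t *b, size_t n) {
    assert(n % 8 == 0);
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        /* All streams aligned: aligned loads and stores (movdqa). */
        for (size_t i = 0; i < n; i += 8) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            _mm_store_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
        }
    } else {
        /* At least one stream unaligned: unaligned forms (movdqu). */
        for (size_t i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
        }
    }
}
```

Both paths produce the same result; only the load and store forms differ, which is where the performance difference described above would come from.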

Loop counters need to be updated because each 128-bit integer
instruction operates on twice the amount of data as the 64-bit integer
counterpart.
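
As a rough sketch of the loop-counter change, the following C function processes 16 bytes per iteration with a 128-bit saturating add (paddusb on XMM registers), so the counter advances by 16 where a 64-bit MMX loop would advance by 8. The function name `add_sat_u8` and the scalar tail loop are assumptions added for illustration, not part of the manual.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Saturating add of n unsigned bytes: dst[i] = min(255, a[i] + b[i]). */
static void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    /* 128-bit version: 16 bytes per iteration, so the counter steps by 16.
       The 64-bit MMX counterpart would step by 8. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    /* Scalar tail for the remaining n % 16 bytes. */
    for (; i < n; i++) {
        unsigned s = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```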

Extension of the pshufw instruction (shuffle word across 64-bit integer operand) across a full 128-bit operand is emulated by a combination of the following instructions: pshufhw, pshuflw, pshufd.
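
One possible combination is sketched below in C with SSE2 intrinsics: it reverses all eight 16-bit words of a 128-bit operand by reversing the low four words (pshuflw), reversing the high four words (pshufhw), and then swapping the two 64-bit halves (pshufd). The function name `reverse_words` and the choice of a full reversal as the example shuffle are assumptions for illustration only.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stddef.h>
#include <stdint.h>

/* Writes the eight words of in[] to out[] in reverse order. */
static void reverse_words(int16_t out[8], const int16_t in[8]) {
    assert(out != NULL && in != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* pshuflw: reverse words 0..3, leave words 4..7 untouched. */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    /* pshufhw: reverse words 4..7, leave words 0..3 untouched. */
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    /* pshufd: swap the low and high 64-bit halves (doublewords 2,3,0,1). */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    _mm_storeu_si128((__m128i *)out, v);
}
```

A shuffle that moves words between the two 64-bit halves always needs the pshufd step, since pshuflw and pshufhw each operate within one half only.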

The 64-bit shift-by-bit instructions (psrlq, psllq) are extended to 128 bits in these ways:

— use of psrlq and psllq, along with masking logic operations

— code sequence is rewritten to use the psrldq and pslldq instructions (shift double quad-word operand by bytes).
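
A minimal sketch of a full 128-bit left shift by a bit count follows, in C with SSE2 intrinsics. It is one way to combine the instructions named above, not necessarily the sequence the manual has in mind: psllq and psrlq shift each 64-bit half, pslldq moves the carried-out bits of the low half into the high half, and por merges the two. The function name `shl128_bits` and the restriction 0 < n < 64 are assumptions for this illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Logical left shift of the whole 128-bit value by n bits, 0 < n < 64. */
static __m128i shl128_bits(__m128i v, int n) {
    assert(n > 0 && n < 64);
    __m128i cnt  = _mm_cvtsi32_si128(n);        /* shift count in an XMM register */
    __m128i rcnt = _mm_cvtsi32_si128(64 - n);
    __m128i shifted = _mm_sll_epi64(v, cnt);    /* psllq: each half shifted on its own */
    __m128i carry   = _mm_srl_epi64(v, rcnt);   /* psrlq: bits leaving the top of each half */
    carry = _mm_slli_si128(carry, 8);           /* pslldq by 8 bytes: low-half carry moves to the
                                                   high half, high-half carry is discarded */
    return _mm_or_si128(shifted, carry);        /* por: merge carry into the high half */
}
```

For shift counts that are whole multiples of 8 bits, pslldq or psrldq alone is sufficient, since those instructions shift the full 128-bit operand by bytes.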

SIMD Optimizations and Microarchitectures

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst® microarchitecture. The following sections discuss optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.

On Intel Core Solo and Intel Core Duo processors, lddqu behaves
identically to movdqu by loading 16 bytes of data irrespective of
address alignment.
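
The sketch below, in C with SSE3 intrinsics, loads the same 16 bytes with lddqu and with movdqu and checks that the results match. It can only show the functional side (both instructions return the same data at any alignment); the point above about identical behavior on Intel Core Solo and Intel Core Duo processors concerns how the load is executed, which a test like this cannot observe. The function name `lddqu_matches_movdqu` is hypothetical, and the `target("sse3")` attribute assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <pmmintrin.h>  /* SSE3 intrinsics: _mm_lddqu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang attribute that enables SSE3 for this function
   without requiring -msse3 on the command line. */
__attribute__((target("sse3")))
static int lddqu_matches_movdqu(const uint8_t *p) {
    assert(p != NULL);
    __m128i a = _mm_lddqu_si128((const __m128i *)p);  /* lddqu  */
    __m128i b = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    /* All 16 byte lanes compare equal when the mask is 0xFFFF. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```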
