SIMD Optimizations and Microarchitectures, Packed Floating-Point Performance – Intel ARCHITECTURE IA-32 User Manual

Optimizing for SIMD Floating-point Applications

SIMD Optimizations and Microarchitectures

Pentium M, Intel Core Solo and Intel Core Duo processors have a
different microarchitecture than the Intel NetBurst microarchitecture.
The following sub-section discusses optimizing SIMD code that targets
Intel Core Solo and Intel Core Duo processors.

Packed Floating-Point Performance

Most packed SIMD floating-point code will speed up on Intel Core Solo
processors relative to Pentium M processors. This is due to improved
decoding of packed SIMD instructions.

The improvement in packed floating-point performance on the Intel
Core Solo processor over the Pentium M processor depends on several
factors. Generally, code that is decoder-bound, has a mixture of
integer and packed floating-point instructions, or both can expect a
significant gain. Code that is limited by execution latency and has a
“cycles per instruction” ratio greater than one will not benefit from
the decoder improvement.
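
The "cycles per instruction" criterion above can be made concrete with a small calculation. The following is a minimal sketch, not code from the manual: the function names are hypothetical, and the two counts are assumed to come from a profiler or hardware performance counters. A ratio above one does not by itself show that code is latency-bound; it is only the numeric half of the condition described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles per instruction (CPI) from two event counts.
   The counts are assumed to come from a profiler or performance counters. */
double cycles_per_instruction(uint64_t cycles, uint64_t instructions_retired)
{
    assert(instructions_retired != 0);  /* CPI is undefined for zero instructions */
    return (double)cycles / (double)instructions_retired;
}

/* Returns 1 when CPI is strictly greater than one, 0 otherwise.
   This checks only the ratio; whether the code is actually limited by
   execution latency has to be established separately. */
int cpi_exceeds_one(uint64_t cycles, uint64_t instructions_retired)
{
    return cycles_per_instruction(cycles, instructions_retired) > 1.0;
}
```

For instance, 300 cycles over 200 retired instructions gives a CPI of 1.5, so `cpi_exceeds_one(300, 200)` returns 1.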

Example 5-13 Calculating Dot Products from AOS (continued)

movaps xmm0, Vector1 ; the destination has a3, a2, a1, a0
movaps xmm1, Vector2 ; the destination has b3, b2, b1, b0
movaps xmm2, Vector3 ; the destination has c3, c2, c1, c0
movaps xmm3, Vector4 ; the destination has d3, d2, d1, d0
mulps  xmm0, xmm1    ; a3b3, a2b2, a1b1, a0b0
mulps  xmm2, xmm3    ; c3d3, c2d2, c1d1, c0d0
haddps xmm0, xmm2    ; the destination has c3d3+c2d2,
                     ; c1d1+c0d0, a3b3+a2b2, a1b1+a0b0
haddps xmm0, xmm0    ; the destination has
                     ; c3d3+c2d2+c1d1+c0d0, a3b3+a2b2+a1b1+a0b0,
                     ; c3d3+c2d2+c1d1+c0d0, a3b3+a2b2+a1b1+a0b0
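
For readers working in C rather than assembly, the same sequence can be sketched with SSE3 intrinsics. This is a minimal sketch under stated assumptions, not code from the manual: the function name `dot_pairs_aos` and its two-element output layout are hypothetical, unaligned loads (`_mm_loadu_ps`) stand in for `movaps` so that callers need not guarantee 16-byte alignment, and a scalar fallback with the same pairwise summation order is used when the compiler does not have SSE3 enabled.

```c
#include <assert.h>
#if defined(__SSE3__)
#include <pmmintrin.h>  /* SSE3 intrinsics, including _mm_hadd_ps */
#endif

/* Hypothetical helper: computes two 4-element dot products at once.
   out[0] = a . b, out[1] = c . d */
void dot_pairs_aos(const float a[4], const float b[4],
                   const float c[4], const float d[4], float out[2])
{
    assert(a && b && c && d && out);
#if defined(__SSE3__)
    /* Element-wise products, as in the two mulps instructions. */
    __m128 ab = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 cd = _mm_mul_ps(_mm_loadu_ps(c), _mm_loadu_ps(d));

    /* First haddps. Lanes, low to high:
       a0b0+a1b1, a2b2+a3b3, c0d0+c1d1, c2d2+c3d3 */
    __m128 h = _mm_hadd_ps(ab, cd);

    /* Second haddps. Lanes, low to high: a.b, c.d, a.b, c.d */
    h = _mm_hadd_ps(h, h);

    float lanes[4];
    _mm_storeu_ps(lanes, h);
    out[0] = lanes[0];
    out[1] = lanes[1];
#else
    /* Scalar fallback that adds in the same pairwise order as haddps. */
    out[0] = (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
    out[1] = (c[0] * d[0] + c[1] * d[1]) + (c[2] * d[2] + c[3] * d[3]);
#endif
}
```

The intrinsics path is selected only when the compiler defines `__SSE3__` (for example with `-msse3` on GCC or Clang); otherwise the scalar path is compiled. Both paths write the first dot product to `out[0]` and the second to `out[1]`, matching the two low lanes left in `xmm0` by the assembly sequence.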
