Use of cvttps2pi/cvttss2si Instructions, Example 5-10 – Intel ARCHITECTURE IA-32 User Manual



Optimizing for SIMD Floating-point Applications


Use of cvttps2pi/cvttss2si Instructions

The cvttps2pi and cvttss2si instructions encode the truncate/chop
rounding mode implicitly in the instruction, thereby taking precedence
over the rounding mode specified in the MXCSR register. This behavior
can eliminate the need to change the rounding mode from
round-nearest, to truncate/chop, and then back to round-nearest to
resume computation. Frequent changes to the MXCSR register should be
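
The implicit truncation described above can be sketched in C with the scalar intrinsics that map to cvttss2si and cvtss2si. This is a minimal illustration, not code from the manual: the function names are hypothetical, and the self-check assumes MXCSR starts in its default round-nearest mode.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_cvttss_si32, _mm_cvtss_si32, rounding-mode macros */

/* cvttss2si: truncates toward zero regardless of the MXCSR rounding mode. */
int to_int_truncate(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* cvtss2si: rounds according to the rounding mode currently set in MXCSR. */
int to_int_mxcsr(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* The round trip that the truncating form makes unnecessary:
   switch MXCSR to truncate, convert, then restore the saved mode. */
int to_int_truncate_via_mxcsr(float f)
{
    /* volatile discourages compile-time folding of the conversion,
       which would not see the run-time MXCSR setting. */
    volatile float v = f;
    unsigned int saved = _MM_GET_ROUNDING_MODE();
    int r;

    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    r = _mm_cvtss_si32(_mm_set_ss(v));
    _MM_SET_ROUNDING_MODE(saved);
    return r;
}

/* Self-check, assuming the default round-nearest mode on entry. */
void conversion_self_check(void)
{
    assert(to_int_truncate(2.7f) == 2);            /* chopped toward zero */
    assert(to_int_truncate(-2.7f) == -2);          /* chopped toward zero */
    assert(to_int_mxcsr(2.7f) == 3);               /* round-nearest */
    assert(to_int_mxcsr(2.5f) == 2);               /* ties round to even */
    assert(to_int_truncate_via_mxcsr(2.7f) == 2);  /* same result, two MXCSR writes */
}
```

Both to_int_truncate and to_int_truncate_via_mxcsr produce the chopped value, but only the second one writes MXCSR, twice per call.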

Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps

#include <xmmintrin.h>  // SSE intrinsics

// Vertex_soa is not defined in this excerpt. in->x, in->y, in->z and in->w
// are assumed to each hold four 16-byte-aligned floats, and out is assumed
// to be 16-byte aligned, as _mm_load_ps and _mm_store_ps require.
// Elements are listed from lowest to highest.
void horiz_add_intrin(Vertex_soa *in, float *out)
{
    __m128 tmm0, tmm1, tmm2, tmm3, tmm4, tmm5, tmm6;  // Temporary variables

    tmm0 = _mm_load_ps(in->x);                // tmm0 = A1 A2 A3 A4
    tmm1 = _mm_load_ps(in->y);                // tmm1 = B1 B2 B3 B4
    tmm2 = _mm_load_ps(in->z);                // tmm2 = C1 C2 C3 C4
    tmm3 = _mm_load_ps(in->w);                // tmm3 = D1 D2 D3 D4

    tmm5 = tmm0;                              // tmm5 = A1 A2 A3 A4
    tmm5 = _mm_movelh_ps(tmm5, tmm1);         // tmm5 = A1 A2 B1 B2
    tmm1 = _mm_movehl_ps(tmm1, tmm0);         // tmm1 = A3 A4 B3 B4
    tmm5 = _mm_add_ps(tmm5, tmm1);            // tmm5 = A1+A3 A2+A4 B1+B3 B2+B4

    tmm4 = tmm2;                              // tmm4 = C1 C2 C3 C4
    tmm2 = _mm_movelh_ps(tmm2, tmm3);         // tmm2 = C1 C2 D1 D2
    tmm3 = _mm_movehl_ps(tmm3, tmm4);         // tmm3 = C3 C4 D3 D4
    tmm3 = _mm_add_ps(tmm3, tmm2);            // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4

    // _mm_shuffle_ps takes its two low elements from the first operand and
    // its two high elements from the second, so both shuffles read the A/B
    // sums from tmm5 and the C/D sums from tmm3.
    tmm6 = _mm_shuffle_ps(tmm5, tmm3, 0xDD);  // tmm6 = A2+A4 B2+B4 C2+C4 D2+D4
    tmm5 = _mm_shuffle_ps(tmm5, tmm3, 0x88);  // tmm5 = A1+A3 B1+B3 C1+C3 D1+D3
    tmm6 = _mm_add_ps(tmm6, tmm5);            // tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
                                              //        C1+C2+C3+C4 D1+D2+D3+D4
    _mm_store_ps(out, tmm6);
}
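
One way to sanity-check the lane bookkeeping in a routine of this shape is to compare it against a scalar reference. The sketch below is an assumption-laden illustration rather than part of the manual: Vertex_soa_sketch is a hypothetical stand-in for the undefined Vertex_soa type, the function names are invented, and unaligned loads and stores are used so the check does not depend on alignment.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical SoA layout: four vertices, one array per component. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
    float w[4];
} Vertex_soa_sketch;

/* Scalar reference: out[0] = sum of x[], out[1] = sum of y[], and so on.
   The additions are grouped the same way as in the SIMD version. */
void horiz_add_scalar(const Vertex_soa_sketch *in, float *out)
{
    const float *rows[4] = { in->x, in->y, in->z, in->w };
    int r;

    for (r = 0; r < 4; ++r)
        out[r] = (rows[r][0] + rows[r][2]) + (rows[r][1] + rows[r][3]);
}

/* Compact movelh/movehl/shuffle horizontal add with named intermediates. */
void horiz_add_simd(const Vertex_soa_sketch *in, float *out)
{
    __m128 a = _mm_loadu_ps(in->x);  /* A1 A2 A3 A4 */
    __m128 b = _mm_loadu_ps(in->y);  /* B1 B2 B3 B4 */
    __m128 c = _mm_loadu_ps(in->z);  /* C1 C2 C3 C4 */
    __m128 d = _mm_loadu_ps(in->w);  /* D1 D2 D3 D4 */

    /* ab = A1+A3 A2+A4 B1+B3 B2+B4, cd = C1+C3 C2+C4 D1+D3 D2+D4 */
    __m128 ab = _mm_add_ps(_mm_movelh_ps(a, b), _mm_movehl_ps(b, a));
    __m128 cd = _mm_add_ps(_mm_movelh_ps(c, d), _mm_movehl_ps(d, c));

    __m128 first  = _mm_shuffle_ps(ab, cd, 0x88);  /* A1+A3 B1+B3 C1+C3 D1+D3 */
    __m128 second = _mm_shuffle_ps(ab, cd, 0xDD);  /* A2+A4 B2+B4 C2+C4 D2+D4 */

    _mm_storeu_ps(out, _mm_add_ps(first, second));
}

/* Self-check with small integer values, which float addition sums exactly. */
void horiz_add_self_check(void)
{
    Vertex_soa_sketch v = {
        { 1.0f, 2.0f, 3.0f, 4.0f },
        { 10.0f, 20.0f, 30.0f, 40.0f },
        { 100.0f, 200.0f, 300.0f, 400.0f },
        { 1000.0f, 2000.0f, 3000.0f, 4000.0f }
    };
    float simd[4], ref[4];
    int i;

    horiz_add_simd(&v, simd);
    horiz_add_scalar(&v, ref);
    for (i = 0; i < 4; ++i)
        assert(simd[i] == ref[i]);
}
```

Using a distinct magnitude per component (ones, tens, hundreds, thousands) makes a misrouted lane show up immediately in the output.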
