Intel ARCHITECTURE IA-32 User Manual



General Optimization Guidelines


to early out). However, avoid introducing more than two distinct values
in total for the floating-point control word, or there will be a large
performance penalty. See “Floating-point Modes”.

User/Source Coding Rule 13. (H impact, ML generality) Use fast
float-to-int routines, FISTTP, or SSE2 instructions. If coding these routines,
use the fisttp instruction if SSE3 is available, or the cvttss2si and cvttsd2si
instructions if coding with Streaming SIMD Extensions 2.

Many libraries do more work than is necessary. The FISTTP instruction
in SSE3 can convert floating-point values to 16-bit, 32-bit or 64-bit
integers using truncation without accessing the floating-point control
word (FCW). The instructions cvttss2si/cvttsd2si save many µops and
some store-forwarding delays over some compiler implementations.
Because these instructions always truncate, they avoid changing the
rounding mode.
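
As an illustration of the rule above, the following is a minimal sketch of truncating conversions written with compiler intrinsics. The helper names trunc_float_to_int and trunc_double_to_int are hypothetical and not taken from the manual. The sketch assumes a compiler that provides emmintrin.h; when SSE2 is not detected at compile time it falls back to a plain C cast, which also truncates toward zero. Inputs are assumed to fit in a 32-bit integer.

```c
#include <stdint.h>

/* Detect SSE2 support at compile time (GCC/Clang and MSVC macros). */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics; also pulls in the SSE header */
#define HAVE_SSE2_TRUNC 1
#else
#define HAVE_SSE2_TRUNC 0
#endif

/* Truncate a single-precision value toward zero (hypothetical helper).
 * With SSE2 this maps to cvttss2si and leaves the rounding mode alone. */
static int32_t trunc_float_to_int(float x)
{
#if HAVE_SSE2_TRUNC
    return _mm_cvttss_si32(_mm_set_ss(x));
#else
    return (int32_t)x;   /* a C cast also truncates toward zero */
#endif
}

/* Truncate a double-precision value toward zero (hypothetical helper).
 * With SSE2 this maps to cvttsd2si. */
static int32_t trunc_double_to_int(double x)
{
#if HAVE_SSE2_TRUNC
    return _mm_cvttsd_si32(_mm_set_sd(x));
#else
    return (int32_t)x;
#endif
}
```

Many compilers already emit cvttss2si/cvttsd2si for an ordinary cast when targeting SSE2, so the intrinsics mainly make the intended instruction explicit.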

User/Source Coding Rule 14. (M impact, ML generality) Break dependence
chains where possible.

Removing data dependence enables the out-of-order engine to extract
more ILP from the code. When summing up the elements of an array,
use partial sums instead of a single accumulator. For example, to
calculate z = a + b + c + d, instead of:

    x = a + b;
    y = x + c;
    z = y + d;

use:

    x = a + b;
    y = c + d;
    z = x + y;
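
The array-summing advice above can be sketched as follows. This is an illustrative example rather than code from the manual: the function name sum_partial and the choice of four accumulators are assumptions. Note that for floating-point data, regrouping the additions can change the rounding of the result slightly compared with a single accumulator.

```c
#include <stddef.h>

/* Sum n floats using four independent accumulators (hypothetical helper).
 * Each accumulator forms its own dependence chain, so the additions in
 * one loop iteration do not have to wait on each other. */
static float sum_partial(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop: at most three leftover elements. */
    for (; i < n; ++i) {
        s0 += a[i];
    }

    /* Combine the partial sums pairwise, as in z = (a + b) + (c + d). */
    return (s0 + s1) + (s2 + s3);
}
```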

User/Source Coding Rule 15. (M impact, ML generality) Usually, math
libraries take advantage of the transcendental instructions (for example,
fsin) when evaluating elementary functions. If there is no critical need to
evaluate the transcendental functions using the extended precision of 80 bits,
applications should consider an alternate, software-based approach, such as
a look-up-table-based algorithm using interpolation techniques. It is possible
to improve transcendental performance with these techniques by choosing the
