Optimize instruction scheduling, Enable vectorization – Intel ARCHITECTURE IA-32 User Manual

General Optimization Guidelines

Avoid longer latency instructions: integer multiplies and divides.
Replace them with alternate code sequences (e.g., use shifts instead
of multiplies).

Use the lea instruction and the full range of addressing modes to do
address calculation.

Some types of stores use more µops than others; try to use simpler
store variants and/or reduce the number of stores.

Avoid use of complex instructions that require more than 4 µops.

Avoid instructions that unnecessarily introduce dependence-related
stalls: inc and dec instructions, partial register operations (8/16-bit
operands).

Avoid use of ah, bh, and the other upper 8-bit halves of the 16-bit
registers, because accessing them requires a shift operation internally.

Use xor and pxor instructions to clear registers and break
dependencies for integer operations; also use xorps and xorpd to
clear XMM registers for floating-point operations.

Use efficient approaches for performing comparisons.
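
As a rough illustration of the "shifts instead of multiplies" guideline above, the C sketch below rewrites a multiply by 10 and a divide by 8 as shift-and-add sequences. The function names mul10_shift and div8_shift are hypothetical and do not come from this manual, and the sketch assumes unsigned 32-bit operands. An optimizing compiler will often perform this kind of rewrite on its own for constant operands, so the example mainly shows what the alternate code sequence looks like.

```c
#include <stdint.h>

/* Hypothetical helper: multiply by 10 without a multiply instruction.
   For unsigned arithmetic, x * 10 == (x << 3) + (x << 1), i.e. 8x + 2x. */
static uint32_t mul10_shift(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Hypothetical helper: divide an unsigned value by 8 with a right shift.
   This matches x / 8 only because x is unsigned. */
static uint32_t div8_shift(uint32_t x)
{
    return x >> 3;
}
```

The shift forms give the same results as x * 10 and x / 8 for every unsigned 32-bit input, including values where the multiply wraps around. Signed division by a power of two needs an extra rounding adjustment and is not covered by this sketch.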

Optimize Instruction Scheduling

Consider latencies and resource constraints.

Calculate store addresses as early as possible.

Enable Vectorization

Use the smallest possible data type. This enables more parallelism
with the use of a longer vector.

Arrange the nesting of loops so the innermost nesting level is free of
inter-iteration dependencies. It is especially important to avoid the
case where the store of data in an earlier iteration happens lexically
after the load of that data in a future iteration (called
lexically-backward dependence).
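
The lexically-backward dependence described above can be sketched in C as follows. The function and array names are hypothetical, and the sketch assumes that a, b, and c do not overlap. In update_backward, the store to a[i] appears after the load of a[i - 1] in the source text, yet the value stored in one iteration is the value loaded in the next. In update_forward, the two statements are swapped so that the store comes first lexically; the result is the same. Whether a particular compiler vectorizes either form is not claimed here.

```c
#include <stddef.h>

/* Lexically-backward form: the load of a[i - 1] is written before the
   store to a[i], but the value loaded in iteration i was stored in
   iteration i - 1. */
static void update_backward(int *a, int *b, const int *c, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        b[i] = a[i - 1] + 1;  /* load, lexically first */
        a[i] = c[i];          /* store, lexically after the load */
    }
}

/* Same computation with the statements swapped, so the store now
   precedes the dependent load in the source text (a forward dependence).
   The swap is legal because the two statements touch different elements
   of a within a single iteration. */
static void update_forward(int *a, int *b, const int *c, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = c[i];          /* store, lexically first */
        b[i] = a[i - 1] + 1;  /* load of the value stored one iteration ago */
    }
}
```

Both functions leave a and b in the same final state; only the lexical order of the store and the dependent load differs.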
