Latency and throughput with memory operands – Intel ARCHITECTURE IA-32 User Manual

IA-32 Intel® Architecture Optimization

The latency and throughput of transcendental instructions can vary
substantially in a dynamic execution environment. Only an
approximate value or a range of values is given for these
instructions.

The FXCH instruction has 0 latency in code sequences. However, it
is limited to an issue rate of one instruction per clock cycle.

The load constant instructions, FINCSTP, and FDECSTP have 0
latency in code sequences.

Selection of conditional jump instructions should be based on the
recommendations in the section “Branch Prediction” to improve the
predictability of branches. When branches are predicted
successfully, the latency of Jcc is effectively zero.

RCL/RCR with a shift count of 1 are optimized. RCL/RCR with a
shift count other than 1 execute more slowly. This applies to the
Pentium 4 and Intel Xeon processors.

Latency and Throughput with Memory Operands

The discussion in this section applies to the Intel Pentium 4 and Intel
Xeon processors. Typically, an instruction with a memory address as the
source operand adds one more μop to the “reg, reg” instruction type
listed in Tables C-1 through C-7. However, the throughput in most cases
remains the same because the load operation utilizes port 2 without
affecting port 0 or port 1.

Many IA-32 instructions accept a memory address as either the source
operand or the destination operand. The former is commonly referred
to as a load operation, and the latter as a store operation.
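
The distinction between the two operand roles can be sketched in C. The
function names below are hypothetical, and the instructions named in the
comments are only what a compiler might plausibly emit; the actual
encoding depends on the compiler and its options.

```c
#include <stdint.h>

/* Memory as the SOURCE operand: the addition reads *src, which is a
 * load. A compiler may fold this into one instruction with a memory
 * source, e.g. "add eax, [mem]" (assumed output, not guaranteed). */
uint32_t add_from_memory(uint32_t acc, const volatile uint32_t *src)
{
    return acc + *src;
}

/* Memory as the DESTINATION operand: the assignment writes *dst,
 * which is a store, e.g. "mov [mem], eax" (assumed output). */
void store_to_memory(volatile uint32_t *dst, uint32_t value)
{
    *dst = value;
}
```

The volatile qualifier is used here only so that the memory accesses
are not optimized away; it is not needed to obtain a load or a store.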

The latency of IA-32 instructions that perform either a load or a store
operation is typically longer than the latency of the corresponding
register-to-register form of the instruction. This is because load
or store operations require access to the cache hierarchy and, in some
cases, the memory sub-system.
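
One common way to observe load latency is a dependent "pointer chase",
in which the address of each load comes from the result of the previous
load. The following is a minimal sketch under stated assumptions: the
names build_ring, chase, and ns_per_load are hypothetical, the timer is
the coarse standard clock() function, and the result reflects whatever
machine and cache level the code happens to run on, not the values in
the tables of this appendix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Link slot i to slot (i + stride) % n. */
void build_ring(uint32_t *next, size_t n, size_t stride)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i)
        next[i] = (uint32_t)((i + stride) % n);
}

/* Perform `steps` dependent loads. Each load needs the index produced
 * by the previous one, so the loads cannot overlap in time. */
uint32_t chase(const uint32_t *next, uint32_t start, size_t steps)
{
    uint32_t idx = start;
    for (size_t i = 0; i < steps; ++i)
        idx = next[idx];
    return idx;
}

/* Rough average time per dependent load, in nanoseconds. */
double ns_per_load(const uint32_t *next, size_t steps)
{
    assert(steps > 0);
    clock_t t0 = clock();
    volatile uint32_t sink = chase(next, 0, steps); /* keep the result live */
    clock_t t1 = clock();
    (void)sink;
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

A small ring that fits in the first-level cache approximates the cache
hit latency; a ring much larger than the caches, with a large stride,
moves the measurement toward the memory sub-system. A meaningful reading
needs a step count large enough to exceed the timer resolution.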
