Example 3-8, Simple four-iteration loop – Intel ARCHITECTURE IA-32 User Manual

IA-32 Intel® Architecture Optimization

The examples that follow illustrate coding adjustments that enable an
algorithm to benefit from SSE. The same techniques may be used for
single-precision floating-point, double-precision floating-point, and
integer data under SSE, SSE2, and MMX technology.

As a basis for the usage model discussed in this section, consider the
simple loop shown in Example 3-8.

Note that the loop runs for only four iterations. Because a 128-bit SSE
register holds four single-precision values, this allows a simple
replacement of the code with Streaming SIMD Extensions.
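As a rough illustration of that replacement, the four scalar additions can be expressed as one packed addition. The following is a minimal sketch using the compiler intrinsics from <xmmintrin.h>, not the manual's own listing. The name add_sse is hypothetical, and the code assumes that a, b, and c are 16-byte aligned, as the aligned load and store intrinsics require.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical SSE counterpart of add().
   Assumption: a, b, and c each point to four floats on a 16-byte boundary. */
void add_sse(float *a, float *b, float *c)
{
    __m128 va = _mm_load_ps(a);      /* aligned load of a[0..3] */
    __m128 vb = _mm_load_ps(b);      /* aligned load of b[0..3] */
    __m128 vc = _mm_add_ps(va, vb);  /* four single-precision additions in one operation */
    _mm_store_ps(c, vc);             /* aligned store to c[0..3] */
}
```

If alignment cannot be guaranteed, _mm_loadu_ps and _mm_storeu_ps accept unaligned addresses, typically at some cost in performance.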

For the optimal use of the Streaming SIMD Extensions that need data
alignment on the 16-byte boundary, all examples in this chapter assume
that the arrays passed to the routine, a, b, c, are aligned to 16-byte
boundaries by a calling routine. For the methods to ensure this
alignment, please refer to the application notes for the Pentium 4
processor.
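For orientation only, one way a calling routine might obtain and check 16-byte alignment is sketched below. This is an assumption-laden example built on C11 aligned_alloc; it is not necessarily the method described in the application notes. The helper names is_aligned16 and alloc_floats16 are hypothetical.

```c
#include <stdint.h>  /* uintptr_t */
#include <stdlib.h>  /* aligned_alloc, free (C11), size_t */

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: allocates room for n floats on a 16-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 16.
   Returns NULL on failure; release the memory with free(). */
float *alloc_floats16(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    return (float *)aligned_alloc(16, bytes);
}
```

For arrays with static or automatic storage, a C11 declaration such as `_Alignas(16) float a[4];` requests the same alignment. Compilers without C11 support usually offer their own alignment extensions.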

The sections that follow provide details on the coding methodologies:
inlined assembly, intrinsics, C++ vector classes, and automatic
vectorization.

Example 3-8 Simple Four-Iteration Loop

void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}
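The same pattern carries over to the integer and double-precision data mentioned earlier. The sketch below uses SSE2 intrinsics from <emmintrin.h>. The function names add_i32_sse2 and add_f64_sse2 are hypothetical, and all pointers are assumed to be 16-byte aligned. A 128-bit register holds four 32-bit integers but only two doubles, so the double-precision variant needs two packed additions to cover four elements.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, __m128d and related operations */

/* Hypothetical integer variant: c[i] = a[i] + b[i] for four 32-bit integers.
   Assumption: a, b, and c are 16-byte aligned. */
void add_i32_sse2(int *a, int *b, int *c)
{
    __m128i va = _mm_load_si128((const __m128i *)a);  /* aligned load of a[0..3] */
    __m128i vb = _mm_load_si128((const __m128i *)b);  /* aligned load of b[0..3] */
    _mm_store_si128((__m128i *)c, _mm_add_epi32(va, vb));
}

/* Hypothetical double-precision variant: two doubles per register,
   so four elements take two packed additions.
   Assumption: a, b, and c are 16-byte aligned. */
void add_f64_sse2(double *a, double *b, double *c)
{
    __m128d lo = _mm_add_pd(_mm_load_pd(a),     _mm_load_pd(b));      /* elements 0..1 */
    __m128d hi = _mm_add_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));  /* elements 2..3 */
    _mm_store_pd(c,     lo);
    _mm_store_pd(c + 2, hi);
}
```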
