40555  Rev. 3.00  June 2006

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™
ccNUMA Multiprocessor Systems

Chapter 3: Analysis and Recommendations

afterwards no longer needs the data structure, and if only one of the worker threads needs the data
structure. In other words, the data structure is not truly shared between the worker threads.

It is best in this case to use a data initialization scheme that avoids incorrect data placement due to
first touch: either allow each worker thread to first touch its own data, or explicitly pin the data
associated with each worker thread on the node where that thread runs.
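A minimal sketch of the first approach, assuming a Linux-style OS whose default policy places a page on the node of the thread that first writes it (the thread count and chunk size here are illustrative):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 4
#define CHUNK_BYTES (1 << 20)   /* 1 MB per worker; illustrative size */

static char *buffer;            /* allocated but not yet touched */

/* Each worker first-touches (and later uses) only its own chunk, so
   under the default local-allocation policy the pages backing that
   chunk are placed on the node where the worker runs. */
static void *worker(void *arg)
{
    long id = (long)arg;
    char *chunk = buffer + id * (size_t)CHUNK_BYTES;

    memset(chunk, 0, CHUNK_BYTES);   /* first touch happens here ... */
    /* ... followed by the real work on chunk[] */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];

    /* malloc() reserves virtual pages only; physical placement is
       decided when each page is first written. */
    buffer = malloc((size_t)NUM_THREADS * CHUNK_BYTES);

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    free(buffer);
    return 0;
}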

Certain OSs provide memory placement tools and APIs that also permit data migration. A worker
thread can use these to migrate data from the node where the start-up thread performed the first touch
to the node where the worker thread needs it. Migration has a cost, however, and is less efficient than
using the correct data initialization scheme in the first place.
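As an illustration of the migration option, here is a sketch using the Linux move_pages(2) interface from libnuma (the helper name migrate_to_node is ours; link with -lnuma):

#include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */
#include <stddef.h>
#include <unistd.h>

/* Migrate the pages backing [addr, addr + bytes) to 'target_node'.
   Returns 0 on success. A worker thread could call this on data the
   start-up thread first-touched on another node. */
static int migrate_to_node(void *addr, size_t bytes, int target_node)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = (bytes + page - 1) / page;
    /* For very large regions, allocate these arrays on the heap. */
    void *pages[count];
    int nodes[count], status[count];

    for (unsigned long i = 0; i < count; i++) {
        pages[i] = (char *)addr + i * page;   /* one entry per page  */
        nodes[i] = target_node;               /* desired destination */
    }
    /* pid 0 means the calling process; MPOL_MF_MOVE moves only pages
       mapped exclusively by this process. */
    return move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) ? -1 : 0;
}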

If it is not possible to modify the application to use a correct data initialization scheme, or if the data
is truly shared by the various worker threads, as in a database application, then a technique called
node interleaving can be used to improve performance. Node interleaving allows memory to be
interleaved across any subset of nodes in the multiprocessor system. When the node interleaving
policy is used, it overrides the default local allocation policy applied by the OS on first touch.
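A sketch of the programmatic route on Linux with libnuma (link with -lnuma); numa_alloc_interleaved() spreads the backing pages round-robin across all nodes:

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA API not available\n");
        return 1;
    }

    /* Pages backing this allocation are interleaved round-robin across
       all nodes, overriding local allocation at first-touch time. */
    size_t bytes = 16 * 1024;
    void *shared = numa_alloc_interleaved(bytes);
    if (shared == NULL)
        return 1;

    memset(shared, 0, bytes);   /* first touch: one node per page */

    /* ... hand 'shared' to the worker threads ... */

    numa_free(shared, bytes);
    return 0;
}

On Linux the same policy is also reachable without code changes; for example, numactl --interleave=all ./app launches an application with all of its memory interleaved across the nodes.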

Let us assume that the data structure shared between the worker threads in this case is 16 KB in size.
If the default policy of local allocation is used, the entire 16-KB data structure resides on the node
where the start-up thread performs the first touch. Using the node interleaving policy instead, the
16-KB data structure can be interleaved on first touch such that the first 4 KB ends up on node 0, the
next 4 KB on node 1, the next 4 KB on node 2, and so on, assuming enough physical memory is
available on each node. Thus, instead of all the memory residing on a single node and making that
node the bottleneck, the memory is spread out across all nodes.
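To mirror this 16-KB walk-through, a sketch using the lower-level Linux mbind(2) call (assuming a four-node system, nodes 0 through 3; the policy must be applied before the pages are first touched):

#include <numaif.h>    /* mbind(), MPOL_INTERLEAVE; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = 16 * 1024;                 /* four 4-KB pages */
    void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    unsigned long nodemask = 0xF;             /* nodes 0,1,2,3 */
    if (mbind(buf, bytes, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }

    /* First touch now distributes the pages: byte 0 faults a page onto
       node 0, byte 4096 onto node 1, and so on. */
    for (size_t off = 0; off < bytes; off += 4096)
        ((char *)buf)[off] = 0;

    return 0;
}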

The tools and APIs that support explicit thread and memory placement, mentioned in the previous
sections, can also be used by an application to apply the node interleaving policy to its memory. For
additional details, refer to Section A.8 on page 46.

By default, the granularity of interleaving offered by the tools and APIs is the size of the virtual page
supported by the hardware: 4 KB when the system is configured for normal pages (the default) and
2 MB when the system is configured for large pages. Therefore, any benefit from node interleaving is
obtained only if the data being accessed is significantly larger than the virtual page size.
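As a quick way to gauge this (a sketch using the standard sysconf interface; the 16-KB size is the one from the example above), an application can check how many pages its shared structure actually spans:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* 4096 with normal pages */
    size_t data_bytes = 16 * 1024;       /* size of the shared data */
    size_t pages = (data_bytes + page - 1) / page;

    printf("%zu bytes span %zu page(s) of %ld bytes\n",
           data_bytes, pages, page);
    /* Interleaving can place at most 'pages' distinct pages on distinct
       nodes; data much larger than one page is needed to see a benefit. */
    return 0;
}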

If data is being accessed by three or more cores, then it is better to interleave data across the nodes
that access the data than to leave it resident on a single node. We anticipate that using this rule of
thumb could give a significant performance improvement. However, developers are advised to
experiment with their applications to measure any performance change.

A good example of the use of node interleaving is observed with SPECjbb2005 running on Sun JVM
1.5.0_04-GA. Using node interleaving improved the peak throughput score reported by SPECjbb2005
by 8%. We observe that, because this benchmark starts with a single thread and then ramps up to eight
threads, all threads end up accessing memory resident on a single node by virtue of first touch.
