40555 Rev. 3.00 June 2006
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Chapter 4: Conclusions

Data placement tools can also be useful when a thread needs more data than the physical memory available on a single node. Some operating systems additionally allow data migration through these tools or APIs: data can be migrated from the node where it was first touched to the node where it is subsequently accessed. Because this migration has a cost, it should not be performed frequently. For additional details on the tools and APIs offered by various operating systems for thread and memory placement, refer to Section A.7 on page 44.

To avoid false sharing, it is recommended that threads running on different cores not share data resident within a single cache line.

Advanced developers may also run into interesting cases when experimenting with the thread and data placement tools and APIs. Sometimes, when comparing workloads that are identical in all respects except for the thread and data placement used, the expected symmetry in performance is not observed. Such cases can usually be explained by examining the underlying system and avoiding saturation of resources caused by an imbalanced load.

The buffer queues constitute one such resource. The lengths of these queues are configured by the BIOS within hardware-specific limits that are specified in the BIOS and Kernel Developer's Guide for the particular processor. Following AMD recommendations, the BIOS allocates these buffers on a link-by-link basis to optimize for the most common workloads.

In general, certain pathological access patterns should be avoided when possible: several nodes accessing data on a single node, or the all-to-all crossfire scenario, can saturate underlying resources such as HyperTransport™ link bandwidth and the HyperTransport buffer queues. AMD provides event profiling tools that developers can use to analyze whether their application exhibits such behavior.

AMD very strongly recommends keeping user-level and kernel-level locks aligned to their natural
boundaries.

Some compilers for AMD multiprocessor systems provide additional hooks to allow for automatic
parallelization of otherwise serial programs. There is also support for extensions to the OpenMP
directives that can be used by OpenMP programs to improve performance.

While all the previous conclusions are stated in the context of threads, they can also be applied to
processes.
