
Chapter 4

Conclusions

The single most important recommendation for most applications is to keep data local to the node where it is accessed. As long as a thread initializes the data it needs, that is, writes to it for the first time, a ccNUMA-aware OS will typically keep that data local to the node where the thread runs. By allocating data on the node that first touches it (the local-allocation-on-first-touch policy), a ccNUMA-aware OS makes data placement transparent and easy for most developers.
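
As an illustration of this first-touch behavior (a minimal sketch, not taken from this manual, assuming a Linux system and a C compiler with OpenMP support, for example gcc -O2 -fopenmp), the code below has each thread write the same slice of a shared array that it later computes on, so that a first-touch OS places those pages on the thread's node:

    /* First-touch initialization sketch: each thread writes its own slice
     * of the array before computing on it, so a first-touch ccNUMA-aware
     * OS places those pages on the node where that thread runs. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (16L * 1024 * 1024)

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        if (a == NULL)
            return 1;

        /* Initialization loop: same static schedule as the compute loop,
         * so each page is written first by the thread that will reuse it. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Compute loop: accesses are now mostly local to each thread's node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] += 1.0;

        printf("a[0] = %f\n", a[0]);
        free(a);
        return 0;
    }

The key point is that the initialization is not done serially by one thread; a serial initialization would first-touch every page from a single node and defeat local allocation.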

In some cases, if an application shows symptoms that its threads are being moved away from their data, it can be useful to explicitly pin each thread to a specific node. Several ccNUMA-aware OSs offer tools and APIs to influence thread placement. Typically, an OS scheduler uses load-balancing schemes to decide where to place threads; using these tools or APIs overrides the scheduler and hands control of thread placement to the developer. The developer should then schedule threads sensibly by adhering to the following guidelines (a pinning sketch follows the guidelines below):

•  When scheduling threads that mostly access independent data on an idle dual-core AMD multiprocessor system, first schedule one thread onto an idle core of each node until all nodes are used, and then move on to the remaining idle core of each node. In other words, schedule in node-major order first, followed by core-major order.

•  When scheduling multiple threads that mostly share data with one another on an idle dual-core AMD multiprocessor system, schedule threads onto both cores of an idle node first, and then move on to the next idle node, and so on. In other words, schedule in core-major order first, followed by node-major order.
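
The following is a minimal pinning sketch, assuming a Linux system with the libnuma library (link with -lnuma); these calls are not an interface defined by this manual, merely one way to hand thread placement to the developer:

    /* Pin the calling thread to 'node' so the scheduler no longer
     * migrates it away from its data (libnuma assumed). */
    #include <numa.h>
    #include <stdio.h>

    int pin_to_node(int node)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA is not available on this system\n");
            return -1;
        }
        if (node < 0 || node > numa_max_node()) {
            fprintf(stderr, "node %d does not exist\n", node);
            return -1;
        }
        /* Restrict this thread to the CPUs of the given node ... */
        if (numa_run_on_node(node) != 0)
            return -1;
        /* ... and make its future page allocations prefer that node. */
        numa_set_preferred(node);
        return 0;
    }

A worker thread that calls such a function once at start-up, before it first touches its data, keeps both the thread and its pages on the same node; which node to choose should follow the node-major or core-major guidelines above.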

By default, a ccNUMA-aware OS uses the local-allocation-on-first-touch policy for data placement. Although normally effective, this policy can become suboptimal if a thread first touches data on one node that it subsequently no longer needs, but which some other thread later accesses from a different node. In such cases it is best to change the data initialization scheme so that each thread initializes the data it needs and does not rely on another thread to do the initialization. Several ccNUMA-aware OSs also offer tools and APIs to influence data placement. Using these tools or APIs overrides the default local-allocation-on-first-touch policy and hands control of data placement to the developer.
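
For cases where the initialization scheme cannot easily be changed, explicit placement might look like the following sketch, again assuming Linux with libnuma rather than any interface defined in this manual:

    /* Place a buffer on a specific node explicitly instead of relying on
     * which thread happens to touch it first (libnuma assumed). */
    #include <numa.h>
    #include <stdlib.h>

    double *alloc_on_node(size_t count, int node)
    {
        if (numa_available() < 0)
            return calloc(count, sizeof(double));   /* ordinary fallback */
        /* Pages of this buffer are bound to 'node' regardless of which
         * thread first touches them. */
        return numa_alloc_onnode(count * sizeof(double), node);
    }

    void release(double *p, size_t count)
    {
        if (numa_available() < 0)
            free(p);
        else
            numa_free(p, count * sizeof(double));
    }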

If it is not possible to change the data initialization scheme, or if the data is truly shared by threads running on different nodes, then a technique called node interleaving of memory can be used. Node interleaving is recommended when the data resides on a single node and is accessed by three or more cores. The interleaving should be performed across the nodes from which the data is accessed, and should only be used when the data accessed is significantly larger than 4 KB (when the system is configured for normal pages, which is the default) or 2 MB (when the system is configured for large pages). Developers are advised to experiment with their applications to gauge the performance change due to node interleaving. For additional details on the tools and APIs offered by various OSs for node interleaving, refer to Section A.8 on page 46.
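
As a rough illustration only (assuming Linux with libnuma, which is not specified by this chapter, and interleaving over all allowed nodes rather than a chosen subset), a large shared buffer can be allocated with its pages spread round-robin across nodes:

    /* Node-interleaved allocation sketch (libnuma assumed). Only
     * worthwhile for data much larger than one page: 4 KB with normal
     * pages, 2 MB with large pages. */
    #include <numa.h>

    double *alloc_interleaved(size_t count)
    {
        if (numa_available() < 0)
            return NULL;
        /* Successive pages are placed round-robin over the allowed nodes,
         * so no single node's memory controller serves all of the traffic. */
        return numa_alloc_interleaved(count * sizeof(double));
    }

Whole applications can also be run with interleaved memory without code changes, for example with a command such as numactl --interleave=all ./app on Linux; as noted above, measure the effect, since interleaving spreads memory traffic across nodes at the cost of making some accesses remote.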
