2 monitoring amd gpus – HP Insight Cluster Management Utility User Manual



Running /opt/cmu/bin/cmu_config_nvidia adds a list of predefined GPU metrics to
ActionAndAlertsFile.txt. To monitor these metrics using the GUI, select the desired metrics
from the Monitoring sensors list as described in Figure 32 (page 88).

NOTE: Not all metrics are supported by all NVIDIA GPUs, and some lesser-used metrics may be
commented out within ActionAndAlertsFile.txt. To introduce/remove metrics from the
Monitoring sensors list, you can uncomment/comment out the associated lines inside
ActionAndAlertsFile.txt as described in “Action and alert files” (page 96).

NOTE: HP Insight CMU dynamically determines if a client has working GPUs when monitoring
is initially started after installation on the client. This monitoring process allows for configurations
that have clients with GPUs and clients without GPUs. If the GPUs are not working when monitoring
is started (or GPUs are added at a later date), redeploy monitoring to the client (see “Installing the
HP Insight CMU monitoring client” (page 85)) and restart monitoring to ensure the GPUs are
recognized.

6.5.7.2 Monitoring AMD GPUs

If your client nodes contain AMD GPUs and are running version 8.83.5 or newer of the AMD GPU
driver, you can monitor your GPUs with HP Insight CMU.

If you haven’t done so already, install the AMD GPU driver version 8.83.5 or newer on your client
nodes. This can be done two ways:

Install the AMD GPU driver manually on one of the client nodes, back up the client image,
and clone the remaining clients with this new image.

Use the script /opt/cmu/contrib/cmu_install_amd to install the AMD GPU driver on
all running clients. For details, see the file /opt/cmu/contrib/cmu_install_amd.README.

To enable GPU monitoring, the /opt/cmu/etc/ActionAndAlertsFile.txt file must be
updated with entries for HP Insight CMU GPU monitoring. This is done by running the script
/opt/cmu/bin/cmu_config_amd. This script takes the number of GPUs on each client as an argument.

The following example updates ActionAndAlertsFile.txt to monitor clients that have 2
GPUs each. Monitoring must be restarted for the updates to take effect.

# cmu_config_amd 2
You are about to update the CMU ActionsAndAlerts file with metrics for monitoring AMD GPUs.
Continue? [y/n] y
Configuring GPU monitoring in CMU...
GPU monitoring configured successfully.
Copy of orignial /opt/cmu/etc/ActionAndAlertsFile.txt can found in
/opt/cmu/etc/ActionAndAlertsFile.txt_before_cmu_config_amd_config
Please restart CMU ('/etc/init.d/cmu restart') to enable these changes.
# /etc/init.d/cmu restart
.
.
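To review which entries the script added, the backup copy named in the output above can be compared with the updated file. The following is only a minimal sketch and is not part of HP Insight CMU: the function name show_added_lines is hypothetical, and it relies solely on the standard diff and sed utilities.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with HP Insight CMU): print the lines that
# appear in the updated file but not in the backup copy.
show_added_lines() {
    before="$1"   # backup copy written by cmu_config_amd
    after="$2"    # updated ActionAndAlertsFile.txt
    # diff prefixes lines that come from the second file (added or changed)
    # with "> "; sed prints only those lines with the prefix stripped.
    diff "$before" "$after" | sed -n 's/^> //p'
}

# Example call, using the paths reported in the script output:
# show_added_lines \
#     /opt/cmu/etc/ActionAndAlertsFile.txt_before_cmu_config_amd_config \
#     /opt/cmu/etc/ActionAndAlertsFile.txt
```

Running the comparison before restarting HP Insight CMU gives a chance to check the new entries first.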

Running /opt/cmu/bin/cmu_config_amd adds a list of predefined GPU metrics to
ActionAndAlertsFile.txt. To monitor these metrics using the GUI, select the desired metrics
from the Monitoring sensors list as described in Figure 32 (page 88).

NOTE: Not all metrics are supported by all AMD GPUs, and some metrics may be commented
out within ActionAndAlertsFile.txt. To introduce/remove metrics from the Monitoring
sensors list, you can uncomment/comment out the associated lines inside
ActionAndAlertsFile.txt as described in “Action and alert files” (page 96).
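Commenting and uncommenting entries can also be scripted. The sketch below is illustrative only and makes assumptions that this section does not confirm: that a metric entry is a line beginning with the metric name followed by whitespace, and that a disabled entry is the same line prefixed with "#". The function names and the metric name in the usage comment are hypothetical; check the actual file layout described in “Action and alert files” (page 96) before adapting it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HP Insight CMU).
# ASSUMPTION: an entry is a line starting with the metric name followed by
# whitespace; a commented-out entry is that same line prefixed with "#".
# NAME is used inside a basic regular expression, so restrict it to letters,
# digits, and underscores.

# comment_metric FILE NAME: prefix the entry for NAME with "#".
comment_metric() {
    file="$1"; name="$2"
    sed "s/^${name}[[:space:]]/#&/" "$file" > "$file.tmp" &&
        cat "$file.tmp" > "$file" &&   # rewrite in place, keeping file mode
        rm -f "$file.tmp"
}

# uncomment_metric FILE NAME: remove the leading "#" from the entry for NAME.
uncomment_metric() {
    file="$1"; name="$2"
    sed "s/^#\(${name}[[:space:]]\)/\1/" "$file" > "$file.tmp" &&
        cat "$file.tmp" > "$file" &&
        rm -f "$file.tmp"
}

# Example (hypothetical metric name; back up the file first):
# cp /opt/cmu/etc/ActionAndAlertsFile.txt /tmp/ActionAndAlertsFile.txt.bak
# comment_metric /opt/cmu/etc/ActionAndAlertsFile.txt gpu0_temp
```

As with the configuration script, restarting monitoring is presumably needed before the change shows up in the Monitoring sensors list.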

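NOTE: HP Insight CMU dynamically determines if a client has working GPUs when monitoring
is initially started after installation on the client. This monitoring process allows for configurations
that have clients with GPUs and clients without GPUs. If the GPUs are not working when monitoring
is started (or GPUs are added at a later date), redeploy monitoring to the client (see “Installing the
HP Insight CMU monitoring client” (page 85)) and restart monitoring to ensure the GPUs are
recognized.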

106 Monitoring a cluster with HP Insight CMU
