Related events, “Node Failed to Rejoin SRD on Start-up” event – HP Matrix Operating Environment Software User Manual

the CMS to have the CMS re-deploy its view of the SRD. If the CMS cannot be contacted, the SRD
in the deployed.config file is deployed as long as all nodes agree.
In general, when an SRD is disrupted because a node goes down, a CMS goes down, or network
communication fails, gWLM attempts to reform the SRD. gWLM maintains the concept
of a cluster for the nodes in an SRD. In a cluster, one node is a master and the other nodes are
nonmasters. If the master node loses contact with the rest of the SRD, the rest of the SRD can continue
without it, as a partial cluster, by unanimously agreeing on a new master. If a nonmaster loses
communication with the rest of the SRD, the resulting partial cluster continues operation without
the lost node. The master simply omits the missing node until it becomes available again.
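The two cases above can be summarized in a short shell sketch. It is illustrative only: the function name partial_cluster_action and its output strings are hypothetical, and it does not model how gWLM nodes actually reach agreement on a new master.

```shell
# Illustrative sketch only; not part of gWLM.
# Usage: partial_cluster_action MASTER NODE...
#   MASTER   name of the current master node
#   NODE...  nodes that can still communicate with one another
partial_cluster_action() {
    master=$1
    shift
    for node in "$@"; do
        if [ "$node" = "$master" ]; then
            # The master is still present: it omits the missing node(s).
            echo "master-omits-missing-nodes"
            return 0
        fi
    done
    # The master was lost: the remaining nodes must unanimously
    # agree on a new master to continue as a partial cluster.
    echo "remaining-nodes-agree-on-new-master"
}
```

For example, partial_cluster_action nodeA nodeB nodeC reports that the remaining nodes must agree on a new master, because nodeA is not among the nodes still in contact.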
You can use the gwlmstatus command to monitor availability. It can tell you whether any hosts
are unable to rejoin a node's SRD as well as whether hosts in the SRD are nonresponsive. For more
information, see gwlmstatus(1M).
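As a rough illustration, the following shell sketch runs gwlmstatus a few times at a fixed interval and timestamps each run. The function name poll_srd_status, the GWLMSTATUS variable, and the default count and interval are assumptions made for this example. It also assumes that gwlmstatus is on the PATH and can be run without options; see gwlmstatus(1M) for the supported options and output.

```shell
# Sketch only; the helper name and defaults are hypothetical.
# GWLMSTATUS can be set to the full path of the command if it is not on PATH.
GWLMSTATUS=${GWLMSTATUS:-gwlmstatus}

# Usage: poll_srd_status [COUNT] [INTERVAL_SECONDS]
poll_srd_status() {
    count=${1:-3}
    interval=${2:-60}
    i=0
    while [ "$i" -lt "$count" ]; do
        date                      # timestamp each status report
        "$GWLMSTATUS" || echo "gwlmstatus exited with status $?" >&2
        i=$((i + 1))
        # Sleep between runs, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Calling poll_srd_status 5 30 would produce five timestamped status reports, 30 seconds apart.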
NOTE:
Attempts to reform SRDs might time out, leaving no SRD deployed and consequently no
management of resource allocations. If this occurs, see the HP Matrix Operating Environment
Release Notes and follow the actions suggested in the section titled “Data Missing in Real-time
Monitoring.”
Related events
You can configure the following System Insight Manager events regarding this automatic restart
feature:
• Node Failed to Rejoin SRD on Start-up
• SRD Reformed with Partial Set of Nodes
• SRD Communication Issue
For information on enabling and viewing these events, refer to Optimize→Global Workload
Manager→Events.
You can then view these events using the Event Lists item in the left pane of System Insight Manager.
The following sections explain how to handle some of the events.
“Node Failed to Rejoin SRD on Start-up” event
If you see the event “Node Failed to Rejoin SRD on Start-up”:
1. Restart the gwlmagent on each managed node in the affected SRD:
# /opt/gwlm/bin/gwlmagent --restart
2. Verify the agent rejoined the SRD by monitoring the Shared Resource Domain View in System
Insight Manager or by using the gwlm monitor command.
3. If the problem persists, check the files /var/opt/gwlm/gwlmagent.log.0 and
/var/opt/gwlm/gwlmcmsd.log.0 for additional diagnostic messages.
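Steps 1 and 3 above can be sketched as a pair of shell helpers. This is a sketch under stated assumptions, not an HP-supplied script: the function names, the DRY_RUN and GWLM_LOG_DIR variables, and the use of ssh with root access to reach each managed node are all assumptions. Only the gwlmagent path and the two log file names come from the steps above.

```shell
# Sketch only; helper names and variables are hypothetical.
GWLMAGENT=/opt/gwlm/bin/gwlmagent            # path from step 1
GWLM_LOG_DIR=${GWLM_LOG_DIR:-/var/opt/gwlm}  # log directory from step 3

# Step 1: restart the agent on each host given as an argument.
# With DRY_RUN=1 (the default) the commands are printed, not run.
restart_agents() {
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh $host $GWLMAGENT --restart"
        else
            # Assumes ssh access as root to the managed node.
            ssh "$host" "$GWLMAGENT --restart"
        fi
    done
}

# Step 3: print the last N lines (default 20) of each log file
# that is readable on the local host.
show_recent_log_lines() {
    lines=${1:-20}
    for name in gwlmagent.log.0 gwlmcmsd.log.0; do
        log=$GWLM_LOG_DIR/$name
        if [ -r "$log" ]; then
            echo "== $log"
            tail -n "$lines" "$log"
        fi
    done
}
```

Running restart_agents node1 node2 with the default DRY_RUN=1 prints the two ssh commands so that they can be reviewed before being run with DRY_RUN=0. Step 2 is not automated here.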
“SRD Communication Issue” and “SRD Reformed with Partial Set of Nodes” events
NOTE:
Reforming with a partial set of nodes requires a minimum of three managed nodes in the
SRD.
NOTE:
“SRD Communication Issue” events are not enabled by default. To see these events,
configure your events in System Insight Manager through the HP Matrix OE visualization menu
bar using Tools→Global Workload Manager→Events.
If you have an SRD containing n nodes and you get n - 1 of the “SRD Communication Issue” events
but no “SRD Reformed with Partial Set of Nodes” events within 5 minutes (assuming an allocation
interval of 15 seconds) of the first “SRD Communication Issue” event, you might need to restart the
gwlmagent on each managed node in the affected SRD: