Things to Double-Check, Things to Note – HP StorageWorks Scalable File Share User Manual

NOTE:

Passwordless ssh must be set up on the HP SFS servers before using this -c option.

5.2.5 Things to Double-Check

Ensure that the following conditions are met:

The .sig and .last files should be removed from /var/lib/heartbeat/crm when a
new cib.xml is copied there. Otherwise, Heartbeat ignores the new cib.xml and uses the
last one.

The /var/lib/heartbeat/crm/cib.xml file must be owned by user hacluster and group
haclient. Heartbeat writes to cib.xml to add status information. If it cannot write to
cib.xml, Heartbeat loses track of the state of other nodes in the failover group and may
power-cycle them to put them in a state it understands.

The /etc/ha.d/authkeys file must be readable and writable only by root (mode 0600).

The host name for each node in /etc/ha.d/ha.cf must match the value returned by the
hostname or uname -n command on that node.
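
The conditions above can be checked with a short read-only script. The following is a minimal sketch, assuming GNU find and stat (as on a typical Linux server) and that ha.cf lists nodes on lines beginning with the node directive. The function names and the CRM_DIR and HA_DIR variables are illustrative and are not part of HP SFS or Heartbeat.

```shell
#!/bin/sh
# Sketch only: read-only checks for the conditions listed above.
# Paths come from the text; CRM_DIR and HA_DIR are hypothetical overrides for testing.
CRM_DIR="${CRM_DIR:-/var/lib/heartbeat/crm}"
HA_DIR="${HA_DIR:-/etc/ha.d}"

# List leftover .sig and .last files in directory $1.
stale_cib_files() {
    find "$1" -maxdepth 1 -type f \( -name '*.sig' -o -name '*.last' \) 2>/dev/null
}

# Succeed if file $1 has owner:group equal to $2 (GNU stat).
has_owner() {
    [ "$(stat -c '%U:%G' "$1" 2>/dev/null)" = "$2" ]
}

# Succeed if file $1 has octal mode $2, for example 600 (GNU stat).
has_mode() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "$2" ]
}

# Succeed if name $2 appears as a word on a "node" line of ha.cf file $1.
node_listed() {
    awk -v n="$2" '$1 == "node" { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                   END { exit !found }' "$1" 2>/dev/null
}

# Run all four checks and print one line per problem found.
run_checks() {
    [ -z "$(stale_cib_files "$CRM_DIR")" ] || echo "WARN: stale .sig/.last files in $CRM_DIR"
    has_owner "$CRM_DIR/cib.xml" hacluster:haclient || echo "WARN: cib.xml is not hacluster:haclient"
    has_mode "$HA_DIR/authkeys" 600 || echo "WARN: authkeys is not mode 0600"
    node_listed "$HA_DIR/ha.cf" "$(uname -n)" || echo "WARN: $(uname -n) not on a node line in ha.cf"
    return 0
}
```

Calling run_checks as root on each node prints one WARN line for every condition that is not met and prints nothing when all four hold. The script changes no files, so any correction (removing stale files, chown, chmod, editing ha.cf) remains a manual step.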

5.2.6 Things to Note

When Heartbeat starts, it waits for a period to give its failover peer time to boot and get
started. This time is specified by the init_dead parameter in the ha.cf file (60 seconds
in the example ha.cf file). Consequently, there may be an unexpected time lag before
Heartbeat starts Lustre the first time. This process is quicker if both nodes start Heartbeat
at about the same time.

Heartbeat uses iLO for STONITH I/O fencing. If a Heartbeat configuration has two nodes
in a failover pair, both nodes should be up and running Heartbeat. If a node boots, starts
Heartbeat, and does not see Heartbeat running on the other node within a reasonable time, it
power-cycles the other node.

5.2.7 Preventing Collisions Among Multiple HP SFS Servers

You may skip this section if no other HP SFS servers are on any of the accessible subnets.

If multiple HP SFS servers are installed on the same network, corresponding node pairs will
experience Heartbeat conflicts. For example, on two servers: Atlas with nodes atlas[1-4], and
World with nodes world[1-6], Heartbeat on nodes atlas1 and atlas2 will conflict with Heartbeat
on nodes world1 and world2. Nodes 3 and 4 of each server will experience the same conflict.

Although Heartbeat is working correctly on each server pair, error messages are reported in
/var/log/messages. For example:

atlas1 heartbeat: [10762]: ERROR: process_status_message: bad node [world1] in message
atlas1 heartbeat: [10762]: ERROR: MSG: Dumping message with 12 fields
atlas1 heartbeat: [10762]: ERROR: MSG[0] : [t=status]
atlas1 heartbeat: [10762]: ERROR: MSG[1] : [st=active]
atlas1 heartbeat: [10762]: ERROR: MSG[2] : [dt=2710]
atlas1 heartbeat: [10762]: ERROR: MSG[3] : [protocol=1]
atlas1 heartbeat: [10762]: ERROR: MSG[4] : [src=smile1]
atlas1 heartbeat: [10762]: ERROR: MSG[5] : [(1)srcuuid=0x14870a38(36 27)]
atlas1 heartbeat: [10762]: ERROR: MSG[6] : [seq=7e2ebf]
atlas1 heartbeat: [10762]: ERROR: MSG[7] : [hg=4a1282e1]
atlas1 heartbeat: [10762]: ERROR: MSG[8] : [ts=4a90b239]
atlas1 heartbeat: [10762]: ERROR: MSG[9] : [ld=0.14 0.16 0.10 1/233 32227]
atlas1 heartbeat: [10762]: ERROR: MSG[10] : [ttl=3]
atlas1 heartbeat: [10762]: ERROR: MSG[11] : [auth=1 6954d02d4e8bb99db2a8c89dcaa537b5678e222a]
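
To see which foreign nodes are producing these messages, the "bad node" lines can be summarized. The following is a minimal sketch that assumes the message format shown above; the function name foreign_nodes is illustrative.

```shell
#!/bin/sh
# Sketch only: summarize "bad node" errors in a syslog file.
# Prints one line per foreign node name followed by its message count.
foreign_nodes() {
    sed -n 's/.*process_status_message: bad node \[\([^]]*\)] in message.*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example, foreign_nodes /var/log/messages lists each node name that Heartbeat rejected, together with the number of times it was reported.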

These error messages increase the size of the /var/log/messages file, making analysis difficult.
To prevent this issue, edit /etc/ha.d/ha.cf on every node so that the multicast address on the
mcast line is unique to each server node pair. For example, on atlas[1,2] leave the line:

mcast eth0 239.0.0.1 694 1 0
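
Before editing, it can help to confirm which multicast group each node currently uses. The following is a minimal sketch, assuming the second and third fields of the mcast line are the interface and the multicast group, as in the line above; the function name mcast_groups is illustrative.

```shell
#!/bin/sh
# Sketch only: print the interface and multicast group from each mcast line
# of the ha.cf file given as $1.
mcast_groups() {
    awk '$1 == "mcast" { print $2, $3 }' "$1"
}
```

Running mcast_groups /etc/ha.d/ha.cf on each node and comparing the results shows whether two node pairs share a group.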

On world[1,2], change it:
