Problem: cannot start parallel task, Problem: bad performance, Problem: cannot start process on front end – PAR Technologies PARASTATION5 V5 User Manual


ParaStation5 Administrator's Guide

Or, logged on to this node, run psiadmin, which also starts up the ParaStation daemon psid. See Section 6.1, “Problem: psiadmin returns error” for more details.

Check the logfile /var/log/messages on this node for error messages. Verify that all nodes have an identical configuration (/etc/parastation.conf).
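The log check can be narrowed down with a small filter. The following is only a sketch: it assumes that messages from the ParaStation daemon contain the string "psid" somewhere on the line, which may not hold for every syslog setup. The function names psid_lines and scan_logfile are illustrative and not part of ParaStation.

```python
import re

# Assumption: daemon messages mention "psid" on the line (case-insensitive).
PATTERN = re.compile(r"psid", re.IGNORECASE)


def psid_lines(lines):
    """Return the lines that mention psid, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_logfile(path="/var/log/messages"):
    """Filter a logfile; reading it usually requires root privileges."""
    with open(path, errors="replace") as handle:
        return psid_lines(handle)
```

Calling scan_logfile() on the affected node prints nothing by itself; loop over the returned list to display the matching lines.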

6.3. Problem: cannot start parallel task

Problem: a parallel task cannot be launched and an error is reported:

PSI: PSI_createPartition: Resource temporarily unavailable

Check for available nodes and active parallel tasks. Check for user or group restrictions.

If the error

PSI: dospawn: spawn to node 1 failed.
PSE: Could not spawn './mpi_latency' process 1, error = Bad \
file descriptor.

is reported, check if the current directory holding the program mpi_latency is accessible on all nodes.
Verify that the program is executable on all nodes.
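The two checks above can be scripted and run on every node in turn. This is a minimal sketch, assuming a Python interpreter is available on the nodes; check_program is a hypothetical helper, not a ParaStation tool, and how the script is distributed to the nodes is left open.

```python
import os


def check_program(program, directory="."):
    """Return a list of problems found for `program` inside `directory`."""
    # The directory itself may be missing, e.g. a file system not mounted here.
    if not os.path.isdir(directory):
        return [f"directory not accessible: {directory}"]
    path = os.path.join(directory, program)
    if not os.path.isfile(path):
        return [f"program not found: {path}"]
    # os.access tests the execute permission for the calling user.
    if not os.access(path, os.X_OK):
        return [f"program not executable: {path}"]
    return []
```

For the example in this section, check_program("mpi_latency", os.getcwd()) returns an empty list on a node where the program can be started.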

6.4. Problem: bad performance

Verify that the proper interconnect and/or transport is used: check for environment variables controlling
transport (see Section 5.8, “Controlling ParaStation5 communication paths” and ps_environment(5)).
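To see which of these variables are set in the current environment, a short listing helps. The sketch below assumes that the relevant variable names start with PSP_ or PSI_; this prefix list is an assumption made for illustration, and ps_environment(5) remains the authoritative reference for the actual names.

```python
import os

# Assumption: variables of interest start with one of these prefixes.
# See ps_environment(5) for the authoritative list of names.
PREFIXES = ("PSP_", "PSI_")


def transport_variables(environ=None, prefixes=PREFIXES):
    """Return the matching environment variables, sorted by name."""
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(prefixes)
    }
```

Run it in the same shell, or from the same job script, that starts the parallel task, so that the listing reflects what the task actually inherits.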

Watch protocol counters, e.g. counters indicating timeouts, retries, errors or other bad conditions. For p4sock, check recv_net_data and recv_user. See Section 5.2, “ParaStation5 protocol p4sock”.
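Counters are easier to judge as differences between two readings than as absolute values. The sketch below assumes that two snapshots have already been collected as dictionaries mapping counter name to value; how the counters are read is deliberately left out, and counter_deltas is a hypothetical helper.

```python
def counter_deltas(before, after):
    """Return {name: after - before} for every counter that changed."""
    deltas = {}
    for name, value in after.items():
        # A counter missing from the first snapshot is treated as 0.
        change = value - before.get(name, 0)
        if change != 0:
            deltas[name] = change
    return deltas
```

Take one snapshot before and one after a test run; counters that grow during the run are the ones worth a closer look.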

Look for a crystal ball!

Or contact <[email protected]>.

6.5. Problem: different groups of nodes are seen as up or down

Problem: depending on the node on which psiadmin is run, different groups of nodes are seen as "up" or "down".

Check for identical configuration on each node, e.g. compare the configuration file /etc/parastation.conf on each node.
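On larger clusters, comparing checksums is quicker than comparing files by eye. The following sketch assumes the content of /etc/parastation.conf has already been fetched from each node into a dictionary mapping node name to file content as bytes; the fetching step and the helper names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def group_by_digest(configs):
    """Map each SHA-256 digest to the sorted list of nodes sharing it."""
    groups = defaultdict(list)
    for node, content in configs.items():
        groups[hashlib.sha256(content).hexdigest()].append(node)
    return {digest: sorted(nodes) for digest, nodes in groups.items()}


def is_consistent(configs):
    """True if all nodes hold byte-identical configuration files."""
    return len(group_by_digest(configs)) <= 1
```

If group_by_digest returns more than one group, the nodes in the smaller groups are the first candidates for a stale or hand-edited configuration file.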

6.6. Problem: cannot start process on front end

Problem: starting a job is canceled with the error message

Connecting client 139.27.166.22:44784 (rank 6) failed : Network is unreachable
PSIlogger: Child with rank 12 exited with status 1.
