Topic: Execute parfeap on cluster with several nodes

TUDoR

  • Jr. Member
Execute parfeap on cluster with several nodes
« on: June 03, 2016, 06:04:49 AM »
Dear Feap Users,

I have a problem starting a parfeap job on multiple nodes, which are listed in a host_file (compute0:24, compute1:24, etc.).
A minimal MPI hello-world program works properly, but when I execute parfeap via, e.g.,
mpirun -n 48 parfeap

I get the following error message:
INTERNAL ERROR: Invalid error class (66) encountered while returning from
PMPI_Bcast.  Please file a bug report.
Fatal error in PMPI_Bcast: Unknown error.  Please file a bug report., error stack:
(unknown)(): connection failure
[cli_0]: aborting job:
Fatal error in PMPI_Bcast: Unknown error.  Please file a bug report., error stack:
(unknown)(): connection failure

Does anybody happen to know what causes this problem and how to resolve it?

luc

  • Full Member
Re: Execute parfeap on cluster with several nodes
« Reply #1 on: June 03, 2016, 07:17:34 AM »
First, please give some more details on the software and libraries you are using: which versions of FEAP, PETSc, and MPI?
Second, you should run everything in debug mode: reconfigure PETSc with the flag --with-debugging=1 and recompile it, then recompile FEAP with the debug flags -g -Wall passed to the Fortran compiler. If valgrind is installed on the cluster, you can use that too.
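
As a rough sketch of those debug steps (not an exact procedure for any particular cluster): the script below only prints the commands by default, so they can be reviewed before anything is run. The placeholder paths, the helper name run_in, the process count of 2, and the valgrind options are assumptions to adapt to your installation.

```shell
#!/bin/sh
# Sketch: print (DRY_RUN=1, the default) or execute (DRY_RUN=0) the debug steps.
# PETSC_DIR and FEAPHOME8_X fall back to placeholder paths; set them for your cluster.
PETSC_DIR="${PETSC_DIR:-/path/to/petsc}"
FEAPHOME8_X="${FEAPHOME8_X:-/path/to/feap}"
DRY_RUN="${DRY_RUN:-1}"

# run_in DIR CMD...: show the command in dry-run mode, otherwise run it inside DIR.
run_in() {
    dir=$1
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "(in $dir) $*"
    else
        (cd "$dir" && "$@")
    fi
}

# 1. Reconfigure PETSc as a debug build, then rebuild it with the make
#    command that configure prints at the end of its output.
run_in "$PETSC_DIR" ./configure --with-debugging=1

# 2. Rebuild FEAP and parfeap with -g -Wall added to the Fortran compiler
#    flags (edit your FEAP makefile settings by hand; not automated here).

# 3. Run a small job under valgrind, one log file per process (%p = PID).
run_in . mpirun -n 2 valgrind -q --tool=memcheck --num-callers=20 \
    --log-file=valgrind.log.%p "$FEAPHOME8_X/parfeap/feap"
```

Starting with 2 processes keeps the valgrind logs manageable; once that runs cleanly, the same command can be repeated across nodes.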

Make sure that you use the absolute path in the mpirun command:

mpirun -n 48 $FEAPHOME8_X/parfeap/feap

Also make sure that the parallel input files are generated correctly and that they are accessible to all the processes.
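
A minimal sketch of how these checks could be scripted, assuming an MPICH-style hostfile with one "host:slots" entry per line (as in the question) and passwordless ssh to the compute nodes. The function names are made up for illustration and are not part of FEAP or MPI.

```shell
#!/bin/sh
# Sum the slots in a hostfile with lines of the form "host:slots".
# A line without ":slots" counts as one slot; blank lines and # comments are skipped.
count_slots() {
    awk -F: '/^[ \t]*#/ || /^[ \t]*$/ { next }
             { n += (NF > 1 ? $2 : 1) }
             END { print n + 0 }' "$1"
}

# List the host names (the text before the colon) from the same file.
list_hosts() {
    awk -F: '/^[ \t]*#/ || /^[ \t]*$/ { next } { print $1 }' "$1"
}

# From the launch node, check that every host can execute the binary
# and read the given input file. Assumes passwordless ssh to each host.
check_hosts() {
    hostfile=$1
    exe=$2
    input=$3
    for h in $(list_hosts "$hostfile"); do
        ssh "$h" "test -x '$exe' && test -r '$input'" \
            || echo "$h: cannot access $exe or $input"
    done
}
```

For a host_file containing only compute0:24 and compute1:24, count_slots host_file prints 48, which should be at least the value passed to -n. check_hosts host_file "$FEAPHOME8_X/parfeap/feap" /absolute/path/to/input_file then reports any node that cannot see the executable or the input file (the input path here is a placeholder).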

FEAP_Admin

  • Administrator
  • FEAP Guru
Re: Execute parfeap on cluster with several nodes
« Reply #2 on: June 03, 2016, 07:26:33 AM »
This seems to be an internal MPI/PETSc issue.  See the following (old) bug report

https://trac.mpich.org/projects/mpich/ticket/1565

for some clues on the issue.