FEAP User Forum
FEAP => Parallel FEAP => Topic started by: TUDoR on June 03, 2016, 06:04:49 AM
-
Dear Feap Users,
I have a problem when I start a parfeap job on multiple nodes specified in a host_file (compute0:24, compute1:24, etc.).
A minimal MPI hello-world program runs properly, but when I execute parfeap via, e.g.,
mpirun -n 48 parfeap
I get the following error message:
INTERNAL ERROR: Invalid error class (66) encountered while returning from
PMPI_Bcast. Please file a bug report.
Fatal error in PMPI_Bcast: Unknown error. Please file a bug report., error stack:
(unknown)(): connection failure
[cli_0]: aborting job:
Fatal error in PMPI_Bcast: Unknown error. Please file a bug report., error stack:
(unknown)(): connection failure
Does anybody by chance know what causes this problem and how to resolve it?
-
First, please give some more details on the software/libraries you are using: which versions of FEAP, PETSc, and MPI?
Second, run everything in debug mode: reconfigure PETSc with the flag --with-debugging=1 and rebuild it, then recompile FEAP with the debug flags -g -Wall passed to the Fortran compiler. If valgrind is installed on the cluster, you can use that too.
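The reconfigure-and-rebuild step could look roughly like this (a sketch only: --with-debugging=1 is the actual PETSc configure option, but the makefile targets and the FF variable for the FEAP build are illustrative and depend on your installation):

```shell
# Reconfigure PETSc with debugging enabled (run from $PETSC_DIR)
./configure --with-debugging=1
make all

# Rebuild parallel FEAP with debug flags passed to the Fortran compiler
# (FF here is an assumed makefile variable; check your makefile.in)
cd $FEAPHOME8_X/parfeap
make clean
make FF="mpif90 -g -Wall"
```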
Make sure that you use absolute path in the mpirun command:
mpirun -n 48 $FEAPHOME8_X/parfeap/feap
Also make sure that the parallel input files are generated correctly and that they are accessible to all processes.
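Putting the checklist together, a launch could be sketched as follows (the hostfile contents mirror the compute0:24, compute1:24 layout from the question; the exact mpirun options depend on your MPI implementation):

```shell
# host_file: one entry per node with its slot count, e.g.
#   compute0:24
#   compute1:24

# Launch parfeap with an absolute path so every node finds the binary
mpirun -n 48 -f host_file $FEAPHOME8_X/parfeap/feap
```

With MPICH-style launchers the hostfile is passed with -f; Open MPI uses --hostfile instead.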
-
This seems to be an internal MPI/PETSc issue. See the following (old) bug report
https://trac.mpich.org/projects/mpich/ticket/1565
for some clues on the issue.