Author Topic: limited number of time steps

aleximos

  • Jr. Member
  • **
  • Posts: 34
limited number of time steps
« on: April 22, 2014, 12:47:07 AM »
Hi!

I was calculating an example that was supposed to run for about 15000 time steps. Although no errors are written to parrun.log, the run always seems to terminate after 7200 steps (I tried it a few times). When this happens, there are 3 "error lines" in the console:

  Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code
  [cli_0]: aborting job
  Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

or

  Fatal error in MPI_Recv: Other MPI error
  [cli_0]: aborting job
  Fatal error in MPI_Recv: Other MPI error

I can't find any other sign of what is going wrong in the output files. Is this because of the number of steps? Could there be any other reason? How may I get further information on what is going on there?

FEAP_Admin

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 993
Re: limited number of time steps
« Reply #1 on: April 22, 2014, 06:02:31 PM »
There are no limitations built into FEAP regarding the number of time steps so that should not be the problem.
You can easily test this with a simple problem that you know is very well behaved -- like a linear elastic problem.  Pick
one that you can do serially as well as in parallel.

The error message itself says that you have encountered an MPI error but provides little additional detail.
To track this down you will have to determine which MPI call is causing the error and then interrogate the
error codes.  If you run with the PETSc option of "attach_debugger_on_error" you should be able to get better information.

aleximos

  • Jr. Member
  • **
  • Posts: 34
Re: limited number of time steps
« Reply #2 on: April 26, 2014, 02:19:00 AM »
Thanks for your answer!

I found out that the number of time steps is not the problem. After changing the example the error occurs at different time steps. Maybe it is necessary to change the subject of this thread?

I don't know exactly what you mean by "run with the PETSc option". This is how I call my parallel calculation inside a bash script:

"$petmpi -n $nproc -f $HOSTFILE $PARFEAP $options > parrun.log <<EOT
$parallel_ifile




y
EOT"

with the variables:

$petmpi - path of petscmpiexec (PETSc version 3.2-p7)
$nproc - number of processors
$HOSTFILE - file with the IPs and the number of processors for each IP
$PARFEAP - path of parFEAP's executable
$options - '-ksp_type cg -pc_type jacobi'
$parallel_ifile - name of the input file for first partition

Now "attach_debugger_on_error" would be an additional argument of this call? Or do I have to set it while configuring PETSc?

Actually, I did not write that script, and the person who did has left, so there is no one to explain it to me. Nevertheless, I would like to understand what is going on there and debug my problem. Do you know a good handbook on this topic?

Prof. S. Govindjee

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 1164
Re: limited number of time steps
« Reply #3 on: April 26, 2014, 10:45:01 AM »
Actually it is on_error_attach_debugger.  The PETSc website lists lots of other very useful options.  Try
changing your options:

$options - '-ksp_type cg -pc_type jacobi' --> $options - '-ksp_type cg -pc_type jacobi -on_error_attach_debugger'
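
As a minimal sketch of that change inside the bash script (assuming the script stores the solver options in a shell variable named options, as described above), the flag can simply be appended to the existing string. PETSc reads it from the command line at start-up, so the option itself should not require reconfiguring PETSc:

```shell
# Solver options as used in the original call.
options='-ksp_type cg -pc_type jacobi'

# Append the PETSc run-time option that attaches a debugger when an error occurs.
options="$options -on_error_attach_debugger"

# Print the option string that will be passed on to the parFEAP executable.
echo "options: $options"
```

The rest of the call ($petmpi -n $nproc ... $PARFEAP $options) stays unchanged.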

aleximos

  • Jr. Member
  • **
  • Posts: 34
Re: limited number of time steps
« Reply #4 on: April 30, 2014, 12:25:39 AM »
I changed the option as you recommended. When the error occurred, a window titled "gdb" appeared, telling me that I did not have the right permissions and that I should run the script as root. To do this, I had to add the FEAP and PETSc exports to root's ~/.bashrc.

Now the console output is

"
  PETSC ERROR: MPI error 15
  [0] PETSC ERROR: PETSC: Attaching gdb to /usr/FEAP/ver83/parfeap/feap of pid 683 on display :0.0 on machine ubu641304
  Warning: Tried to connect to session manager, None of the authentication protocols specified are supported"

I also copied and attached the output of the gdb window.

Please tell me what to do next. Is there any helpful information in these outputs?

Prof. S. Govindjee

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 1164
Re: limited number of time steps
« Reply #5 on: April 30, 2014, 12:35:56 AM »
You'll need to talk with your sys admin to have them help you set up the correct permissions.

aleximos

  • Jr. Member
  • **
  • Posts: 34
Re: limited number of time steps
« Reply #6 on: April 30, 2014, 12:49:02 AM »
Actually, I'm the admin (of this machine). The output above resulted from a parFEAP run under the root account.

FEAP_Admin

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 993
Re: limited number of time steps
« Reply #7 on: April 30, 2014, 11:23:12 PM »
It looks like gdb was attached upon error.  Now you need to use gdb to investigate what is going on.  Type commands like 'where' to see where the program was when it failed.  See the gdb manual for lots of information on debugging options.  Note that for this to be effective you have to make sure that you have built debugging versions (i.e., used the -g flag when you built FEAP, PETSc, etc.)
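
As a rough sketch of a first pass in gdb, the commands below could be collected in a small command file and loaded in the attached gdb window with "source". The file name feap_debug.gdb and its location are arbitrary choices for illustration; the commands themselves (where, info locals, up) are standard gdb commands.

```shell
# Write a few standard gdb commands to a command file.
# The file name and directory are assumptions made for this example.
cmdfile="${TMPDIR:-/tmp}/feap_debug.gdb"
cat > "$cmdfile" <<'EOF'
where
info locals
up
info locals
EOF

# In the attached gdb window, load the commands with:
#   source /tmp/feap_debug.gdb   (adjust the path if TMPDIR is set)
echo "wrote $cmdfile"
```

Here 'where' prints the call stack at the point of failure, and 'up' followed by 'info locals' shows the local variables one frame further up, which may help identify the MPI call that failed. Variable names and line numbers will only appear if the code was built with -g.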