Author Topic: Parallel running.  (Read 6898 times)

shenrilin

  • Full Member
  • ***
  • Posts: 67
Parallel running.
« on: August 24, 2017, 06:35:31 PM »
Dear FEAP team,

I'm solving a nonlinear problem with a nonsymmetric Jacobian matrix.

First, I used the direct solver superlu_dist with 16 or 32 processors. It works.
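For reference, the direct-solver run was launched roughly as follows (the solver flags are from memory, using the PETSc 3.5 spelling for the SuperLU_DIST interface):
Code: [Select]
mpirun -np 16 $FEAPHOME8_3/parfeap/feap -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist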

Then, I tried the iterative solver gmres with the pilut preconditioner:
Code: [Select]
mpirun -np 8 $FEAPHOME8_3/parfeap/feap -ksp_type gmres -pc_type hypre -pc_hypre_type pilut

It works for 1 or 2 processors. However, when I try more processors, it reports errors such as:

8 processors (the first time step finished):

Code: [Select]
Saving Parallel data to PUEX_000000.pvtu
Saving Parallel data to PUEX_000001.pvtu
*** An error occurred in MPI_Irecv
*** reported by process [2564947969,6]
*** on communicator MPI COMMUNICATOR 5 DUP FROM 3
*** MPI_ERR_REQUEST: invalid request
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

16 processors (the first time step did not finish):

Code: [Select]
Saving Parallel data to PUEX_000000.pvtu
** Error in `/rigel/free/users/rs3741/SourceCode/ShearBands/parfeap/feap': free(): invalid next size (fast): 0x0000000002391ae0 ***

Do you have any ideas on this issue?

Best regards.



Prof. S. Govindjee

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 1160
Re: Parallel running.
« Reply #1 on: August 24, 2017, 08:01:58 PM »
Can you run with the additional option -on_error_attach_debugger and then figure out where the error is coming from?  In particular, is this a FEAP error or a PETSc error?
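For example (a sketch, appending the option to your pilut command line):
Code: [Select]
mpirun -np 8 $FEAPHOME8_3/parfeap/feap -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -on_error_attach_debugger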

Can you also try using another solver to see what happens?  Maybe GAMG (-ksp_type cg -ksp_monitor -log_view -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1), etc.  It is important to narrow down the origin of the error in the code.
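That is, something along these lines (adjust the process count as appropriate):
Code: [Select]
mpirun -np 8 $FEAPHOME8_3/parfeap/feap -ksp_type cg -ksp_monitor -log_view -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1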

[I assume you are using UTANgent,,1 for your solution step in FEAP.]

shenrilin

  • Full Member
  • ***
  • Posts: 67
Re: Parallel running.
« Reply #2 on: August 25, 2017, 08:28:42 AM »
Dear Prof. Govindjee,

I followed your suggestions and got the following information (with 16 processors):

Code: [Select]
Saving Parallel data to PUEXsqubar_000000.pvtu
[0]PETSC ERROR: PetscTableFind() line 126 in /rigel/free/users/rs3741/SourceCode/petsc-3.5.4/include/petscctable.h key 28495 is greater than largest key allowed 20677
[1]PETSC ERROR: PetscTableFind() line 126 in /rigel/free/users/rs3741/SourceCode/petsc-3.5.4/include/petscctable.h key 26099 is greater than largest key allowed 20677
[5]PETSC ERROR: PetscTableFind() line 126 in /rigel/free/users/rs3741/SourceCode/petsc-3.5.4/include/petscctable.h key 33717 is greater than largest key allowed 20677
[4]PETSC ERROR: PetscTableFind() line 126 in /rigel/free/users/rs3741/SourceCode/petsc-3.5.4/include/petscctable.h key 36155 is greater than largest key allowed 20677
[7]PETSC ERROR: PetscTableFind() line 126 in /rigel/free/users/rs3741/SourceCode/petsc-3.5.4/include/petscctable.h key 37087 is greater than largest key allowed 20677
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[24028,1],1] (PID 25215)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[1]PETSC ERROR: PETSC: Attaching gdb to /rigel/free/users/rs3741/SourceCode/ShearBands/parfeap/feap of pid 25215 on display :0.0 on machine node037
[5]PETSC ERROR: PETSC: Attaching gdb to /rigel/free/users/rs3741/SourceCode/ShearBands/parfeap/feap of pid 25219 on display :0.0 on machine node037
[0]PETSC ERROR: [4]PETSC ERROR: PETSC: Attaching gdb to /rigel/free/users/rs3741/SourceCode/ShearBands/parfeap/feap of pid 25214 on display :0.0 on machine node037
PETSC: Attaching gdb to /rigel/free/users/rs3741/SourceCode/ShearBands/parfeap/feap of pid 25218 on display :0.0 on machine node037
xterm: Xt error: Can't open display: :0.0
xterm: DISPLAY is not set
xterm: Xt error: Can't open display: :0.0
xterm: DISPLAY is not set
xterm: Xt error: Can't open display: :0.0
xterm: DISPLAY is not set
xterm: Xt error: Can't open display: :0.0
xterm: DISPLAY is not set
xterm: Xt error: Can't open display: :0.0
xterm: DISPLAY is not set

I think it is PETSc that leads to the error. PETSc was installed in debug mode.
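(The xterm messages appear because the compute nodes have no X display, so the debugger windows could not open. If I rerun, adding noxterm should keep gdb in the terminal instead, assuming -on_error_attach_debugger accepts it the way -start_in_debugger does:)
Code: [Select]
mpirun -np 16 $FEAPHOME8_3/parfeap/feap -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -on_error_attach_debugger noxterm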

Best,
« Last Edit: August 25, 2017, 08:34:50 AM by shenrilin »

Prof. S. Govindjee

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 1160
Re: Parallel running.
« Reply #3 on: August 25, 2017, 08:45:11 AM »
I suggest you search the PETSc user forum archives for this error; it may already be known.  You can also post a help query to the PETSc users list <petsc-users@mcs.anl.gov> .

shenrilin

  • Full Member
  • ***
  • Posts: 67
Re: Parallel running.
« Reply #4 on: August 25, 2017, 01:40:16 PM »
Dear Professor,

Thanks very much.

Best,

shenrilin

  • Full Member
  • ***
  • Posts: 67
Re: Parallel running.
« Reply #5 on: August 26, 2017, 04:19:28 PM »
Quote from: Prof. S. Govindjee on August 25, 2017, 08:45:11 AM
I suggest you search the PETSc user forum archives for this error; it may already be known.  You can also post a help query to the PETSc users list <petsc-users@mcs.anl.gov> .

Dear Professor,

I want to post an update on this problem.

Originally, I wrote a user element with four DOFs per node: three displacements and one phase field. It did not work with 8 or more processors using GMRES.

Then I removed the phase-field DOF to form a new user element. Running this problem with 8 processors using GMRES works.

This is very strange. Is there any requirement to tell PETSc how many DOFs the user element has?

Best,


FEAP_Admin

  • Administrator
  • FEAP Guru
  • *****
  • Posts: 993
Re: Parallel running.
« Reply #6 on: August 27, 2017, 08:39:32 AM »
There should be no restrictions on the number of DOFs.  I would suggest trying a small problem and visually examining the tangent matrix to make sure that it is being assembled correctly.
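For a small test problem, you can also have PETSc print the assembled matrix for you; something along these lines should work (the viewer spelling below is my recollection of the PETSc 3.5 syntax, so check the manual pages):
Code: [Select]
mpirun -np 2 $FEAPHOME8_3/parfeap/feap -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -mat_view ::ascii_dense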