FEAP User Forum

FEAP => Parallel FEAP => Topic started by: Jacopo on December 07, 2022, 07:50:14 AM

Title: Parallel validation
Post by: Jacopo on December 07, 2022, 07:50:14 AM
Dear FEAP community,
I am facing issues verifying the scaling properties of a simple problem solved using parfeap. I am trying to reproduce the test in appendix D.1.2 of the Parallel User Manual, Version 8.6.
Calculations do converge, but the number of KSP iterations is unreasonably high. On top of that, the solution time reported in the PETSc log files is the same no matter how many processors I use.
I am using FEAP 8.6.1n and PETSc 3.13.2.
In the attachment, you can find the input files employed and the PETSc log files for two and four processors, respectively.
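For reference, the iteration counts and timings in those logs come from PETSc's standard monitoring and profiling output, along the lines of the following options (a minimal sketch; the exact flags in my runs may differ):
Code:
-ksp_monitor -ksp_converged_reason -log_view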
Any help from your side is greatly appreciated.
Jacopo
Title: Re: Parallel validation
Post by: Prof. S. Govindjee on December 08, 2022, 03:59:45 PM
I was not able to find the input for that test case from the manual (too many years ago).

Here is a remake of that problem:
Code:
feap
0 0 0 3 3 8

material
 solid
  elastic neoh 1000000 0.1

param
 n = 50

block
 cart n n n
 1 0 0 0
 2 1 0 0
 3 1 1 0
 4 0 1 0
 5 0 0 1
 6 1 0 1
 7 1 1 1
 8 0 1 1

eboun
 1 0 1 1 1

eforce
 1 1  1 1 1

end

batch
 graph node 2
 outd aij 1
end


stop
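
For anyone reproducing this, the intended workflow is the usual two-step parFEAP one: run serial FEAP on the input above so the graph/outd commands in the batch block partition the mesh and write one input file per process, then launch the parallel executable on the partitioned files. A minimal sketch of the second step, with the executable path as a placeholder and assuming the partition count given to graph matches the number of MPI processes:
Code:
mpirun -np 2 ./parfeap/feap -ksp_type cg -pc_type jacobi -log_view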

I did not have time to load the problem onto a proper machine, but I did run it on my laptop.  Here are the times for some of the runs.  Note that these are with a code compiled for debugging rather than for timing.

Code:
MacBook Pro (2.3 GHz 8-Core Intel Core i9), 32 GB
2 processes

Method                  KSP-Solve (s)   Total (s)
feaprun                 2.7134e+01      4.3643e+01                     
feaprun-gmres           7.3397e+01      9.0897e+01
feaprun-gamg            4.5849e+00      3.3808e+01
feaprun-gamg-2          1.2722e+01      4.5899e+01
feaprun-ml              1.6809e+01      3.8593e+01
feaprun-hypre-boomer    4.0171e+01      1.0388e+02

4 processes

Method                  KSP-Solve (s)   Total (s)
feaprun                 2.1360e+01      3.1046e+01
feaprun-gmres           5.7745e+01      6.7655e+01
feaprun-gamg            3.6294e+00      2.0613e+01
feaprun-gamg-2          9.2478e+00      2.8235e+01
feaprun-ml              1.1214e+01      2.3489e+01
feaprun-hypre-boomer    2.9710e+01      8.5120e+01
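
Each row in these tables is just a different PETSc Krylov solver / preconditioner combination, selected through the options passed at launch.  As a hedged sketch of the kind of option sets involved (the executable path is a placeholder, and the actual feaprun-* scripts may pass additional options):
Code:
mpirun -np 2 ./parfeap/feap -ksp_type cg    -pc_type jacobi -log_view
mpirun -np 2 ./parfeap/feap -ksp_type gmres -pc_type jacobi -log_view
mpirun -np 2 ./parfeap/feap -ksp_type cg    -pc_type gamg   -log_view
mpirun -np 2 ./parfeap/feap -ksp_type cg    -pc_type ml     -log_view
mpirun -np 2 ./parfeap/feap -ksp_type cg    -pc_type hypre -pc_hypre_type boomeramg -log_view
# (ml and hypre/boomeramg require a PETSc build configured with those external packages)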
Title: Re: Parallel validation
Post by: Prof. S. Govindjee on December 12, 2022, 04:57:45 PM
I did a slightly better test, but just using CG with the Jacobi preconditioner.  Speed-ups depend quite a bit on the architecture of your hardware.  Lots of systems are memory-bandwidth constrained, which is problematic for Ax=b type problems.  Notwithstanding that, on a machine with Intel Xeon Skylake 6130 processors @ 2.1 GHz, if I restrict to 4 MPI tasks per chip socket (even though there are 16 cores on the chip), then I see timings (in seconds) of
Code:
2 processes: 61.88
4 processes: 31.60
8 processes: 18.67
16 processes: 11.24
32 processes: 8.9
In the batch scheduler, I asked for 4 nodes, each with 2 chip sockets (so a total of 8 chip sockets and 128 cores), and indicated 4 cores per MPI task; a sketch of this kind of request, assuming SLURM, follows the timings below.  How the scheduler split the jobs up for the lower numbers of MPI tasks I do not know.  For the 32-task run, I assume it put 4 tasks on each chip.  To get better speed-up, which I think is possible, I will need to try some more parameter combinations with the scheduler.  Note that if I ask for just 1 compute node and use 1 core per MPI task, then the timings come out as
Code:
2 processes: 60.67
4 processes: 31.23
8 processes: 19.59
16 processes: 15.45
32 processes: 18.30
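
For reference, the kind of scheduler request I mean, assuming a SLURM-type system (a sketch only; directive names and the executable path are placeholders for whatever your cluster uses):
Code:
#!/bin/bash
#SBATCH --nodes=4            # 4 nodes x 2 sockets x 16 cores = 128 cores in total
#SBATCH --ntasks=32          # number of MPI tasks (2, 4, 8, 16, or 32 in the runs above)
#SBATCH --cpus-per-task=4    # 4 cores per MPI task, so tasks do not pile onto one socket
srun ./parfeap/feap -ksp_type cg -pc_type jacobi -log_view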
Title: Re: Parallel validation
Post by: Prof. S. Govindjee on December 12, 2022, 07:41:53 PM
I did two more tests.  With CG for the solver, Jacobi for the preconditioner, and 16 cores per MPI task, the problem gives
Code:
2 processes: 60.9
4 processes: 28.77
8 processes: 14.62
16 processes: 7.83
32 processes: 9.42

If I use GAMG for the preconditioner, again with 16 cores per MPI task, the problem gives
Code:
2 processes: 38.9
4 processes: 21.05
8 processes: 11.48
16 processes: 6.857
32 processes: 5.27

Overall, it looks like the code scales nearly ideally as long as one does not overload the memory buses.  At 32 processes one sees a slowdown because there is too much message passing (i.e., the problem is too small to benefit from that many processes).
Title: Re: Parallel validation
Post by: Jacopo on December 19, 2022, 02:45:28 AM
Dear Professor,
thanks for the thorough analysis, and sorry for the late reply. I am currently working with the batch scheduler to actually split the problem across different cores, since so far I had only been running multiple processes on a single one.

Jacopo