FEAP User Forum
FEAP => Parallel FEAP => Topic started by: Jacopo on December 07, 2022, 07:50:14 AM
-
Dear FEAP community,
I am facing issues verifying the scaling properties of a simple problem solved using parfeap. I am trying to reproduce the test in appendix D.1.2 of the Parallel User Manual, Version 8.6.
Calculations do converge, but the number of KSP iterations is unreasonably high. On top of that, the solution time reported in the PETSc log file is the same no matter how many processors I use.
I am using FEAP 8.6.1n and PETSc 3.13.2.
In the attachment you can find the input files employed and the PETSc log files for two and four processors, respectively.
Any help you can provide is greatly appreciated.
Jacopo
-
I was not able to find the test case from the manual (it was made too many years ago).
Here is a remake of that problem:
feap
0 0 0 3 3 8
material
solid
elastic neoh 1000000 0.1
param
n = 50
block
cart n n n
1 0 0 0
2 1 0 0
3 1 1 0
4 0 1 0
5 0 0 1
6 1 0 1
7 1 1 1
8 0 1 1
eboun
1 0 1 1 1
eforce
1 1 1 1 1
end
batch
graph node 2
outd aij 1
end
stop
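For completeness, the batch commands at the end of the deck do the partitioning in a serial pass: "graph node 2" builds the nodal graph and partitions the mesh for 2 processors (presumably changed to match the number of processors), and "outd" writes the partitioned input files, after which the parallel executable is launched on those files. A minimal sketch of the two passes follows; the executable name, the launch line, and the CG/Jacobi solver choice are illustrative assumptions on my part, not what the feaprun scripts actually set:
./feap                                                        # serial pass: process the deck above and write the partitioned inputs
mpirun -np 2 ./feap -ksp_type cg -pc_type jacobi -log_view    # parallel pass; -log_view prints per-event timings
The -ksp_type, -pc_type, and -log_view flags are standard PETSc command-line options; the KSPSolve event time reported by -log_view is presumably what the KSP-Solve column below refers to.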
I did not have time to load the problem onto a proper machine, but I did run it on my laptop. Here are the times for some of the runs. Note that this is with a code compiled for debugging rather than for timing.
MacBook Pro (2.3 GHz 8-core Intel Core i9), 32 GB RAM
2 processes
Method KSP-Solve (s) Total (s)
feaprun 2.7134e+01 4.3643e+01
feaprun-gmres 7.3397e+01 9.0897e+01
feaprun-gamg 4.5849e+00 3.3808e+01
feaprun-gamg-2 1.2722e+01 4.5899e+01
feaprun-ml 1.6809e+01 3.8593e+01
feaprun-hypre-boomer 4.0171e+01 1.0388e+02
4 processes
Method KSP-Solve (s) Total (s)
feaprun 2.1360e+01 3.1046e+01
feaprun-gmres 5.7745e+01 6.7655e+01
feaprun-gamg 3.6294e+00 2.0613e+01
feaprun-gamg-2 9.2478e+00 2.8235e+01
feaprun-ml 1.1214e+01 2.3489e+01
feaprun-hypre-boomer 2.9710e+01 8.5120e+01
-
I did a slightly better test, but just using CG with the Jacobi preconditioner. Speed-ups depend quite a bit on the architecture of your hardware. Lots of systems are memory-bandwidth constrained, and this is problematic for Ax=b type problems. Notwithstanding, on a machine with Intel Xeon Skylake 6130 @ 2.1 GHz, if I restrict to 4 tasks per chip socket (even though there are 16 cores on the chip), then I see the following timings:
2 processes: 61.88
4 processes: 31.60
8 processes: 18.67
16 processes: 11.24
32 processes: 8.9
In the batch scheduler I asked for 4 nodes, each with 2 chip sockets (so a total of 8 chip sockets and 128 cores), and indicated 4 cores per MPI task. How the scheduler split the jobs up for the lower numbers of MPI tasks I do not know. For the 32-task run, I assume it put 4 tasks on each chip. To get better speed-up, which I think is possible, I will need to try some more parameter combinations with the scheduler. Note that if I ask for just 1 compute node and use 1 core per MPI task, then the timings come out as
2 processes: 60.67
4 processes: 31.23
8 processes: 19.59
16 processes: 15.45
32 processes: 18.30
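For reference, that kind of resource request might look like the batch script below, assuming a SLURM scheduler; the scheduler itself, the option values, and the launch line are assumptions for illustration, not taken from the actual runs:
#!/bin/bash
#SBATCH --nodes=4                # 4 nodes, each with 2 chip sockets
#SBATCH --ntasks=32              # total number of MPI tasks
#SBATCH --ntasks-per-socket=4    # at most 4 MPI tasks on each socket
#SBATCH --cpus-per-task=4        # reserve 4 cores per MPI task
srun ./feap -ksp_type cg -pc_type jacobi -log_view
Limiting the number of tasks per socket is what keeps the memory buses from being oversubscribed.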
-
I did two more tests. With CG and Jacobi for the preconditioner and 16 cores per MPI task, the problem gives
2 processes: 60.9
4 processes: 28.77
8 processes: 14.62
16 processes: 7.83
32 processes: 9.42
If I use GAMG for the preconditioner and 16 cores per MPI task, the problem gives
2 processes: 38.9
4 processes: 21.05
8 processes: 11.48
16 processes: 6.857
32 processes: 5.27
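Switching the preconditioner from Jacobi to GAMG is just a change of PETSc options in the run command; something like the line below, where the launcher and executable name are again only illustrative assumptions:
mpirun -np 16 ./feap -ksp_type cg -pc_type gamg -log_view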
Overall, it looks like the code scales nearly ideally as long as one does not overload the memory buses. At 32 processes one sees a slowdown because there is too much message passing (i.e., the problem is too small to benefit from that many processes).
-
Dear Professor,
thank you for the thorough analysis, and sorry for the late reply. I am currently working on the batch scheduler so that the problem is actually split across different cores, since so far I have only been running multiple processes on a single one.
Jacopo