FEAP User Forum
FEAP => Parallel FEAP => Topic started by: arktik on April 04, 2021, 06:01:57 AM
-
Dear FEAP Admin,
The convergence rate of parallel FEAP v8.6 (8.6.1i) seems to be suboptimal when compared to v8.5 (8.5.2i). I checked with different boundary value problems (purely mechanical). Here is what I found out:
- With PETSc OFF (without partitioning), both versions give identical convergence rates.
- With PETSc ON (with partitioning), v8.6 gives a lower convergence rate -- or a much higher residual norm for the same number of iterations. v8.5 is not affected.
- The solution accuracy is not affected in either case.
The check was performed with the original source code, without compiling any user-defined modifications. For your reference, I am attaching the test examples used for the above conclusions. Please let us know what is happening and how it can be resolved.
For one of the more complex problems (~2000 material tags, not attached in the zip directory), v8.6 diverged to NaN, whereas v8.5 gave the expected results.
Additional Info:
v8.5 is installed with the following major dependencies a) GCC-7.3.1 b) OpenMPI-3.1.1 c) PETSc-3.11.1
v8.6 is installed with the following major dependencies a) GCC-7.3.1 b) OpenMPI-4.0.4 c) PETSc-3.13.2
Sincerely
-
Thanks for the sample files. We will have a look.
However, one quick question: do you know if the partitionings are different between 8.5 and 8.6?
-
Thank you Prof. Govindjee for the quick response. I am not really sure I understand what you mean by "if the partitioning is different". Are you referring to this topic http://feap.berkeley.edu/forum/index.php?topic=2436.0? That has been taken care of while performing the above test.
In the tested examples, the partitioning is done with
GRAPh NODE <nproc>
OUTDomains AIJ 1
as shown in each example input file (and explained in section 1.3.1 of parmanual_86.pdf). I think this should lead to identical partitions for both versions(?).
-
Yes, that is precisely the question. Do you know if the partitioned input files contain the same distribution of nodes?
-
I checked the output files. The partitioning is identical, i.e. the number of nodal points and the number of elements generated in each of the partitioned files is the same for both versions (for all examples).
-
Thank you for that diagnostic. If the partitionings are the same then it is hard to imagine why the convergence behavior is so different.
Are you sure that both versions have been compiled with the same options (debugging or not; and same level of optimization)?
If those points are the same, then can you post the output of the petsc log and ksp monitor (for one of your examples)?
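If needed, these are typically produced with standard PETSc runtime options, e.g. on the command line or in a PETSc options file (exact usage depends on how your runs are set up):
   -ksp_monitor
   -ksp_converged_reason
   -log_view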
-
I have attached the KSP monitor outcome and the log view for both versions, but only for ex2. I am a bit baffled: nothing is really apparent from these PETSc-generated reports, as they are almost identical. Both versions were compiled with similar options/flags (as seen in the log views).
-
Your KSP monitor logs look to be about the same (as do your overall PETSc logs). I was under the impression that you were seeing different iterative convergence with the KSP solver?
-
Sorry for the confusion. By slower rate of convergence, I meant the values printed by FEAP in its own log files, e.g. Lxxx_0001 and so on. The KSP monitor and PETSc log files apparently show similar behavior for both versions.
-
Ok. Thanks for the clarification. Can you post the L-files for your ver85 and ver86 runs?
-
For each of the example files above, the L-files for v85 and v86 are attached. In the meantime, I also tested whether the different OpenMPI and PETSc versions used in the testing above could be the source of the problem. That is not the case: with v85 and v86 compiled against identical PETSc and OpenMPI versions, the problem still exists.
-
Thanks.
I'll have a deeper look.
-
The number of Newton iterations is the same in your examples.
The only difference that I have recognized is that the final residual norm with FEAP86 is always larger than with FEAP85.
But convergence was accepted by FEAP based on the energy norm, which was small enough with both versions.
It would be interesting to know why FEAP85 achieved a better residual norm in all three examples at all steps.
However, the convergence behaviour looks OK to me.
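For reference (sketching the test from the user manual, so please double-check there): FEAP accepts convergence of the Newton loop when the energy increment E_i = |du_i . R_i| drops below tol times its value in the first iteration of the step, i.e. E_i <= tol * E_1. That is why both versions can pass the energy test while showing different residual norms.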
-
If one looks at the energy norms from the two codes they are essentially identical sometimes even to 15 digits. The issue with the residual norm appears to be related to some changes that have been made to the serial code which are not quite correctly implemented in the parallel code. A patch is being developed.
-
Thanks Prof. Govindjee for the assessment. Yes, I also noticed that the energy norm was almost identical. As I mentioned, for a more complex problem (the tested examples were very simplified cases), ver86 simply fails to converge. I assume it is not just an incorrect calculation of the residual norm, but that the tangent modulus itself is also involved? Please let me know if you want the more complicated test case with ~2000 material tags (> 6x10^6 DOFs) as well.
-
Dear FEAP Admin,
I have tested the new update 8.6.1j. The slow convergence problems reported above (in comparison with v8.5) seem to be resolved. Thanks for the support.
However, there seems to be a problem with the log files in parallel. Except for the first one (Lxxx_0001), the log files from the other CPUs (e.g. Lxxx_0002 ... Lxxx_000n) have no output (the logs for the solution steps are missing).
-
Thanks. I have reproduced the 'bug'; though I will note the files should be the same -- except for maybe the timing values on the end of each line.
-
Actually this is a 'feature' to save disk I/O since the information was the same in all the files. In pmacr2.f, there is now a check that rank.eq.0 before writing to the Log file.
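For anyone who wants the same behaviour in their own output routines, the pattern is just a rank guard around the write. A minimal standalone sketch (not the actual pmacr2.f code; the unit number and file name are made up):
      program logdemo
!     Minimal sketch of rank-0-only logging (not FEAP source);
!     compile with mpif90 and run under mpirun.
      implicit none
      include 'mpif.h'
      integer rank, ierr
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      if(rank.eq.0) then
        open(unit=11, file='Lexample_0001', status='unknown')
        write(11,'(a)') ' solution step log written by rank 0 only'
        close(11)
      endif
      call MPI_Finalize(ierr)
      end program logdemo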
-
I totally agree that the other L-files need not be fully printed. Their size gets really large, e.g. for transient explicit problems on hundreds of CPUs. I did not know that this is now an intended feature. Thanks!
-
In the meantime, I also checked the new version 8.6.1j with a more complicated problem (as mentioned in the first post of this topic). This problem has ~2000 material tags (each one an orthotropic crystal with a unique orientation) and ~6 million DOFs. Unfortunately, the latest update also does not converge. As I said, the previous version (v8.5.2i) works correctly. The following error message is shown:
NO CONVERGENCE REASON: Iterations exceeded
Only FEAP standard features are used to solve this problem. I suspect the simple checks performed previously were not enough to trigger a possible bug.
-
Hmmm...this is going to be challenging to debug with 6M dof.
As I understand, you now have 8.5.2i and 8.6.1j using the same version of openmpi and petsc.
(1) Can you (re)post your PETSc logs and the L-files from both versions for your 6M dof test case?
(2) Also which FEAP material model are you using?
(3) Do you know if the results are the same up to the time step before the problem arises?
-
8.5.2i and 8.6.1j are not installed with the same version of OpenMPI and PETSc (as mentioned in the first post). It is hard to say if that plays any role, since 8.6.1j exhibits no other issues when compared to the 8.5.2i installation. Regarding the current problem:
- With v8.5, the KSP residual norm reaches the tolerance within the maximum number of iterations (PETSc default 10000). With v8.6, the KSP residual norm does not reach the tolerance after 10000 iterations (standard PETSc options for the iteration limit and tolerances are sketched after this list)
- To make it easier for v8.6, the default PETSc tolerances are increased, e.g. 1e-16 for energy and 1e-8 for residual. Even then, v8.6 does not converge
- Each grain is supposed to be orthotropic elastic-plastic (Hill plasticity). However, for comparison I reduced the complexity to isotropic elasticity. That means we are effectively solving a linear elastic domain with ~2000 repetitive material tags
- Since v8.6 shows no convergence right from the start, the L-file is empty. However, I have attached the O-file from the first processor and the PETSc log for both versions. Only a single time increment is performed for testing
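As mentioned in the first point, the KSP iteration limit and tolerances can be overridden with standard PETSc runtime options, for example
   -ksp_max_it 20000
   -ksp_rtol 1.0e-9
   -ksp_atol 1.0e-50
(illustrative values only; whether FEAP's own tolerance settings also need adjusting is a separate question).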
-
Looking at the information you have provided, I see that the solution is getting off to a bad start without even trying a solution.
You can see that the very first residual computed by the two programs is different, and this should not be the case -- though I will point out that the technical details of how those residuals are computed differ between 8.5 and 8.6. Most likely the tangents are different too.
To help focus in, it will be helpful to know:
(1) the exact lines you are using to partition your equations
(2) what happens if you just use one material and not 2000.
(3) can you start serial FEAP on your problem and run FORM to get the expected residual? Note that this expected residual will be the residual one expects in parallel if you have output the parallel files using OUTD,AIJ and not OUTD,AIJ,1.
-
My recent troubleshooting didn't give any conclusive results. To your points,
1. Partition is done with
BATCh
GRAPh NODE ncpu
OUTDomain aij 1 3
END
I used non-flat as well as flat files. Both give a diverged solution.
2. I created a block mesh (roughly the same DOF) with one material tag; the solution converges and yields the correct solution! The original problem uses the INCLude statement to read the discretized geometry (coordinates and elements) of 2000 grains. Maybe something is happening here in the new release?
3. The problem is so large that serial FEAP throws a memory error (also with DIREct SPARse).
-
In your material specification, are all the materials different, or only their parameters? Do they use history variables? It seems that changes from 8.5 to 8.6 may have something to do with the number of material sets.
Did you try CG with the serial version?
-
All materials are the same (e.g. testing was done with isotropic elasticity), only with different material constants. However, the simplified version (linear isotropic elasticity, no history) does not work either for a problem with ca. 2000 'elastic isotropic' grains having 2x10^ DOF.
I also ran serial FEAP (v8.6.1j) with ITERATIOn BPCG. It helped a bit: instead of the 'Diverged due to Indefinite Matrix' or 'NO Convergence: Iterations exceeded' messages, BPCG shows convergence, although a very slow one (e.g. 10 Newton iterations achieve R_norm ~ 1E-5).
I also found that another BVP, which uses just one material tag (a user element) and has ca. 10x10^6 DOF, does not converge with v8.6.1j. The same BVP and user element work without trouble with v8.5.2i.
-
Just out of curiosity, what happens if you run the command CHECk?
The mystery is still why 8.5 works and 8.6 does not.
-
CHECk shows no red flags. For your information, here is the output of the test problem:
  Restrained Degree-of-Freedom Check
       DOF      Fixed
    ------------------
         1     816827
         2     807531
         3     808627
    ------------------
  On 2000376 Nodes
A brief summary of the findings so far:
1. The new release v8.6.1j definitely improved convergence in parallel FEAP compared to v8.6.1h.
2. The conclusion in (1) is based on testing simple problems using the standard library with DOF < 1E4.
3. More complex problems (DOF > 1E6, mesh from INCLude files and/or material tags > 1000) fail to converge.
4. However, problems with a mesh generated from BLOCk with DOF > 1E6 run without problems.
Overall, I suspect some issue with using the INCLude statement to import coordinates and elements (created for FEAP by third-party programs). But I may be totally wrong!
-
The only problem with an import is if the numbers have too many digits, more than 15.
When you did CHECk, were you using a FEAP element or one of yours? I think CHECk would catch problems caused by an include.
-
The CHECk was performed with the standard library element. The numbers in the imported mesh have fewer than 15 digits.
-
One thought is that there is a memory overwrite error someplace that is being exposed by the large number of includes or the large number of materials.
It would be good to first isolate which is the cause, materials or includes.
Then, as painful as it may be, I would run the code with valgrind to check for something getting clobbered. The fact that this does not appear in 8.5 could be a fluke of how the memory blocks are being assigned.
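For example, a typical invocation for the parallel executable might look like (executable name and process count assumed; adjust to your build):
   mpirun -np 4 valgrind --track-origins=yes --log-file=vg.%p.log ./feap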
-
If the issue is caused by a memory violation, then this can perhaps be tested with valgrind on a constrained problem.
All element and material routines would still be executed (including the memory bug), but the large system of equations can be avoided by fixing all DOFs via boundary conditions.
-
Hard to say if this suggestion will work, since the memory corruption could be in the tangent's memory...
-
My ongoing troubleshooting with parallel v8.6.1j has shed some new light on the possible origins of the bug. I have not yet incorporated debugging with valgrind, so far it has been more of a mechanistic debugging effort :-\
1. Two identical BVPs (DOF > 1E6) are tested: one where the mesh is generated with FEAP BLOCk and a second where the mesh is imported with INCLude from third-party programs. Both work correctly with the standard library as well as the user element. Therefore, a potential bug in INCLude can be ruled out.
2. The partitioning done with v8.5.2 is slightly different from v8.6.1. E.g. v8.5.2 prints EREGions in the partitioned files, which is missing in v8.6.1.
3. A series of identical BVPs with an increasing number of grains (= material tags) is performed. When the number of grains (material tags) > 999, the user element (solid3d with ndf=4) throws the following error:
Material Number1000: Element Type: user : ELMT =***
Element Material Set = 1
*ERROR* ELMLIB: Element: 0, type number*** input, isw = 1
RANK = 0
The FEAP standard library element (3d elastic orthotropic) does not throw this error but simply stops converging.
4. For grains (material tags) <= 999, the standard library and user elements work correctly, no matter how many DOF.
Somehow the argument jel passed to program/elmlib.f seems to be corrupted when material tags > 999. One possible explanation as to how this corruption leads to divergence is that elements with material tags > 999 get garbage properties. I have not tested this yet.
-
If you look in the 8.6 partition files you should find EREGions defined (just not in all of them). I'll have a look at the jel issue.
-
You should change format statement 4000 in elmlib.f, i3 --> i5 so we can see what is actually in jel.
Note, jel should fit in an i3 format. When using user elements, there are only up to 50 user elements allowed, elmt01 through elmt50. FEAP's internal elements have negative numbers for jel.
The other thing to note is that the problem in point 3 is occurring early on: isw is equal to 1, so you are still in the input stage.
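To see the display issue in isolation, a tiny standalone program (not FEAP source; the value is made up) shows why an i3 field can only print ***:
      program fmtdemo
!     An i3 descriptor prints *** for any value wider than
!     3 digits, while i5 shows the actual number.
      implicit none
      integer jel
      jel = 1000
      write(*,'(a,i3)') ' jel in i3: ', jel
      write(*,'(a,i5)') ' jel in i5: ', jel
      end program fmtdemo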
-
Can you post a sample of what the material cards look like in the partitioned files, for material 1000 or higher?
-
Another question: How many parameters are you saving into d( ) or ud( ) in your user element?
-
Partial good news. I have created a small problem with a 32x32 mesh and 1024 material sets that fails in the way you are seeing when using FEAP elements (i.e. no convergence). This will help with debugging from our side.
-
For the record, here are the files. Serial files and a 4-partitioning of them.
-
I have also been able to show that this is not an issue of partitioning. If I make a single-partition (parallel) input file, it also fails, and using a direct solver I get a zero pivot error.
-
I see the problem :(
The format statement for writing out the material cards to the parallel files (also for creating flat files from serial FEAP) is incorrect. When you have large numbers of materials there is not enough room and two fields get jammed together.
I will work up a patch later today.
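As a standalone illustration of the jam (made-up values, not FEAP source), compare the old descriptor with a widened one:
      program jamdemo
!     With 15i4 the integers 1 and 1000 print with no separating
!     blank and re-read in free format as the single number 11000;
!     with 15i5 they stay distinct.
      implicit none
      write(*,2008) 'user           ', 1, 1000, 1, 2, 3
      write(*,2009) 'user           ', 1, 1000, 1, 2, 3
 2008 format(2x,a15,1x,15i4:/(16i4))
 2009 format(2x,a15,1x,15i5:/(16i4))
      end program jamdemo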
-
Try this.
Edit program/pmatin.f and change format statement 2008 from
2008 format(2x,a15,1x,15i4:/(16i4))
to
2008 format(2x,a15,1x,15i5:/(16i4))
This is a quick fix (I hope).
Note you will need to rebuild the FEAP archive and then rebuild your serial and parallel feap executables.
-
Thank you Prof. Govindjee for the prompt confirmation and the solution :). I will test it tomorrow.
By the way, the input cards (printed in the partitioned input files) for the FEAP element (material number 1000) are:
MATErial 1000
SOLId 01000 1 2 3
ELAStic MODUli 6
1.76000e+05 9.10000e+04 6.80000e+04 0.00000e+00 0.00000e+00 0.00000e+00
9.10000e+04 1.75000e+05 6.80000e+04 0.00000e+00 0.00000e+00 0.00000e+00
6.80000e+04 6.80000e+04 2.20000e+05 0.00000e+00 0.00000e+00 0.00000e+00
0.00000e+00 0.00000e+00 0.00000e+00 8.50000e+04 0.00000e+00 0.00000e+00
0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 7.20000e+04 0.00000e+00
0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 7.20000e+04
VECTOr ORTHotropic 4.76997e-01 8.42490e-01 2.50369e-01 -3.85284e-01 -5.55979e-02 9.21122e-01
and similarly for the user element:
MATErial 1000
user 11000 1 2 3
ka,mu,sy,ac,ee,ld,et,hh,sn,om,cr
I noticed the jammed material tag number.
-
In version 8.5 we were not writing this information out in this way and that is probably why it was working in the older version.
-
Quick Update: I made the change in program/pmatin.f and reran different variants of the BVP with material tags > 1000. The problem seems to be resolved. I finally have quadratic convergence. Thank you very much for all the support.