FEAP User Forum

FEAP => Parallel FEAP => Topic started by: arktik on April 04, 2021, 06:01:57 AM

Title: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 04, 2021, 06:01:57 AM
Dear FEAP Admin,

The convergence rate of parallel FEAP v8.6 (8.6.1i) seems to be suboptimal compared to v8.5 (8.5.2i). I checked with several boundary value problems (purely mechanical). Here is what I found out:
The checks were performed with the original source code, without compiling any user-defined modifications. For your reference, I am attaching the test examples used for the above conclusions. Please let us know what is happening and how it can be resolved.

For one of the more complex problems (~2000 material tags), not attached in the zip directory, v8.6 diverged to NaN, whereas v8.5 gave the expected results.

Additional Info:

v8.5 is installed with the following major dependencies: a) GCC-7.3.1, b) OpenMPI-3.1.1, c) PETSc-3.11.1
v8.6 is installed with the following major dependencies: a) GCC-7.3.1, b) OpenMPI-4.0.4, c) PETSc-3.13.2


Sincerely
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 05, 2021, 12:46:33 AM
Thanks for the sample files.  We will have a look.

However, one quick question.  Do you know if the partitionings are different between 8.5 and 8.6?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 05, 2021, 01:28:38 AM
Thank you Prof. Govindjee for the quick response. I am not really sure I understood what you mean by "if the partitioning is different".  Are you referring to this topic: http://feap.berkeley.edu/forum/index.php?topic=2436.0 ? That has been taken care of while performing the above test.

In the tested examples, the partitioning is done with
Code:
GRAPh NODE <nproc>
OUTDomains AIJ 1

as shown in each example input file (and explained in section 1.3.1 of parmanual_86.pdf). I think this should lead to identical partitions for both versions(?).
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 05, 2021, 01:54:09 AM
Yes, that is precisely the question.  Do you know if the partitioned input files contain the same distribution of nodes?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 05, 2021, 06:58:03 AM
I checked the output files. The partitioning is identical, i.e. the number of nodal points and the number of elements generated in each of the partitioned files are the same for both versions (for all examples).
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 05, 2021, 11:48:23 AM
Thank you for that diagnostic.  If the partitionings are the same, then it is hard to imagine why the convergence behavior is so different.
Are you sure that both versions have been compiled with the same options (debugging or not, and the same level of optimization)?

If those points are the same, then can you post the output of the petsc log and ksp monitor (for one of your examples)?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 06, 2021, 07:43:50 AM
I have attached the KSP monitor output and the log view for both versions, for ex2 only. I am a bit baffled: nothing is really apparent from these petsc-generated reports, as they are almost identical. Both versions were compiled with similar options/flags (as seen in the log views).
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 06, 2021, 05:14:38 PM
Your KSP monitor logs look to be about the same (as do your overall PETSc logs).  I was under the impression that you were seeing different iterative convergence with the KSP solver?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 06, 2021, 11:31:07 PM
Sorry for the confusion. By slower rate of convergence, I meant the values printed by FEAP in its own log files, e.g. those starting with Lxxx_0001 and so on. The KSP monitor and petsc log files apparently show similar behavior for both versions.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 07, 2021, 12:51:10 PM
Ok.  Thanks for the clarification.  Can you post the L-files for your ver85 and ver86 runs?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 07, 2021, 11:52:23 PM
For each of the example files above, the L-files for v85 and v86 are attached. In the meantime, I also tested whether the different openmpi and petsc versions used in the testing above could be the source of the problem. However, that is not the case: the problem still exists for v85 and v86 compiled with identical petsc and openmpi versions.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 08, 2021, 12:38:18 AM
Thanks.
I'll have a deeper look.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: JStorm on April 08, 2021, 06:19:38 AM
The number of Newton iterations is the same in your examples.
The only difference that I have recognized is that the final residuum norm with FEAP86 is always larger than with FEAP85.
But convergence was accepted by FEAP based on the energy norm, which was small enough with both versions.

It would be interesting to know why FEAP85 achieved a better residuum norm in all three examples at all steps.
However, the convergence behaviour looks ok to me.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on April 08, 2021, 01:56:41 PM
If one looks at the energy norms from the two codes, they are essentially identical, sometimes even to 15 digits.  The issue with the residual norm appears to be related to some changes that were made to the serial code and are not quite correctly implemented in the parallel code.  A patch is being developed.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on April 08, 2021, 10:12:56 PM
Thanks Prof. Govindjee for the assessment. Yes, I also noticed that the energy norm was almost identical. As I mentioned, for a more complex problem (the tested examples were very simplified cases), ver86 simply fails to converge. I am assuming it is not just the incorrect calculation of the residuum norm, but that the tangent modulus itself is involved?! Please let me know if you want the more complicated test case with ~2000 material tags (> 6x10^6 DOFs) as well.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 13, 2021, 08:57:29 AM
Dear FEAP Admin,

I have tested the new update 8.6.1j. The slow convergence problems reported above (in comparison with v8.5) seem to be resolved. Thanks for the support.

However, there seems to be a problem with the log files in parallel. Except for the first one, Lxxx_0001, the log files from the other CPUs (e.g. Lxxx_0002 ... Lxxx_000n) have no output (the logs for the solution steps are missing).
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 13, 2021, 03:22:26 PM
Thanks.  I have reproduced the 'bug'; though I will note the files should be the same -- except for maybe the timing values at the end of each line.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 13, 2021, 03:28:49 PM
Actually, this is a 'feature' to save disk I/O, since the information was the same in all the files.  In pmacr2.f, there is now a check that rank.eq.0 before writing to the Log file.
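For orientation, the guard has this general shape (a hypothetical sketch only; the actual variable names, unit number, and format in pmacr2.f differ):
Code:
c     hypothetical sketch of the rank guard -- not the actual pmacr2.f code
      if(rank.eq.0) then
c       only the rank-0 process writes the solution-step log entry
        write(ilg,2000) step, rnorm, enorm
      endif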
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 14, 2021, 03:00:35 AM
I totally agree that the other L-files need not be fully printed. Their size gets really large, e.g. for transient explicit problems on 100s of CPUs. I did not know that this is now an intended feature. Thanks!
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 16, 2021, 03:40:14 AM
In the meantime, I also checked the new version 8.6.1j with a more complicated problem (as mentioned in the first post in this topic). This problem has ~2000 material tags (each an orthotropic crystal with a unique orientation) and ~6 million DOFs. Unfortunately, the latest update also fails to converge. As I said, the previous version (v8.5.2i) works correctly.  The following error message is shown:
Code:
NO CONVERGENCE REASON:  Iterations exceeded
Only FEAP standard features are used to solve this problem. I suspect the simple checks performed previously were not enough to trigger a possible bug.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 16, 2021, 12:23:38 PM
Hmmm...this is going to be challenging to debug with 6M dof.

As I understand it, you now have 8.5.2i and 8.6.1j using the same version of openmpi and petsc.

(1) Can you (re)post your petsc logs and the L-files from both versions with your 6M dof test case?

(2) Also, which FEAP material model are you using?

(3) Do you know if the results are the same up to the time step before the problem arises?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 17, 2021, 03:55:52 AM
8.5.2i and 8.6.1j are not installed with the same versions of openmpi and petsc (as mentioned in the first post). It's hard to say if that plays any role, since the 8.6.1j installation exhibits no other issues compared to the 8.5.2i installation. Regarding the current problem.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 17, 2021, 04:37:21 PM
Looking at the information you have provided, I see that the solution is getting off to a bad start before a solve is even attempted.

You can see that the very first residual computed by the two programs is different, and this should not be the case -- though I will point out that the technical details of how those residuals are computed differ between 8.5 and 8.6.  Most likely the tangents are different too.

To help focus in, it will be helpful to know:

(1)  the exact lines you are using to partition your equations

(2)  what happens if you just use one material and not 2000.

(3)  can you start serial FEAP on your problem and run FORM to get the expected residual?  Note that this expected residual will be the one expected in parallel if you have output the parallel files using OUTD,AIJ and not OUTD,AIJ,1.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 21, 2021, 01:58:19 PM
My recent troubleshooting didn't give any conclusive results. To your points:

1. Partitioning is done with:
Code:
BATCh
 GRAPh NODE ncpu
 OUTDomain aij 1 3
END
I used a non-flat as well as a flat file. Both give a diverged solution.

2. I created a block mesh (roughly the same DOF) with one material tag; the solution converges and yields the correct solution! The original problem uses the INCLude statement to get the discretized geometry (coordinates and elements) of 2000 grains. Maybe something is happening here in the new release?

3. The problem is so large that serial FEAP throws a memory error (also with DIREct SPARse).

Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. R.L. Taylor on June 21, 2021, 02:14:14 PM
In your material specification, are all the materials different, or only their parameters?  Do they use history variables? It seems that changes from 8.5 to 8.6 may have something to do with the number of material sets.

Did you try CG with the serial version?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 22, 2021, 05:07:27 AM
All materials are the same (e.g. testing done with elastic isotropic), just with different material constants. However, the simplified version (linear elastic isotropic - no history) does not work either for a problem with ca. 2000 'elastic isotropic' grains having 2x10^ DOF.

I also ran serial FEAP (v8.6.1j) with ITERATIOn BPCG. It helped a bit: instead of the 'Diverged due to Indefinite Matrix' or 'NO Convergence: ITERATIONS exceeded' messages, BPCG shows convergence, although very slowly (e.g. 10 Newton iterations achieve R_norm ~ 1E-5). See the commands below.
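For reference, the serial solution commands were of this form (the loop count is illustrative):
Code:
BATCh
  ITERation BPCG   ! preconditioned CG instead of the direct solver
  LOOP,,10         ! Newton loop
    TANGent,,1     ! form tangent, solve and update
  NEXT
END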

I found out that another BVP, which uses just one material tag (a user element) with ca. 10x10^6 DOF, does not converge with v8.6.1j. The same BVP and user element work without trouble with v8.5.2i.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. R.L. Taylor on June 22, 2021, 07:23:30 AM
Just out of curiosity, what happens if you run the command CHECk?
The mystery is still why 8.5 works and 8.6 does not.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 22, 2021, 08:18:23 AM
CHECk shows no red flags. For your information, here is the output for the test problem:
Code:
Restrained Degree-of-Freedom Check
               DOF   Fixed
          ----------------
                 1  816827
                 2  807531
                 3  808627
          ----------------
                On 2000376 Nodes

A brief summary of findings so far:
Overall, I suspect some issue with using the INCLude statement to import coordinates and elements (created for FEAP by third-party programs). But I may be totally wrong!
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. R.L. Taylor on June 22, 2021, 09:56:43 AM
The only problem with import is if there are too many digits in the numbers, more than 15.
When you did CHECk, were you using a FEAP element or one of yours?  I think CHECk would catch problems caused by an include.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 22, 2021, 10:16:46 AM
The CHECk was performed with the standard library element. The numbers in the imported mesh have fewer than 15 digits.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 22, 2021, 11:19:59 AM
One thought is that there is a memory overwrite error someplace that is being exposed by the large number of includes or the large number of materials.

It would be good to first isolate which is the cause, materials or includes.

Then, as painful as it may be, I would run the code with valgrind to look for something getting clobbered.  The fact that this does not appear in 8.5 could be a fluke of how the memory blocks are being assigned.
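For reference, a run under valgrind might look something like this (the process count, executable name, and log-file pattern are all illustrative; %p expands to each process's pid, so every rank gets its own log):
Code:
mpirun -np 4 valgrind --track-origins=yes --log-file=vg.%p.log ./feap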
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: JStorm on June 22, 2021, 12:59:25 PM
If the issue is caused by a memory violation, then this can perhaps be tested via valgrind on a constrained problem.
All element and material routines would still be executed (including the memory bug), but the large system of equations can be avoided by fixing all DOFs via boundary conditions, as sketched below.
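A sketch of the idea in FEAP mesh input, assuming ndf = 3 (the node count and the generation idiom are illustrative; check the BOUNdary generation rules in the FEAP mesh manual):
Code:
BOUNdary           ! restrain all dofs of all nodes
  1       1 1 1 1  ! node 1, generate in increments of 1 ...
  2000376 0 1 1 1  ! ... through the last node, all dofs fixed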
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 22, 2021, 04:00:42 PM
Hard to say if this suggestion will work, since the memory corruption could be in the tangent's memory...
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 24, 2021, 08:46:42 AM
My ongoing troubleshooting with parallel v8.6.1j has shed some new light on the possible origins of the bug. I haven't yet incorporated debugging with valgrind; so far it's been more mechanistic debugging  :-\

1. Two identical BVPs (DOF > 1E6) were tested: one where the mesh is generated with FEAP's BLOCk and a second where the mesh is imported with INCLude using 3rd-party programs. Both work correctly with the standard library as well as the user element. Therefore, a potential bug in INCLude can be ruled out.

2. The partitioning done with v8.5.2 is slightly different from v8.6.1. E.g. v8.5.2 prints EREGions in the partitioned files, which are missing in v8.6.1.

3. A series of identical BVPs with an increasing number of grains (= material tags) was run. When grains (material tags) > 999, the user element (solid3d with ndf=4) throws the following error:
Code:
     Material Number1000: Element Type: user           : ELMT =***
     Element Material Set =   1
  *ERROR* ELMLIB: Element:     0, type number*** input, isw =  1
 RANK =   0
The FEAP standard library (3d elastic orthotropic) does not throw this error but simply stops converging.

4. For grains (material tags) <= 999, the standard library and user elements work correctly, no matter how many DOF.

Somehow the argument jel to program/elmlib.f seems to be corrupted when material tags > 999. One possible explanation for how this corruption leads to divergence is that elements with material tags > 999 get garbage properties. This I haven't tested yet.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 12:25:46 PM
If you look in the 8.6 partition files you should find EREGions defined (just not in all of them).  I'll have a look at the jel issue.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 12:38:09 PM
You should change format statement 4000 in elmlib.f, i3 --> i5, so we can see what is actually in jel.
Note, jel should fit in an i3 format.  When using user elements there are only up to 50 user elements allowed, elmt01 through elmt50.  FEAP's internal elements have negative numbers for jel.

The other thing to note is that the problem in point 3 is occurring early on: isw is equal to 1, so you are still in the input stage.
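For orientation, the edit is of this shape (a reconstruction from the printed error message; the actual field widths and layout of statement 4000 in elmlib.f may differ):
Code:
c     before: 'type number' printed with i3, which overflows to ***
c4000 format(' *ERROR* ELMLIB: Element:',i6,', type number',i3,
c    &       ' input, isw =',i3)
c     after: i5 is wide enough to show the corrupted value of jel
 4000 format(' *ERROR* ELMLIB: Element:',i6,', type number',i5,
     &       ' input, isw =',i3)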
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 12:55:29 PM
Can you post a sample of what the material cards look like in the partitioned files, for material 1000 or higher?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 01:11:00 PM
Another question:  How many parameters are you saving into d( ) or ud( ) in your user element?
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 01:26:56 PM
Partial good news.  I have created a small problem, a 32x32 mesh with 1024 material sets, that fails in the way you are seeing using FEAP elements (i.e. no convergence).  This will help debugging from our side.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 01:33:25 PM
For the record, here are the files.  Serial files and a 4-way partitioning of them.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 01:40:53 PM
I have also been able to show that this is not an issue of partitioning.  If I make a single-partition (parallel) input file, it also fails, and using a direct solver I get a zero pivot error.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 01:47:51 PM
I see the problem  :(

The format statement for writing out the material cards to the parallel files (also for creating flat files from serial FEAP) is incorrect.  When you have large numbers of materials there is not enough room, and two fields get jammed together.

I'll work up a patch later today.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 02:08:33 PM
Try this.

Edit program/pmatin.f and change format statement 2008 from
Code:
2008  format(2x,a15,1x,15i4:/(16i4))
to
Code:
2008  format(2x,a15,1x,15i5:/(16i4))
This is a quick fix (I hope).

Note you will need to rebuild the FEAP archive and then rebuild your serial and parallel feap executables.
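To see why i4 jams fields once a material number needs all four columns, here is a small standalone demo (illustrative only, not FEAP code):
Code:
      program fmtdemo
c     with i4 the value 1000 fills its whole field, so the blank that
c     normally separates it from the preceding field disappears
      character*15 name
      name = 'SOLId'
      write(*,2008) name, 0, 1000, 1, 2, 3
      write(*,2009) name, 0, 1000, 1, 2, 3
 2008 format(2x,a15,1x,15i4)
 2009 format(2x,a15,1x,15i5)
      end
The first write prints the two leading fields jammed together as '01000'; the second keeps a separating blank.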
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 24, 2021, 02:13:49 PM
Thank you Prof. Govindjee for the prompt confirmation and the solution :). I will test it tomorrow.

By the way, the input cards (printed in the partitioned input files) for the FEAP element (number 1000) are:
Code:
MATErial    1000
  SOLId              01000   1   2   3
    ELAStic MODUli 6
    1.76000e+05 9.10000e+04 6.80000e+04 0.00000e+00 0.00000e+00 0.00000e+00
    9.10000e+04 1.75000e+05 6.80000e+04 0.00000e+00 0.00000e+00 0.00000e+00
    6.80000e+04 6.80000e+04 2.20000e+05 0.00000e+00 0.00000e+00 0.00000e+00
    0.00000e+00 0.00000e+00 0.00000e+00 8.50000e+04 0.00000e+00 0.00000e+00
    0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 7.20000e+04 0.00000e+00
    0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 0.00000e+00 7.20000e+04
    VECTOr ORTHotropic 4.76997e-01 8.42490e-01 2.50369e-01 -3.85284e-01 -5.55979e-02 9.21122e-01
and similarly for the user element:
Code:
MATErial    1000
  user               11000   1   2   3
    ka,mu,sy,ac,ee,ld,et,hh,sn,om,cr

I noticed the jammed material tag number. 
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: Prof. S. Govindjee on June 24, 2021, 03:25:46 PM
In version 8.5 we were not writing this information out in this way, and that is probably why it was working in the older version.
Title: Re: Convergence problems with v8.6 (very slow compared to v8.5)
Post by: arktik on June 25, 2021, 06:39:20 AM
Quick update: I made the changes in program/pmatin.f and reran different variants of the BVP with material tags > 1000. The problem seems to be resolved; I finally have quadratic convergence. Thank you very much for all the support.