FEAP User Forum

FEAP => Parallel FEAP => Topic started by: arktik on June 27, 2021, 02:34:05 PM

Title: Error reading and/or creating large meshes
Post by: arktik on June 27, 2021, 02:34:05 PM
Dear FEAP Admin,

There seems to be a bug in parallel FEAP which affects reading or creating very large meshes. The bug is mostly triggered when a material model with history is used. I tested Hex8 and Tet10 elements with PLAStic MISEs. A brief summary:

  1. When a node number exceeds 7 digits (>9999999), the element connectivity is completely jammed together (seen in the O-file), similar to the last bug (cf. the earlier fix of editing program/pmatin.f to widen format statement 2008).

  2. When a material model with history is used, e.g. PLAStic MISEs, the bug is triggered even when node numbers have fewer than 7 digits. The following error is thrown:
Code: [Select]
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------

Attached please find an MWE where the parameter n for the block mesh can be chosen to reproduce the issue.

With Hex8, n>267 triggers the bug; for Tet10, n>196 does. Interestingly, when only an elastic model (no history array) is used, parallel FEAP can go on creating partitioned files for the very large mesh, limited only by computer memory (even with jammed nodal numbers).

Your support is highly appreciated. Sorry, I hate to be the frequent bearer of bugs  :-[
Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on June 27, 2021, 04:30:07 PM
Parfeap was designed long before problems of this size were thought of as possible.

There is a parallel partitioner available in one of the sub-folders to parfeap but I do not think it has been tested in a long time.
Have a look in the documentation to see if there is any information on it as it may be helpful, especially when the node graph
gets very large.  If  you can post an example of the jammed entries, we may be able to identify the format statement causing the
problem.

Is the history error occurring after partitioning, or before/during?  If before/during, the issue with history arises because
when you start up parfeap/feap it prepares itself to also perform a computation.  If that is not going to occur, one can force
the program to skip the allocation.  I would suggest creating a user macro (probably a user mesh macro, to make sure it is
processed early) that sets a flag.  Then in program/pcontr.f, where the allocation for H1, H2, H3 occurs, you can skip the
history allocation if your flag is set; otherwise you do it and the program functions as normal.

If this is occurring during a parallel run, then it is possible that the compute nodes do not have enough memory.  To debug, we
will need more information.  Perhaps there is info in the O-files.
Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on June 27, 2021, 04:56:08 PM
Looking at the parallel output routines, it seems that they check the number of digits needed and provide the needed space when writing out the parallel input files.  So even if the numbers are jammed up in the regular O-file (which should be fixed), the parallel input files will be ok.

If you use a FEAP-generated flat file, the file should be ok as it uses i10 output fields.
Of course, if you end up with node numbers reaching i10, then you can edit the format statement or, better, use the OUTMesh,BINAry option.
Title: Re: Error reading and/or creating large meshes
Post by: arktik on June 28, 2021, 12:44:01 AM
Thanks Prof. Govindjee for a prompt response.

1. The problem occurs only "during" partitioning. "Before" is ruled out since no segmentation fault occurs if the input file has no BATCh or INTEractive commands (i.e. no statements between END MESH and STOP).

2. About the parallel partitioner (parfeap/partition): I never use it, as it doesn't work as expected (I should have opened a separate forum topic but didn't). The partitions performed with PARMetis and METIs have similar performance (when clocked with the time command), and PARMetis does not seem to be influenced by the number of CPUs. This was tested with v8.5 and v8.6.

3. The binary file generated by OUTMesh,BINAry also doesn't seem to work. For example, when I use parfeap/feap to open an input file which uses the binary file in the following format, a segmentation fault occurs:
Code: [Select]
BINAry,filename.bin
END
STOP

About tweaking program/pcontr.f and creating a user mesh macro, I will try to handle it later.
Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on June 28, 2021, 10:47:58 PM
I have not tried the binary options, so I'm not sure if they are functional; I just mentioned it as a possibility.

Given what you have written, I think the modification that I mentioned where you avoid the history allocation
during partitioning is the best option.
Title: Re: Error reading and/or creating large meshes
Post by: arktik on June 29, 2021, 02:04:56 PM
Your suggested quick fix in program/pcontr.f unfortunately doesn't work; I receive the same segmentation fault as reported in the first post. Just a thought: if there is no memory limit or allocation problem when no history arrays are needed, and FEAP can handle very large meshes (e.g. >30E6 Tet10 elements) limited only by computer memory, I wonder whether running out of memory during allocation of the history array is really the underlying problem. As I said, for material models with history arrays the segmentation fault occurs for moderate meshes (e.g. <6E6 Tet10 elements).
Title: Re: Error reading and/or creating large meshes
Post by: Prof. R.L. Taylor on June 29, 2021, 04:07:52 PM
There are limits imposed by array size for 4 byte integers.  Also any access of mr(*) or hr(*) must be an 8 byte integer.

There may also be other limitations from dimensioning or formats; these might be unintentional, as we never anticipated the size of problems now being attempted!
Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on June 29, 2021, 05:40:25 PM
Just to confirm, the problem with the history is occurring when you start parfeap/feap for the purpose of partitioning the mesh?
Title: Re: Error reading and/or creating large meshes
Post by: arktik on June 30, 2021, 02:44:01 AM
Assuming a Hex8 element with plasticity: NH1 = NH2 = 8 Gauss points x 7 history variables = 56.
Total history array length (H) per element is therefore 112.
Maximum value of a signed 4-byte integer = (2^32)/2 - 1 = 2,147,483,647 <-- this is the maximum possible length of array H with integer(kind=4).

Using the 3d block command to mesh a unit cube with n elements per axis, we quickly reach the limit at n=267.

Total history length for the entire problem (n=267) = n^3 x 112 = 2,131,826,256.
For n>267 in this specific Hex8 example, the segmentation fault occurs.

Similarly for the Tet10 element, where the history array length (H) per element is 378, the limit is reached at n=196.
Total history length for the entire problem (n=196) = 3/4 x n^3 x 378 = 2,134,623,456.

For n>196, the segmentation fault occurs.
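Just to make the overflow concrete, here is a tiny standalone check (illustration only, not FEAP code; with gfortran the 4-byte product silently wraps to a negative value):
Code: [Select]
      program histlimit
!     Illustration only: total Hex8 history length computed with
!     8-byte and with default 4-byte integers.
      implicit none
      integer(kind=8) :: len8
      integer(kind=4) :: len4
      integer         :: n
!     First block size that fails with Hex8 (112 history terms/element)
      n    = 268
!     Exact value: 2,155,869,184 which exceeds 2^31 - 1
      len8 = int(n,8)**3 * 112_8
!     Same product in 4-byte arithmetic overflows (wraps negative)
      len4 = n**3 * 112
      write(*,*) 'n =', n, ' exact =', len8, ' kind=4 =', len4
      end program histlimit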
Title: Re: Error reading and/or creating large meshes
Post by: arktik on June 30, 2021, 02:52:45 AM
I tried Prof. Taylor's suggestion of setting ipr=1 in main/feap86.f to use integer(kind=8) arrays. The source code was freshly compiled after setting the appropriate flag (-fdefault-integer-8) in makefile.in and replacing cmem.c and cmemck.c in unix/memory with those from unix/largemem. Sadly, parfeap/feap does not execute after these changes; it throws a segmentation fault.
Title: Re: Error reading and/or creating large meshes
Post by: JStorm on June 30, 2021, 04:11:27 AM
The interface of parfeap to PETSc is implemented under the assumption that the integer data type is of kind 4.
That is why parfeap cannot be successfully compiled with "-fdefault-integer-8".

However, the pointers for the FEAP arrays (np and up in pointers.h, nh1 etc. in hdata.h) are declared as integer kind 8.
Thus, there should be enough space for addressing components in large arrays.
Maybe a kind 4 integer is involved somewhere in a subroutine?
Title: Re: Error reading and/or creating large meshes
Post by: Prof. R.L. Taylor on June 30, 2021, 04:13:12 AM
ipr=1 does not set 8 byte integers; that requires setting compiler flags.

Integers are 4 bytes and reals are 8 bytes in the standard build. So if problems occur, I suspect an integer array could be too big.

How big is the element connection array, ix(*)?

Title: Re: Error reading and/or creating large meshes
Post by: JStorm on June 30, 2021, 04:25:35 AM
I have observed that the number of tangent components written to the O-file is negative when there are too many elements in the model and the integer*4 overflows.
I had not tried "show dict" at that time, which might also reveal overflows in further arrays, like the history stack.
Title: Re: Error reading and/or creating large meshes
Post by: arktik on June 30, 2021, 05:25:23 AM
Thank you JStorm for the insight. That means using integer(kind=8) is out of scope for the current FEAP version. Therefore setting ipr=1 in main/feap86.f together with the compiler flag -fdefault-integer-8 should be strictly avoided if parallel FEAP is the intended application.

Prof. Taylor, since parfeap/feap can create/handle meshes as large as allowed by computer memory for models without history, I don't think the element connection array IX(*) is the issue. E.g. I could create a 3d block with n=400 without any issues using the elastic isotropic model. For reference, at n=267 (#elements ~19E6) the IX length is 361,649,097. At n=400 (#elements = 64E6), IX is, however, not printed by show dict (format problem).

Below I have posted the output of show dict with n=400 for the elastic model with the Hex8 mesh:
Code: [Select]
     D i c t i o n a r y    o f   A r r a y s

           Entry  Array   Array  Array    Array            Pointer
          Number  Names  Number  Precn   Length          Value Type
              1   DR        26      2193443603  17565994625359 Program
              2   LD        34      1      168         1451745 Program
              3   P         35      2      144          726987 Program
              4   S         36      2     1152          727137 Program
              5   PT       346      2      144          728295 Program
              6   ST       347      2     1152          728445 Program
              7   TL        39      2        8          724507 Program
              8   UL        41      2      336          729603 Program
              9   XL        44      2       24          729945 Program
             10   D         25      2      501          729975 Program
             11   IE        32      1       15         1458917 Program
             12   IEDOF    240      1       24         1458945 Program
             13   ID        31      1386887206  35131602360989 Program
             14   IX        33      1*********  35130386359965 Program
             15   NDTYP    190      1 64481201  35130321878685 Program
             16   RIXT     100      1 64481201  35130257397405 Program
             17   RBEN     181      1 64000000  35130193396381 Program
             18   X         43      2193443603  17564903255375 Program
             19   ANG       45      2 64481201  17564838774095 Program
             20   ANGL      46      2        8          730515 Program
             21   F         27      2386887206  17564451886415 Program
             22   F0        28      2773774412  17563678111567 Program
             23   FPRO      29      1386887206  35126969333405 Program
             24   FTN       30      2773774412  17562710892879 Program
             25   T         38      2 64481201  17562646411599 Program
             26   U         40      2773774412  17561872636751 Program
             27   NREN      89      1128962402  35123616308893 Program
             28   EXTND     78      1 64481201  35123551827613 Program
             29   NORMV    206      2193443603  17561582470991 Program
             30   JP1       21      1192800399  35122972139165 Program
             31   NODPT    254      1 64481201  35122907657885 Program
             32   XADJN    252      1 64481202  35122266694301 Program
             33   NODG     253      1*********  35120332257949 Program
          Total memory used by FEAP:
                    Integer Arrays =  70818530
                    Real    Arrays = *********
Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on June 30, 2021, 02:49:08 PM
It is important to understand the purpose of pfeap=parfeap/feap.  pfeap is mainly used for two purposes: (1) to partition the problem, and (2) to perform a parallel solution.  Secondarily, it can be used for special types of serial computations -- but I will not get into that.

For your purposes, you are using it for (1) and (2).  The problem you have seems to be with (1).  This is easily avoided.  Here is one solution which I have tested and which works.

1. Copy program/pcontr.f to parallel/pcontr.f
2. Edit parallel/pcontr.f as follows:  (a) add the line
Code: [Select]
      include 'nohist.h'
(b) where the history allocation occurs, branch around it based on the flag nohist (which is declared in nohist.h):
Code: [Select]
        if(nohist) then
          nhmax  = 0
          nh3max = 0
        else
!       Set up stress history addresses
          call sethis(mr(np(32)),mr(np(33)),mr(np(181)),nie,nen,nen1,
     &              numel,nummat,prt)
        endif
3. Add the file 'nohist.h' to the parfeap folder and give it the contents:
Code: [Select]
      logical         nohist
      common /nohist/ nohist
4. Add a user mesh macro to the parfeap folder, umesh0.f, with the following contents:
Code: [Select]
      subroutine umesh0(tx,prt)
      implicit  none

      include  'umac1.h'
      include  'nohist.h'

      character (len=15) :: tx(*)

      logical       :: prt,pcomp

!     Set command

      if(pcomp(uct,'mes0',4)) then      ! Usual    form
        uct = 'nohi'                    ! Specify 'name'
        nohist = .false.
      elseif(ucount) then               ! Count elements and nodes

      elseif(urest.eq.1) then           ! Read  restart data

      elseif(urest.eq.2) then           ! Write restart data

      else                              ! Perform user operation
        nohist = .true.
      endif

      end subroutine umesh0
5. Edit the OBJECTS line of parfeap/makefile to include umesh0.o and pcontr.o.
6. Rebuild pfeap.
7. In the file you are trying to partition, add the command NOHIstory just after the feap header lines.  This will set the nohist flag and prevent the memory issues.

You should now be able to partition problems which are very large (in terms of history).

Note you will need to make sure that the number of partitions you choose is large enough, so that the history required in a single partition is feasible for your compute nodes.  Otherwise the code will still crash when you go to make your parallel runs.

If you really want to get to crazy large problems you can build FEAP with ipr.eq.1, but you will also have to rebuild PETSc to use large integers too.

If the number of equations gets very high, you may need to expand some of the format statements in parfeap/pmacr7.F that write out the allocation data to the parallel input files.  So before running your partitioned problem, carefully check the parallel input files for format issues where numbers are crowding together.
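For orientation, the top of the partitioning input file might then look something like this (just a sketch; the title, the control record values and the elided mesh records are placeholders for your own problem):
Code: [Select]
feap ** very large block mesh - partitioning run
  0 0 0 3 3 8

NOHIstory

! ... MATErial and BLOCk records as usual, then END MESH, the batch
! ... partitioning commands, and STOP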
Title: Re: Error reading and/or creating large meshes
Post by: JStorm on June 30, 2021, 10:37:35 PM
Quote from Prof. S. Govindjee: "If you really want to get to crazy large problems you can build feap with ipr.eq.1 but you will also have to re-build petsc to use large integers too."

I had tried to compile parFEAP 8.4 and PETSc with integer*8.
But this did not work, because the data size for the MPI commands is hard-coded into parFEAP.
Title: Re: Error reading and/or creating large meshes
Post by: arktik on July 01, 2021, 06:31:41 AM
Thank you very much, Prof. Govindjee, for the solution. It worked flawlessly. I tested both the partitioning and the full solution of very large meshes with history (so far successful up to ~50E6 elements).

I haven't yet approached or tested a problem of the order of 1E8 or 1E9 elements. But I think corrections to the write statements and full integration with integer(kind=8) for PETSc (NB JStorm's remark) should be added to the wish list for future releases.
Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on July 01, 2021, 08:58:08 AM
@JStorm

Why do you think that is?  FEAP itself only uses integer:: declarations, so the compile flag should take care of those.  Its connection to PETSc is via PETSc variable declarations, so those should be ok.  The only thing that could perhaps be amiss is the direct MPI calls that use things like MPI_INT, but if the PETSc build was done as 64-bit then I would imagine that those should be ok too (assuming the MPI was configured and built together with the PETSc).

Do you see some other spots where this is a problem?  If it is the MPI calls, then replacing the data type macro with PETSC_INT should fix that problem.  Writing some small standalone test programs should sort the issue out quickly.
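For what it's worth, a small standalone test of the kind I mean might look like this (untested sketch; it simply checks that an 8-byte integer survives an Allreduce when the matching datatype is used, here MPI_INTEGER8, rather than a hard-coded 4-byte MPI_INTEGER as in the parfeap calls):
Code: [Select]
      program testint8
!     Standalone sketch: Allreduce on an 8-byte integer using the
!     matching MPI datatype (MPI_INTEGER8, if your MPI provides it).
      use mpi
      implicit none
      integer         :: ierr, rank
      integer(kind=8) :: nloc, ntot
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
!     A value too large for integer(kind=4)
      nloc = 3000000000_8
      call MPI_Allreduce(nloc, ntot, 1, MPI_INTEGER8, MPI_SUM,
     &                   MPI_COMM_WORLD, ierr)
      if(rank.eq.0) write(*,*) 'global sum =', ntot
      call MPI_Finalize(ierr)
      end program testint8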
Title: Re: Error reading and/or creating large meshes
Post by: JStorm on July 01, 2021, 12:26:21 PM
Dear Prof. Govindjee,

I took a look at the tests I performed on FEAP 8.4 about two years ago.
You are right: the PETSc interface is implemented via PETSc data type declarations.
PetscInt was successfully set to integer*8, and MPICH was compiled together with PETSc.
However, MPI_INT was still integer*4.

That is the point at which a big-integer MPI compilation could make it work.
On the other hand, setting the data size in the MPI calls to either integer or introducing a FEAP_INT might be the better solution.
But the time for further tests and modifications was more than I could offer.

Title: Re: Error reading and/or creating large meshes
Post by: Prof. S. Govindjee on July 01, 2021, 12:50:02 PM
I am guessing that using PETSC_INT will then solve the problem.  I will put that on the to do list.
Title: Re: Error reading and/or creating large meshes
Post by: JStorm on July 01, 2021, 01:28:32 PM
great