FEAP User Forum
FEAP => Parallel FEAP => Topic started by: arktik on June 27, 2021, 02:34:05 PM
-
Dear FEAP Admin,
There seems to be a bug in parallel FEAP which affects reading or creating very large meshes. The bug is triggered mostly when a material model with history is used. I tested Hex8 and Tet10 elements with PLAStic MISEs. A brief summary:
1. When a node number exceeds 7 digits (>9999999), the element connectivity is completely jammed together (seen in the O-file), similar to the last bug, where the suggested fix was to edit program/pmatin.f and change format statement 2008.
2. When a material model with history is used, e.g. PLAStic MISEs, the bug is triggered even with node numbers of fewer than 7 digits. The following error is thrown:
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
Attached please find an MWE in which the block-mesh parameter n can be varied to reproduce the issue.
With Hex8, n>267 triggers the bug; for Tet10, n>196 triggers it. Interestingly, when only an elastic model (no history array) is used, parallel FEAP keeps creating partitioned files for very large meshes, limited only by computer memory (even with jammed nodal numbers).
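To see the jamming effect in isolation, here is a minimal Fortran sketch (not FEAP code; the i8 width is only an assumed field width) of how adjacent fixed-width integer fields first run together and then turn into asterisks as the values grow:

      program jamdemo
!     Minimal sketch of fixed-width integer output (assumed i8 fields)
      integer :: n7, n8, n9
      n7 = 9999999        ! 7 digits: a separating blank remains
      n8 = 12345678       ! 8 digits: fills the field, neighbours jam
      n9 = 123456789      ! 9 digits: the field overflows to ********
      write(*,1000) n7, n7, n7
      write(*,1000) n8, n8, n8
      write(*,1000) n9, n9, n9
 1000 format(3i8)
      end program jamdemo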
Your support is highly appreciated. Sorry, I hate to be the frequent bearer of bugs :-[
-
Parfeap was designed long before problems of this size were thought possible.
There is a parallel partitioner available in one of the sub-folders of parfeap, but I do not think it has been tested in a long time.
Have a look in the documentation to see if there is any information on it, as it may be helpful, especially when the node graph gets very large. If you can post an example of the jammed entries, we may be able to identify the format statement causing the problem.
Is the history error occurring after partitioning, or before/during? If before/during, then the issue arises because when you start up parfeap/feap it also prepares itself for performing a computation. If that is not going to occur, one can force the program to skip the allocation. I would suggest creating a user macro (probably a user mesh macro, to make sure it is processed early) that sets a flag. Then in program/pcontr.f, where the allocation for H1, H2, H3 occurs, you can skip the history allocation if your flag is set; otherwise you do it and the program functions as normal.
If this is occurring during a parallel run, then it is possible that the nodes do not have enough memory. To debug, we will need more information. Perhaps there is info in the O-files.
-
Looking at the parallel output routines, it seems that they check the number of digits needed and provide the required space when writing out the parallel input files. So even if the numbers are jammed together in the regular O-file (which should be fixed), the parallel input will be ok.
If you use a FEAP-generated flat file, the file should be ok, as it uses i10 output fields.
Of course, if you end up with node numbers reaching i10, you can edit the format statement or, better, use the OUTMesh,BINAry option.
-
Thanks Prof. Govindjee for a prompt response.
1. The problem occurs only "during" partitioning. "Before" is ruled out, as no segmentation fault occurs if the input file has no BATCh or INTEractive commands (i.e. no statements between END MESH and STOP).
2. About the parallel partitioner (parfeap/partition): I never use it, as it doesn't work as expected (I should have opened a separate forum topic but didn't). Partitions performed with PARMetis and METis show similar performance (when clocked with the time command), and PARMetis does not seem to be influenced by the number of CPUs. This was tested with v8.5 and v8.6.
3. The binary file generated by OUTMesh,BINAry also doesn't seem to work. For example, when I use parfeap/feap to open an input file that includes the binary file in the following form, a segmentation fault occurs:
BINAry,filename.bin
END
STOP
About tweaking program/pcontr.f and creating a user mesh macro: I will try that later.
-
I have not tried the binary options, so I'm not sure whether they are functional; I just mentioned them as a possibility.
Given what you have written, I think the modification I mentioned, where you avoid the history allocation during partitioning, is the best option.
-
Your suggested quick fix with program/pcontr.f unfortunately doesn't work; I receive the same segmentation fault reported in the first post. Just a thought: if there is no memory limit or allocation problem when no history arrays are needed, and FEAP can handle very large meshes (e.g. >30E6 Tet10 elements) limited only by computer memory, I wonder whether insufficient memory during allocation of the history array is really the underlying problem. As I said, for material models with history arrays the segmentation fault occurs for moderate meshes (e.g. <6E6 Tet10 elements).
-
There are limits imposed on array sizes by 4-byte integers. Also, any access of mr(*) or hr(*) must use an 8-byte integer.
There may also be other limitations from dimensioning or formats; these may be unintentional, as we never anticipated the size of the problems now being attempted!
-
Just to confirm, the problem with the history is occurring when you start parfeap/feap for the purpose of partitioning the mesh?
-
Assuming a Hex8 element with plasticity: NH1 = NH2 = 8 Gauss points x 7 history variables = 56.
The total history length (H) per element is therefore 112.
The maximum value of a signed integer is (2^32)/2 - 1 = 2,147,483,647 <-- this is the maximum possible length of the array H with integer(kind=4).
Using the 3d block command to mesh a unit cube with n elements per axis, we quickly reach the limit at n=267:
Total history length for the entire problem (n=267) = n^3 x 112 = 2,131,826,256.
For n>267 in this specific Hex8 example, the segmentation fault occurs.
Similarly, for the Tet10 element, where the history length (H) per element is 378, the limit is reached at n=196:
Total history length for the entire problem (n=196) = 3/4 x n^3 x 378 = 2,134,623,456.
For n>196, the segmentation fault occurs.
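To make the overflow explicit, here is a small sketch that redoes the Hex8 arithmetic above in kind=8 and compares it with the 4-byte limit (the numbers are the ones quoted in this post):

      program histchk
!     Sketch: total history length for an n x n x n Hex8 block,
!     computed in kind=8 and compared against the 4-byte limit
      integer(kind=8), parameter :: i4max = 2147483647_8
      integer(kind=8) :: n, perel, total
      n     = 268_8        ! first block size past the reported limit
      perel = 112_8        ! history length per Hex8 element (NH1+NH2)
      total = n**3 * perel
      write(*,*) 'total history length =', total
      if(total.gt.i4max) write(*,*) 'exceeds 4-byte integer range'
      end program histchk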
-
I tried Prof. Taylor's suggestion of setting ipr=1 in main/feap86.f to use integer(kind=8) arrays. The source code was freshly compiled after setting the appropriate flag (-fdefault-integer-8) in makefile.in and replacing cmem.c and cmemck.c in unix/memory with those from unix/largemem. Sadly, parfeap/feap does not execute after these changes; it throws a segmentation fault.
-
The interface of parfeap to PETSc is implemented under the assumption that the integer data type is of kind 4; that is why parfeap cannot be successfully compiled with -fdefault-integer-8.
However, the pointers for the FEAP arrays (np and up in pointers.h, nh1 etc. in hdata.h) are declared as integer kind 8, so there should be enough space for addressing components in large arrays.
Maybe a kind 4 integer is involved somewhere in a subroutine?
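A tiny standalone check of that suspicion (just an illustration; the variable names are mine):

      program kinddemo
!     Sketch: a kind=8 pointer value is fine on its own, but it no
!     longer fits if it passes through a kind=4 integer on the way
      integer(kind=8) :: ptr8
      ptr8 = 2147483648_8             ! one past huge(1_4)
      write(*,*) 'value held in kind=8 :', ptr8
      write(*,*) 'largest kind=4 value :', huge(1_4)
      if(ptr8.gt.int(huge(1_4),kind=8)) then
        write(*,*) 'would not fit in a kind=4 integer'
      endif
      end program kinddemo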
-
Setting ipr=1 does not by itself give 8-byte integers; that requires setting the compiler flags.
Integers are 4 bytes and reals are 8 bytes in the standard build, so if problems occur I suspect an integer array could be too big.
How big is the element connection array, ix(*)?
-
I have observed that the number of tangent components written to the O-file is negative when there are too many elements in the model and the integer*4 overflows.
I had not tried to include "show dict" at that time, which might reveal overflows in further arrays such as the history stack.
-
Thank you JStorm for the insight. That means using integer(kind=8) is out of scope for the current FEAP version; therefore setting ipr=1 in main/feap86.f together with the compiler flag -fdefault-integer-8 should be strictly avoided if parallel FEAP is the intended application.
Prof. Taylor, since parfeap/feap can create/handle meshes as large as the computer memory allows for models without history, I don't think the element connection array IX(*) is the issue. For example, I could create a 3d block with n=400 without any problems using the elastic isotropic model. For reference, at n=267 (#elements ~19E6) the IX length is 361,649,097. At n=400 (#elements = 64E6), IX is, however, not printed by show dict (format problem).
Below is the output of show dict for n=400 with the elastic model and a Hex8 mesh:
D i c t i o n a r y   o f   A r r a y s

 Entry  Array   Array  Array        Array           Pointer
Number  Names  Number  Precn       Length             Value     Type

     1     DR      26      2    193443603    17565994625359  Program
     2     LD      34      1          168           1451745  Program
     3      P      35      2          144            726987  Program
     4      S      36      2         1152            727137  Program
     5     PT     346      2          144            728295  Program
     6     ST     347      2         1152            728445  Program
     7     TL      39      2            8            724507  Program
     8     UL      41      2          336            729603  Program
     9     XL      44      2           24            729945  Program
    10      D      25      2          501            729975  Program
    11     IE      32      1           15           1458917  Program
    12  IEDOF     240      1           24           1458945  Program
    13     ID      31      1    386887206    35131602360989  Program
    14     IX      33      1    *********    35130386359965  Program
    15  NDTYP     190      1     64481201    35130321878685  Program
    16   RIXT     100      1     64481201    35130257397405  Program
    17   RBEN     181      1     64000000    35130193396381  Program
    18      X      43      2    193443603    17564903255375  Program
    19    ANG      45      2     64481201    17564838774095  Program
    20   ANGL      46      2            8            730515  Program
    21      F      27      2    386887206    17564451886415  Program
    22     F0      28      2    773774412    17563678111567  Program
    23   FPRO      29      1    386887206    35126969333405  Program
    24    FTN      30      2    773774412    17562710892879  Program
    25      T      38      2     64481201    17562646411599  Program
    26      U      40      2    773774412    17561872636751  Program
    27   NREN      89      1    128962402    35123616308893  Program
    28  EXTND      78      1     64481201    35123551827613  Program
    29  NORMV     206      2    193443603    17561582470991  Program
    30    JP1      21      1    192800399    35122972139165  Program
    31  NODPT     254      1     64481201    35122907657885  Program
    32  XADJN     252      1     64481202    35122266694301  Program
    33   NODG     253      1    *********    35120332257949  Program

Total memory used by FEAP:
  Integer Arrays = 70818530
  Real    Arrays = *********
-
It is important to understand the purpose of pfeap = parfeap/feap. pfeap is mainly used for two purposes: (1) to partition the problem, and (2) to perform a parallel solution. Secondarily, it can be used for special types of serial computations, but I will not get into that.
For your purposes, you are using it for (1) and (2). The problem you have seems to be with (1). This is easily avoided. Here is one solution which I have tested and which works.
1. Copy program/pcontr.f to parallel/pcontr.f
2. Edit parallel/pcontr.f as follows: (a) add the line include 'nohist.h';
(b) where the history allocation occurs, branch around it based on the flag nohist (which will be contained in nohist.h):
      if(nohist) then
        nhmax  = 0
        nh3max = 0
      else
!       Set up stress history addresses
        call sethis(mr(np(32)),mr(np(33)),mr(np(181)),nie,nen,nen1,
     &              numel,nummat,prt)
      endif
3. Add the file 'nohist.h' to the parfeap folder and give it the contents:
      logical   nohist
      common   /nohist/ nohist
4. Add a user mesh macro file, umesh0.f, to the parfeap folder with the following contents:
      subroutine umesh0(tx,prt)

      implicit  none

      include  'umac1.h'
      include  'nohist.h'

      character (len=15) :: tx(*)
      logical            :: prt,pcomp

!     Set command
      if(pcomp(uct,'mes0',4)) then      ! Usual form
        uct    = 'nohi'                 ! Specify 'name'
        nohist = .false.
      elseif(ucount) then               ! Count elements and nodes
      elseif(urest.eq.1) then           ! Read restart data
      elseif(urest.eq.2) then           ! Write restart data
      else                              ! Perform user operation
        nohist = .true.
      endif

      end subroutine umesh0
5. Edit the OBJECTS line of parfeap/makefile to include umesh0.o and pcontr.o.
6. Rebuild pfeap.
7. In the file you are trying to partition, add the command NOHIstory just after the FEAP header lines (see the sketch below). This will set the nohist flag and prevent the memory issue.
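For instance, the top of such an input file might look like this (only a sketch; the title and the control-record values are placeholders, and the usual MATErial/BLOCk mesh description follows unchanged after these lines):

FEAP * * large block mesh to be partitioned
  0 0 0 3 3 8

NOHIstory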
You should now be able to partition problems that are very large (in terms of history).
Note that you will need to make sure the number of partitions you choose is large enough that the history required in a single partition is feasible for your compute nodes; otherwise the code will still crash when you go to make your parallel runs.
If you really want to get to crazy large problems, you can build FEAP with ipr.eq.1, but you will also have to rebuild PETSc to use large integers.
If the number of equations gets very high, you may need to expand some of the format statements in parfeap/pmacr7.F that write out the allocation data to the parallel input files. So before running your partitioned problem, carefully check the parallel input files for format issues where numbers are crowding together.
-
Quote: "If you really want to get to crazy large problems you can build feap with ipr.eq.1 but you will also have to re-build petsc to use large integers too."
I had tried to compile parFEAP 8.4 and PETSc with integer*8, but it did not work because the data size for the MPI commands is hard-coded into parFEAP.
-
Thank you very much, Prof. Govindjee, for the solution. It worked flawlessly. I tested both the partitioning and the full solution of very large meshes with history (so far successful up to ~50E6 elements).
I haven't yet attempted a problem of the order of 1E8 or 1E9 elements, but I think corrections to the write statements and full integration of integer(kind=8) with PETSc (NB: JStorm's remark) should be added to the wish list for future releases.
-
@JStorm
Why do you think that is? FEAP itself only uses integer:: declarations, so the compile flag should take care of those. Its connection to PETSc is via PETSc variable declarations, so those should be ok. The only thing that could perhaps be amiss is the direct MPI calls that use things like MPI_INT, but if the PETSc build was done as 64-bit, then I would imagine those should be ok too (assuming the MPI was configured and built together with PETSc).
Do you see some other spots where this is a problem? If it is the MPI calls, then replacing the data type macro with PETSC_INT should fix that problem. Writing some small standalone test programs should sort the issue out quickly.
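For example, one trivial standalone check is simply to report what integer sizes a given set of compiler flags actually produces (a sketch; compile it once with and once without -fdefault-integer-8):

      program intsize
!     Sketch: report the storage size of a default integer versus an
!     explicit kind=8 integer for the current build settings
      integer         :: idef
      integer(kind=8) :: i8
      write(*,*) 'default integer bits :', storage_size(idef)
      write(*,*) 'integer(kind=8) bits :', storage_size(i8)
      end program intsize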
-
Dear Prof. Govindjee,
I took a look at the tests I had performed on FEAP 8.4 about two years ago.
You are right: the PETSc interface is implemented via PETSc data type declarations.
PetscInt was successfully set to integer*8, and MPICH was compiled together with PETSc.
However, MPI_INT was still integer*4.
That was the sticking point for getting a big-integer MPI compilation to work.
On the other hand, setting the data size in the MPI calls to 'integer', or introducing a FEAP_INT, could be a better solution.
But the time required for further tests and modifications was more than I could offer.
-
I am guessing that using PETSC_INT will then solve the problem. I will put that on the to-do list.
-
great