Hi all,
I got some trouble compiling a finite element program called charles.
I used
GNU Pascal version 20020910, based on gcc-2.95.2 19991024 (release).
to compile the program with profiling using the switch -pg. Because the program took an unusual amount of time running, I made a profile. Profiling the run yields that Preparedisposepointer uses exceedingly long times. This is the result for a fairly small problem, used to get output for this mail. I have seen 95% time use for Preparedisposepointer, but lost that profile testing other versions of gpc, and I am not patient enough to regenerate it, now that this output also shows the problem:
/hole 17 % gprof `which charles` | more Flat profile:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 59.13 3.14 3.14 Preparedisposepointer 3.20 3.31 0.17 _p_read_longreal 3.20 3.48 0.17 _p_trim ...
(sorry for the crappy lay-out)
Charles, compiled by an older version (see below) of gpc (seems to) works fine, but suffers from a memory leak in the file-io routines.
(older version of gpc) /hole 20 % /usr/tm/bin/gpc --version 2.95.2 19991024 (release)
I was not able to locate the difference in the pascal sources that explains why Preparedisposepointer now needs so much time. However, I suspect FreeMemPtr^ (in rts/heap.pas) does some things. A colleague mentioned that it might be a lot of moving in memory to prevent fragmentation.
I tried to create a test program that creates a linked list of integers and disposes it. Then the allocation and de-allocation times are similar in magnitude:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 44.07 1.30 1.30 1 1.30 1.30 pascal_main_program 15.93 1.77 0.47 _p_new 10.85 2.09 0.32 Preparedisposepointer 10.51 2.40 0.31 _p_dispose
Using 8 % gpc --version gpc 20030209, based on gcc-3.2.1 does not change much: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 46.82 1.40 1.40 1 1.40 1.40 pascal_main_program 16.39 1.89 0.49 _p_dispose 13.04 2.28 0.39 _p_new 9.36 2.56 0.28 Preparedisposepointer 6.69 2.76 0.20 Addheaprange
Still allocation and deallocation take similar times
Now I have the following questions: Any idea why deallocation takes forever in the FE program as compared to allocation? Maybe the reason is that the data-structure is slightly more complicated than a linked list of integers ;-) How can I isolate the problem? What kind of feedback can I provide you with to isolate and solve the problem?
Gpc 20030209, based on gcc-3.2.1 does not compile the FE program, but that is another story, which is in my next mail.
Regards, Marten Jan
Marten Jan de Ruiter wrote:
I got some trouble compiling a finite element program called charles.
I used
GNU Pascal version 20020910, based on gcc-2.95.2 19991024 (release).
to compile the program with profiling using the switch -pg. Because the program took an unusual amount of time running, I made a profile. Profiling the run yields that Preparedisposepointer uses exceedingly long times. This is the result for a fairly small problem, used to get output for this mail. I have seen 95% time use for Preparedisposepointer, but lost that profile testing other versions of gpc, and I am not patient enough to regenerate it, now that this output also shows the problem:
I don't know what system you are using, but many C systems have great problems with free() when there are many items to free. I would expect that gpc's new/dispose is using the C runtime malloc/free implementation, but I am prepared for Frank to refute this.
For DJGPP I wrote a replacement, called nmalloc, and available at:
http://cbfalconer.home.att.net/download/nmalloc.zip
which avoids this problem, because free is now an O(1) process in place of O(n). The original O(n) expands to O(n**2) when many items are to be freed.
nmalloc can create a malloc.o object file, which can be linked ahead of the CRT0 library module, and eliminates the problem. This is only tested under DJGPP, and compiled with GCC. The only system dependance is the use of sbrk() to acquire memory to manage. The module (compiled with NDEBUG set) generates the malloc, free, and realloc functions.
If you are willing to descend to C you can use the included evilalgo.c program to see some of the effects. The problem also shows up dramatically in:
http://cbfalconer.home.att.net/download/hashlib.zip
where some of the tests isolate the effect. In particular try:
hashtest 4 100000 (which does all the frees)
vs
hashtest 4 100001 (which suppresses the frees)
and time the runs. Use larger values if needed. The 'oddness' of the second parameter does the free suppression. Hashlib is completely portable, while nmalloc is not. However most Unices etc. supply the raw memory via sbrk, so compilation of nmalloc there (with gcc) should usually function.
Aside to Frank: If I am correct, you are perfectly free to incorporate nmalloc in the gpc system. It also detects a subset of possible trashed arenas, and raises SIGABRT in those cases.
CBFalconer wrote:
I don't know what system you are using, but many C systems have great problems with free() when there are many items to free. I would expect that gpc's new/dispose is using the C runtime malloc/free implementation, but I am prepared for Frank to refute this.
Yes, it is (by default, it can be changed at link time, as you described, or at run time, using GetMemPtr etc.).
For DJGPP I wrote a replacement, called nmalloc, and available at:
http://cbfalconer.home.att.net/download/nmalloc.zip
which avoids this problem, because free is now an O(1) process in place of O(n). The original O(n) expands to O(n**2) when many items are to be freed.
[...]
Aside to Frank: If I am correct, you are perfectly free to incorporate nmalloc in the gpc system. It also detects a subset of possible trashed arenas, and raises SIGABRT in those cases.
I tried it under Linux, and didn't make a significant difference. I didn't try it on other systems, but have you tried it on anything else but DJGPP?
Perhaps it's a problem on some systems, but apparently not all. That's one reason why I wouldn't include it by default. Another one is that people might want to link in an alternative malloc (perhaps another optimized version, or a GC one, etc.). Also, the dependency on sbrk might be problematic (it's neither POSIX not C standard; most Unix systems should have it, also DJGPP, but I'm not sure about, e.g., Windows; it's a single routine, but I think one that can't be easily worked around on systems with substantially different memory management).
However, if you like I can put your packages in `contrib' on the web site. (Or you can get an account if you plan to upload or change your contributions regularly.)
BTW, there were a few compile problems (conflicts with `ulong' and `roundup'; uses `fakesbrk' by default; includes "cokusMT.h" while the zip contains a file `cokusmt.h'; works only on case-insensitive file systems), and some little Makefile problems -- it omits `-lm' (and sets, but doesn't use `CFLAGS', so `-lm' can't be added there), and it adds `.exe' which is at least strange on non-Dos systems; I think even gcc on DJGPP adds it automatically.
Frank
Frank Heckenbach wrote:
... snip ...
BTW, there were a few compile problems (conflicts with `ulong' and `roundup'; uses `fakesbrk' by default; includes "cokusMT.h" while the zip contains a file `cokusmt.h'; works only on case-insensitive file systems), and some little Makefile problems -- it omits `-lm' (and sets, but doesn't use `CFLAGS', so `-lm' can't be added there), and it adds `.exe' which is at least strange on non-Dos systems; I think even gcc on DJGPP adds it automatically.
You need to compile it with the -DNDEBUG switch, otherwise it uses the fakesbrk etc. for testing and debugging. I believe "make malloc.o" will do it all. It is undergoing revision for added debugging capabilities under DJGPP, not to alter its function.
Marten Jan de Ruiter wrote:
to compile the program with profiling using the switch -pg. Because the program took an unusual amount of time running, I made a profile. Profiling the run yields that Preparedisposepointer uses exceedingly long times. This is the result for a fairly small problem, used to get output for this mail. I have seen 95% time use for Preparedisposepointer, but lost that profile testing other versions of gpc, and I am not patient enough to regenerate it, now that this output also shows the problem:
/hole 17 % gprof `which charles` | more Flat profile:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 59.13 3.14 3.14 Preparedisposepointer 3.20 3.31 0.17 _p_read_longreal 3.20 3.48 0.17 _p_trim ...
(sorry for the crappy lay-out)
Charles, compiled by an older version (see below) of gpc (seems to) works fine, but suffers from a memory leak in the file-io routines.
I was not able to locate the difference in the pascal sources that explains why Preparedisposepointer now needs so much time. However, I suspect FreeMemPtr^ (in rts/heap.pas) does some things. A colleague mentioned that it might be a lot of moving in memory to prevent fragmentation.
Allocated memory is never moved (this would break pointers pointing to it).
By default FreeMemPtr points to free(). I'm not sure if profiling includes subroutines (in particular, libc routines which probably were not compiled with `-pg'). If so, you could try to move the call into a subroutine to find out if free() or PrepareDisposePointer actually takes the time.
PrepareDisposePointer shouldn't do much unless `Mark' is used (anywhere in the program) which, e.g., the `HeapMon' unit does. If you use either of them, this probably explains it. `Mark' is not optimized for speed (I consider it mostly an outdated and/or debugging thing).
If that's an issue for you, feel free to improve it (perhaps use a hash table instead of the list etc., but note that nested `Mark's should work) -- it's Pascal code, so the usual argument for staying away from the internals doesn't count. ;-)
Now I have the following questions: Any idea why deallocation takes forever in the FE program as compared to allocation? Maybe the reason is that the data-structure is slightly more complicated than a linked list of integers ;-)
The data structure doesn't matter at all to the heap management. The size of the blocks might have some influence, but I don't suspect so here (if anything, smaller blocks (integers) should be slower -- if you compare the same absolute memory size, of course).
Frank
Tue Mar 18, 12:55:35, Marten Jan de Ruiter wrote
Now I have the following questions: Any idea why deallocation takes forever in the FE program as compared to allocation? Maybe the reason is that the data-structure is slightly more complicated than a linked list of integers ;-) How can I isolate the problem? What kind of feedback can I provide you with to isolate and solve the problem?
Gpc 20030209, based on gcc-3.2.1 does not compile the FE program, but that is another story, which is in my next mail.
Another gcc-3.2.1 based compiler does compile: gpc --version gpc 20021128, based on gcc-3.2.1 However, the profiling does not improve.
Flat profile:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 89.16 23.36 23.36 Preparedisposepointer 0.95 23.61 0.25 _p_read_longreal 0.84 23.83 0.22 _p_internal_getc
Wed Mar 19, 05:54:17, Frank Heckenbach wrote:
By default FreeMemPtr points to free(). I'm not sure if profiling includes subroutines (in particular, libc routines which probably were not compiled with `-pg'). If so, you could try to move the call into a subroutine to find out if free() or PrepareDisposePointer actually takes the time.
I do not know how exactly to do this. Here I insert some code that I expect to do it, and some relevant code for reference. However, I get an error: assignment from incompatible pointer type, because I do not know how to setup the pointer to my own routine.
Snippets from rts/heap.pas: procedure CFreeMem (aPointer: Pointer); asmname 'free'; ... FreeMemType = ^procedure (aPointer: Pointer); ... PMarkList = ^TMarkList; TMarkList = record Next, Prev : PMarkList; Marked : Boolean; MaxIndexUsed, PointersUsed: Integer; Entries : array [0 .. 255] of record Ptr : Pointer; PSize : SizeType; Caller: Pointer end end; ... FreeMemPtr : FreeMemType = @CFreeMem; asmname '_p_FreeMemPtr'; ... procedure PrepareDisposePointer (aPointer: Pointer); var p: PMarkList; ... FreeMemPtr^ (p) ...
program TestFree;
procedure CFreeMem (aPointer: Pointer); asmname 'free';
type LL_t = record {Linked List type} i : integer; next : ^LL_t; end;
FreeMemType = ^procedure (aPointer: Pointer);
PMarkList = ^TMarkList; TMarkList = record Next, Prev : PMarkList; Marked : Boolean; MaxIndexUsed, PointersUsed: Integer; Entries : array [0 .. 255] of record Ptr : Pointer; PSize : SizeType; Caller: Pointer end end;
var FreeMemPtr : FreeMemType = @CFreeMem; asmname '_p_FreeMemPtr'; OrigFreeMemPtr : FreeMemType; i : integer; LLfirst, LLcurrent, LLnew: ^LL_t;
procedure CallFreeEmjay(var p:PMarkList); begin OrigFreeMemPtr^(p); end;
begin {program}
{reroute the calls} OrigFreeMemPtr:=FreeMemPtr; FreeMemPtr:=@CallFreeEmjay; {Don't know how to do this???}
{test} new(LLfirst); {create first link} LLfirst^.i:=1; LLfirst^.next:=nil; LLcurrent:=LLfirst; {add other links} for i:=2 to 100000 do begin new(LLnew); LLnew^.i:=i; LLnew^.next:=LLcurrent^.next; LLcurrent^.next:=LLnew; LLcurrent:=LLcurrent^.next;\ end; {dispose all links again} LLnew:=nil; while LLfirst<> nil do begin LLcurrent:=LLfirst; {get the first link} LLfirst:=LLfirst^.next; {hold the second} dispose(LLcurrent) {throw away the first link} end;
end.
Marten Jan de Ruiter wrote:
Wed Mar 19, 05:54:17, Frank Heckenbach wrote:
By default FreeMemPtr points to free(). I'm not sure if profiling includes subroutines (in particular, libc routines which probably were not compiled with `-pg'). If so, you could try to move the call into a subroutine to find out if free() or PrepareDisposePointer actually takes the time.
I do not know how exactly to do this. Here I insert some code that I expect to do it, and some relevant code for reference. However, I get an error: assignment from incompatible pointer type, because I do not know how to setup the pointer to my own routine.
procedure CallFreeEmjay (p: Pointer);
And do not copy the variable declaration from heap.pas (otherwise you get two distinct variables), but use the GPC unit (which also declares CGetMem etc., so you don't need to copy this, either).
Frank