
Pete wrote:
Thank you both Oliver and Michael for your replies.
I'm not convinced that the problem is memory related. I can guarantee that I have no leaks under Windows, and given that I have not noticed any unusual growth in memory usage and that the code differences between the Windows service and the Unix daemon are trivial, I'm unconvinced that it is the application that leaks.
Windows and Unix have different memory models, and it's possible that Windows is more forgiving of certain kinds of memory access than Linux. Many Unixes will crash if you free memory that has already been freed, or access memory via a dangling pointer. I don't know much about Windows, but it's conceivable that Windows happens to ignore the error (some versions of Windows, for instance, would let you dereference NULL). Perhaps it's a floating point error that Windows turns into a NaN but Linux raises a fatal signal for? Some hardware platforms also have alignment restrictions: if you access memory in certain ways they will crash. You may also be running into reentrancy problems with the standard library, although I suspect that this is unlikely.
Personally I fear that it has something to do with kernel parameter settings under HP-UX (I'm suspicious of maxdsiz and maxssiz), but I'm really not qualified to make such a claim. The fact is that the application is naturally memory hungry (I have seen it peak at over 100MB under heavy load), and it may well be that under HP-UX it dumps core when it hits an OS-imposed wall.
I wouldn't suspect so. You usually have to go a long, long way before the OS steps in and stops you doing things. 100MB is a "light" process as far as the big Unices are concerned.
I have analysed the core dump files many times using gdb, but the failure occurs at random places and I found nothing obvious. The only strange thing is that the application appears to bomb once it tries to allocate more than 8MB of RAM from the heap.
8MB is nothing; you shouldn't have issues until you've gone well over 500MB on an Intel machine.
There is one more major difference between the HP-UX and the other environments: the HP9000 is a multi-CPU machine, while both my Linux and WinXP Pro boxes are single-CPU machines. For this reason, concurrency issues (thread synchronisation) are another area I'd like to focus on when chasing this bug.
Anyway, I'm at my wits end and need help from someone smarter.
What signal is it dying with? Unix kills processes by sending them various signals, and knowing what the signal is helps diagnose the problem. When you use gdb on a core file it usually says near the beginning something like "Program terminated with signal 11, Segmentation fault". Often your shell reports what signal a process exited with too.

If it exits on "Abort" then it means that the compiler/standard library found a bug (e.g., an uncaught exception). Ditto for "Quit". If it exits on "Bus error" then you're probably doing something with pointers that the hardware doesn't support (e.g., writing to a non-aligned address). If it exits on "Segmentation fault" then you're probably doing something silly with pointers (e.g., dereferencing a NULL pointer, dereferencing a free()'d pointer, or similar). If it's something more exotic (Stack Fault / Maximum CPU Exceeded) then you're probably running into an operating system limit. There is documentation about Unix signals available on the wiki here: http://www.wlug.org.nz/UnixSignals

Linux has a great program called "valgrind" that can be used for debugging memory and threading issues. If you can get your program to compile under Linux on x86, I'd **highly** recommend you run valgrind on it.