
Hi UNIXers,

I'm working on a port of a very large C++ project from Windows to Unix. The aim is to get the system running successfully under HP-UX 11i. The application is multithreaded and runs as a service under Windows and as a daemon under Unix. It uses the third-party libraries Boost, CenterPoint XML, OTL++ and libodbc++, and currently interacts with SQL Server and Oracle databases.

I have to say that I'm definitely not a Unix person; my skills are in the Windows arena. Even so, I have managed to convert the code successfully so that it runs fine under Linux RH8 and RH9 (compiled using GCC 3.2). However, I'm extremely frustrated to find that I have a major problem with the software under HP-UX: it seems to core dump after running for "some time" (typically 2-3 hours). I simply don't know enough about HP-UX and urgently need someone to assist me. The skills I'm seeking are:

- HP9000/HP-UX 11i
- C++
- STL, pthread (or even better boost::thread)
- GCC and GDB

One word of caution: the HP9000 machine I'm using for testing is located in the USA and Internet access is relatively slow. I have managed using an X client, but response times make this a fairly impractical option.

I'd love to hear from anyone who might be interested.

Peter Hug
Warehouse Optimization LLC
Phone +64-6-8354933

However, I'm extremely frustrated to find that I have a major problem with the software under HP-UX: it seems to core dump after running for "some time" (typically 2-3 hours).
I see you are using gcc and gdb. You can run the program under gdb to see what it does. I see you are using multithreading: that makes tracking it with gdb more complicated. I have found with processes that are forked that gdb gets associated with the parent process, and if the bug is in the child process it is hard to track. I haven't gone deeply enough into it to find out if there is a way of associating gdb with the correct process.

Alternatively, since the daemon core dumps, you can run it as normal until it crashes and then use gdb to load the core dump and report the state of the process at the time of dumping. You compile the program with the -g option (for debugging info) and run it. When it has faulted and core dumped (make sure you allow it to write a full core dump file) you then run gdb as:

    gdb <executable> <coredumpfile>

and you can use the 'bt' command (for backtrace, I think) to get a report on the stack, which routines called which, and at exactly what point the program faulted. You can use the 'print' command to check local variables (of the current frame) and global variables as they were at the time of the core dump. That might reveal what happened. This can all be run with a text terminal only, thus solving your problem with a slow X Window connection.

I personally have no experience with HP-UX so can't help with any issues specific to HP-UX, but I have used gcc and gdb under three different flavours of Unix (four if you count Mac OS X).

If the problem is memory leaks and the like you could link the daemon with a specialised memory management library that mungs memory that is allocated (i.e. puts in pointer values outside the current process memory location so that if they are dereferenced they will cause a segmentation violation) and puts walls around memory allocations so that if you overwrite a memory allocation it is detected when you free the memory. I have found such tools very useful for tracking memory overwrite problems.

Michael.

Indeed. His problems sound memory related: a double free, a leak, or a dereference of an invalid pointer. Tools like Valgrind, dmalloc, Electric Fence and the like will probably help.
If the problem is memory leaks and the like you could link the daemon with a specialised memory management library that mungs memory that is allocated (i.e. puts in pointer values outside the current process memory location so that if they are dereferenced they will cause a segmentation violation) and puts walls around memory allocations so that if you overwrite a memory allocation it is detected when you free the memory. I have found such tools very useful for tracking memory overwrite problems.
Regards -- Oliver Jones » Director » oliver(a)deeperdesign.com » +64 (21) 41 2238 Deeper Design Limited » +64 (7) 377 3328 » www.deeperdesign.com

Thank you both Oliver and Michael for your replies.

I'm not convinced that the problem is memory related. I can guarantee that I have no leaks under Windows, and based on the facts that I did not notice any unusual growth in memory usage and that the code differences between the Windows service and the Unix daemon are trivial, I'm unconvinced that it is the application that leaks. Personally I fear that it has something to do with kernel parameter settings (I'm suspicious of maxdsiz and maxssiz) under HP-UX, but I'm really not qualified enough to make such a claim. A fact is that the application is naturally memory hungry (I have seen it peak at over 100MB under heavy load) and it may well be that under HP-UX it cores when it hits an OS imposed wall.

I have analysed the core dump files many times using gdb but the failure occurs at random places and I found nothing obvious. The only strange thing is that it appears that the application bombs once it tries to allocate more than 8MB of RAM from the heap.

There is one more major difference between the HP-UX and the other environments: the HP9000 is a multi-CPU machine, while both my Linux and WinXP Pro boxes are single-CPU machines. For this reason, concurrency issues (thread synchronisation) are another area I'd like to focus on when chasing this bug.

Anyway, I'm at my wits end and need help from someone smarter.

I have analysed the core dump files many times using gdb but the failure occurs at random places and I found nothing obvious. The only strange thing is that it appears that the application bombs once it tries to allocate more than 8MB of RAM from the heap.
The 8MB figure looks familiar. It is the default size that many Unices allocate to the program's stack. Could you be overwriting the program stack? Allocating too many local variables can cause this - normally only achieved by recursive entry of a single routine since 8MB is usually a huge stack. Admittedly I'm stabbing in the dark here! With crashes at random places I'm not so certain this is likely to be the explanation.
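To show what I mean, a made-up example (obviously nothing like your code): a large local buffer plus recursion walks straight past an 8MB stack limit.

    #include <cstring>

    // Each call puts a 64 KB buffer on the stack, so around 200 nested calls
    // need roughly 12.8 MB of stack and will normally die with SIGSEGV under
    // a default 8 MB stack limit (build without optimisation to see it).
    void process(int depth) {
        char buffer[64 * 1024];           // large local variable, lives on the stack
        std::memset(buffer, 0, sizeof(buffer));
        if (depth > 0)
            process(depth - 1);           // recursion multiplies the per-call cost
    }

    int main() {
        process(200);                     // ~12.8 MB of stack frames
        return 0;
    }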
There is one more major difference between the HP-UX and the other environments: the HP9000 is a multi-CPU machine, while both my Linux and WinXP Pro boxes are single-CPU machines. For this reason, concurrency issues (thread synchronisation) are another area I'd like to focus on when chasing this bug.
And this is also where you step out of my experience too. I have run my compiled programs on multiprocessor Alphas and also Linices, but they are really only single process programs. I have never had to use threads and have used fork rarely so don't have much experience in debugging such programs. The techniques I described in my last post are what I have found very useful in my own programming. Michael.

The 8MB figure looks familiar. It is the default size that many Unices allocate to the program's stack.
What really puzzles me is that when the HP-UX bible talks about maxdsiz (maximum size of data segment) and maxssiz (maximum size of stack segment) it states for both that the size affects the size of the local heap. This seems weird, I mean the local heap surely resides EITHER on the stack OR in the data segment but not both?!?!
And this is also where you step out of my experience too. I have run my compiled programs on multiprocessor Alphas and also Linices, but they are really only single process programs. I have never had to use threads and have used fork rarely so don't have much experience in debugging such programs. The techniques I described in my last post are what I have found very useful in my own programming.
There is a debugger available under HP-UX which is based on gdb but requires an X client (it's called wdb). The debugger is excellent and offers all the essential features and more, but with the relatively slow Internet VPN connection to the server it is hardly usable. Many thanks for your help!

Pete wrote:
The 8MB figure looks familiar. It is the default size that many Unices allocate to the program's stack.
What really puzzles me is that when the HP-UX bible talks about maxdsiz (maximum size of data segment) and maxssiz (maximum size of stack segment) it states for both that the size affects the size of the local heap. This seems weird, I mean the local heap surely resides EITHER on the stack OR in the data segment but not both?!?!
The heap is (virtual address space) - (stack size) - (data size) - (code size). So increasing the stack or data sizes will shrink your heap size.

I'm not convinced that the problem is memory related. I can guarantee that I have no leaks under Windows, and based on the facts that I did not notice any unusual growth in memory usage and that the code differences between the Windows service and the Unix daemon are trivial, I'm unconvinced that it is the application that leaks. Personally I fear that it has something to do with kernel parameter settings (I'm suspicious of maxdsiz and maxssiz) under HP-UX, but I'm really not qualified enough to make such a claim. A fact is that the application is naturally memory hungry (I have seen it peak at over 100MB under heavy load) and it may well be that under HP-UX it cores when it hits an OS imposed wall.
In that case it could be as simple as a ulimit setting. man 3 ulimit. man 1 ulimit (bash). man 2 getrlimit. You should also ensure your code is handling malloc() and "new" failures appropriately.
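Something like this will tell you what the process actually sees at runtime (standard POSIX getrlimit(); just a sketch, the labels are mine):

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        struct rlimit data, stack;

        // Soft/hard limits for the data segment and the stack;
        // RLIM_INFINITY means no limit is imposed.
        if (getrlimit(RLIMIT_DATA, &data) == 0)
            std::printf("data  segment: soft=%lu hard=%lu\n",
                        (unsigned long)data.rlim_cur, (unsigned long)data.rlim_max);
        if (getrlimit(RLIMIT_STACK, &stack) == 0)
            std::printf("stack segment: soft=%lu hard=%lu\n",
                        (unsigned long)stack.rlim_cur, (unsigned long)stack.rlim_max);
        return 0;
    }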
There is one more major difference between the HP-UX and the other environments: the HP9000 is a multi-CPU machine, while both my Linux and WinXP Pro boxes are single-CPU machines. For this reason, concurrency issues (thread synchronisation) are another area I'd like to focus on when chasing this bug.
Indeed thread concurrency issues could be another cause. Can you set the CPU affinity on the HP-UX machine? If you can, set the process to stay on a single CPU and see if the problem goes away.
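To illustrate why a second CPU exposes what a single CPU tends to hide: any shared data updated from two threads without a lock is a race, and the window for that race is far wider with real parallelism. A made-up pthreads sketch (obviously not your code) of the guarded version:

    #include <pthread.h>
    #include <cstdio>

    long counter = 0;                                   // shared state
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    extern "C" void *worker(void *) {
        for (int i = 0; i < 1000000; ++i) {
            pthread_mutex_lock(&lock);                  // remove this pair and the
            ++counter;                                  // increment becomes a data race
            pthread_mutex_unlock(&lock);
        }
        return 0;
    }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, 0, worker, 0);
        pthread_create(&t2, 0, worker, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        std::printf("counter = %ld (expect 2000000)\n", counter);  // wrong if unguarded
        return 0;
    }

Build and link with the pthread library (with gcc, the -pthread flag).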
Anyway, I'm at my wits end and need help from someone smarter.
Regards -- Oliver Jones » Director » oliver(a)deeperdesign.com » +64 (21) 41 2238 Deeper Design Limited » +64 (7) 377 3328 » www.deeperdesign.com

In that case it could be as simple as a ulimit setting. man 3 ulimit. man 1 ulimit (bash). man 2 getrlimit.
This is extremely helpful - I had no idea of such limitations and I'm working on finding out what they are. Many thanks.
You should also ensure your code is handling malloc() and "new" failures appropriately.
This is a real problem because I'm dealing with lots of legacy code. Newer code correctly traps allocation errors using try/catch blocks, but a lot of legacy code uses incorrect constructs like "if (!(p = new P)) // handle allocation error". If I overload the new operator to handle allocation errors correctly I break the code that expects exceptions to be thrown; if I don't, the application can core because of an unhandled exception. Many thanks Oliver!
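One approach I'm toying with (just a sketch against standard C++, nothing implemented yet) is to leave plain new throwing, install a new-handler that logs the failure before std::bad_alloc propagates to the newer try/catch code, and migrate the legacy null-tests to the nothrow form as I find them:

    #include <new>
    #include <cstdio>

    // Called by operator new when an allocation fails: log once, then clear
    // the handler so the next attempt throws std::bad_alloc as normal.
    static void log_allocation_failure() {
        std::fprintf(stderr, "operator new: out of memory\n");
        std::set_new_handler(0);
    }

    int main() {
        std::set_new_handler(log_allocation_failure);

        // Newer code: relies on the exception.
        try {
            char *p = new char[1024];
            delete [] p;
        } catch (const std::bad_alloc &) {
            // handle the failure
        }

        // Legacy "test the pointer" style only works with the nothrow form,
        // because plain new never returns 0 in standard C++.
        char *q = new (std::nothrow) char[1024];
        if (!q) {
            // handle the failure the old way
        }
        delete [] q;
        return 0;
    }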

Pete wrote:
Thank you both Oliver and Michael for your replies.
I'm not convinced that the problem is memory related. I can guarantee that I have no leaks under Windows, and based on the facts that I did not notice any unusual growth in memory usage and that the code differences between the Windows service and the Unix daemon are trivial, I'm unconvinced that it is the application that leaks.
Windows and Unix have different memory models; it's possible that windows is more forgiving with certain kinds of memory accesses than linux. Many unixes will crash if you free memory that is already free'd, or access memory via a dangling pointer. I don't know much about windows but it's conceivable that windows happens to ignore the error. (Some versions of windows for instance would let you dereference null). Maybe a floating point error that windows transmutes into "NaN" but linux raises a fatal signal for? Some other hardware platforms have alignment restrictions: if you access memory in certain ways they will crash. You may also be running into reentrancy problems with the standard library, although I'd suspect that this is unlikely.
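For instance, something like this (deliberately buggy, made up) can appear to work on a forgiving allocator and dump core on a stricter one:

    #include <cstdlib>
    #include <cstring>

    int main() {
        char *p = static_cast<char *>(std::malloc(32));
        std::strcpy(p, "hello");
        std::free(p);

        // Use after free: on some platforms the old contents are still there,
        // on others the block has already been reused or unmapped and this crashes.
        char c = p[0];

        // Double free: undefined behaviour; many allocators abort here.
        std::free(p);
        return c;
    }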
Personally I fear that it has something to do with kernel parameter settings (I'm suspicious of maxdsiz and maxssiz) under HP-UX, but I'm really not qualified enough to make such a claim. A fact is that the application is naturally memory hungry (I have seen it peak at over 100MB under heavy load) and it may well be that under HP-UX it cores when it hits an OS imposed wall.
I wouldn't suspect so. You usually have to go a long long way before the OS kicks in and stops you doing things. 100MB is a "light" process as far as the big unices are concerned.
I have analysed the core dump files many times using gdb but the failure occurs at random places and I found nothing obvious. The only strange thing is that it appears that the application bombs once it tries to allocate more than 8MB of RAM from the heap.
8MB is nothing; you shouldn't have issues until you've gone well over 500MB on an Intel machine.
There is one more major difference between the HP-UX and the other environments: the HP9000 is a multi-CPU machine, while both my Linux and WinXP Pro boxes are single-CPU machines. For this reason, concurrency issues (thread synchronisation) are another area I'd like to focus on when chasing this bug.
Anyway, I'm at my wits end and need help from someone smarter.
What signal is it dying with? Unix kills processes by sending them various signals, and knowing what the type of signal is helps diagnose the problem. When you use gdb on a core file it usually says near the beginning "Program exited with signal 11 (Segmentation Fault)". Often your shell says what a process exits with too.

If it exits on "Abort" then it means that the compiler/standard library found a bug (eg, an uncaught exception). Ditto for "Quit". If it exits on "Bus error" then it's probably that you're doing something with pointers that the hardware doesn't support (eg, writing to a non-aligned address). If it exits on "Segmentation fault" then it's probably that you're doing something silly with pointers (eg, dereferencing a NULL pointer, dereferencing a free()'d pointer, or similar). If it's something more exotic (Stack Fault/Maximum CPU Exceeded) then you're probably running into an operating system limit. There is documentation about Unix signals available on the wiki here: http://www.wlug.org.nz/UnixSignals

Linux has a great program called "valgrind" that can be used for debugging memory and threading issues. If you can get it to compile under Linux on x86 I'd **highly** recommend you run valgrind on your program.
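If nothing is telling you the signal, you can also have the daemon report it itself before it dies. A rough sketch (plain POSIX, nothing HP-UX specific): catch the fatal signals, log a line with write() (which is async-signal-safe), then restore the default action and re-raise so the core file still gets written.

    #include <signal.h>
    #include <unistd.h>
    #include <string.h>

    // Note that a fatal signal arrived, then let the default action run so
    // the process still dies and still produces its core file.
    extern "C" void fatal_handler(int sig) {
        static const char msg[] = "daemon caught a fatal signal, re-raising\n";
        write(2, msg, sizeof(msg) - 1);   // write() is async-signal-safe
        signal(sig, SIG_DFL);             // restore the default action
        raise(sig);                       // deliver the signal again
    }

    int main() {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = fatal_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, 0);
        sigaction(SIGBUS,  &sa, 0);
        sigaction(SIGABRT, &sa, 0);

        // ... run the daemon as normal ...
        return 0;
    }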

Windows and Unix have different memory models, it's possible that windows is more forgiving with certain kinds of memory accesses than linux.
I first heard about this from one of the developers of the hit game SimCity, who told me that there was a critical bug in his application: it used memory right after freeing it, a major no-no that happened to work OK on DOS but would not work under Windows where memory that is freed is likely to be snatched up by another running application right away. The testers on the Windows team were going through various popular applications, testing them to make sure they worked OK, but SimCity kept crashing. They reported this to the Windows developers, who disassembled SimCity, stepped through it in a debugger, found the bug, and added special code that checked if SimCity was running, and if it did, ran the memory allocator in a special mode in which you could still use memory after freeing it.

-- Joel On Software, http://www.joelonsoftware.com/articles/APIWar.html

That's pretty forgiving. ;)

Craig

* Craig Box <craig(a)dubculture.co.nz> [2004-06-17 07:03]:
[...] it used memory right after freeing it, a major no-no that happened to work OK on DOS but would not work under Windows [...]
That's pretty forgiving. ;)
It's not Windows being forgiving here, though; it's DOS. Regards, -- Aristotle "If you can't laugh at yourself, you don't take life seriously enough."
participants (6):
- A. Pagaltzis
- Craig Box
- Michael Cree
- Oliver Jones
- Perry Lorier
- Pete