My Data Center operators have informed me that the Critical Systems Page becomes unavailable to them with an "Internal Server Error". This happens predictably every night at approximately 2:45 (GMT-5) and lasts approximately 30 minutes.
This problem has been impacting their ability to ensure that they properly inform us if any of our systems with sensitive SLAs become unavailable. A quick solution would be appreciated!!
I have been monitoring the Critical Page through Xymon itself for about a week now and have included the history. I have found no useful information in any Xymon or Apache logs. I also see no way of enabling any debugging logs for this script. I enabled the --debug option in cgioptions.cfg, but that caused a bunch of info to be displayed on the page, which produced even more complaints, so I turned it back off.
There is a series of entries in /var/log/messages every time the URL for the page is requested. During the time when the Critical Systems Page is unavailable, no other CGI script that is part of Xymon has any issues. (I offer that up because all of these scripts are the same, hard-linked file.)
The /var/log/messages file shows me that ABRT is also capturing crash information. It *was* discarding this information until I enabled the option for ProcessUnpackaged in /etc/abrt/abrt-action-save-package-data.conf. I have included all of the files (excluding the sosreport data; not sure if that might contain sensitive information or not).
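For anyone who wants to pull out just the entries from the failure window, a minimal sketch along these lines might help. It assumes the traditional syslog timestamp prefix (month, day, HH:MM:SS) at the start of each line. The function name, the 02:30 to 03:30 window, and the optional keyword filter are my own placeholders, not anything provided by Xymon or ABRT.

```python
import re
from datetime import time

# Assumed traditional syslog prefix: "Mon DD HH:MM:SS host process: message"
SYSLOG_TIME = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):(\d{2}):(\d{2})\s")


def lines_in_window(lines, start=time(2, 30), end=time(3, 30), keyword=None):
    """Return lines whose time of day is within [start, end].

    If keyword is given, only lines containing it are kept.
    Lines without a recognizable timestamp are skipped.
    """
    selected = []
    for line in lines:
        match = SYSLOG_TIME.match(line)
        if not match:
            continue
        stamp = time(*(int(part) for part in match.groups()))
        if start <= stamp <= end and (keyword is None or keyword in line):
            selected.append(line.rstrip("\n"))
    return selected
```

Passing an open handle on /var/log/messages together with keyword="abrt" should return only the ABRT-related lines logged around the nightly failure, which keeps the excerpt short enough to post.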
All of this information I have placed into my GitHub repo for people to view: https://github.com/edschminke/xymon
If I can include any other information about this crash, please let me know.
Thanks!
Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN 55912
Phone: (507) 434-6817
edschminke at hormel.com | www.hormelfoods.com