CPU load average not being graphed for some servers
Hi all,
I'm having trouble tracking this one down. For many of my Xymon-monitored servers I can see the graphs on the CPU load page and they have data in them. Yet for many others the CPU graphs are empty. I can see the graphs but there is no data.
Investigation reveals that for the servers that are missing data, the .cpu file has not been updated in months. Yet when I look at the available client data, it shows the load average in the uptime section, as in:
[uptime]
5:07pm 2 users, load average: 0.53, 0.71, 0.62
which is, I believe, where Xymon gets the LA from.
The file ownership and permissions on the data/hist/*.cpu files are all correct (hobbit:hobbit and 644 on this system). In fact, the only thing I noticed that doesn't make sense is that the data/hostdata/ directories are not being updated for the servers that lack CPU graphs, even though I can see the host data in the Xymon web pages.
Confused? I know I am.
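For anyone checking for the same symptom, the following is a minimal sketch of how the stale files could be listed. The function name stale_cpu_files and the 30-day default are assumptions made for illustration; only the data/hist/*.cpu location comes from the description above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Xymon): list *.cpu files under a
# directory that have not been modified for more than N days (default 30).
stale_cpu_files() {
    dir=$1
    days=${2:-30}
    # -mtime +N matches files whose last modification is more than N days old
    find "$dir" -type f -name '*.cpu' -mtime +"$days" -print
}
```

It might be called as `stale_cpu_files data/hist 60`, run from the Xymon server's top-level directory.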
|\/|
--
Martin Ward
Manager, Technical Services
DDI:+44 (0) 20 7863 5218 / Fax: +44 (0)20 7863 9999 / www.colt.net
Colt Technology Services, Unit 12, Powergate Business Park, Volt Avenue, Park Royal, London, NW10 6PW, UK.
Help reduce your carbon footprint | Think before you print. Registered in England and Wales, registered number 02452736, VAT number GB 645 4205 50
[Colt Disclaimer] The message is intended for the named addressee only and may not be disclosed to or used by anyone else, nor may it be copied in any way. The contents of this message and its attachments are confidential and may also be subject to legal privilege. If you are not the named addressee and/or have received this message in error, please advise us by e-mailing abuse at colt.net and delete the message and any attachments without retaining any copies. Internet communications are not secure and Colt does not accept responsibility for this message, its contents nor responsibility for any viruses. No contracts can be created or varied on behalf of Colt Technology Services, its subsidiaries, group companies or affiliates ("Colt") and any other party by email communications unless expressly agreed in writing with such other party. Please note that incoming emails will be automatically scanned to eliminate potential viruses and unsolicited promotional emails. For more information refer to www.colt.net or contact us on +44(0)20 7390 3900
From: Brand, Thomas R. [mailto:TRBrand at cvs.com] Sent: 13 October 2010 17:53
Martin,
This may be related to an issue I came across last year.
The (cpu load and users & processes) graphs appear to be dependent on the exact output of the 'uptime' command.
In my case, the graphs did show for the first 24 hours of uptime, didn't show after 24 hours, then started showing again after 48 hours of uptime.
I was finally able to isolate this to the output of 'uptime' for the first full day - it showed 'day' not 'days'.
uptime
12:41pm up 196 days 21:49, 5 users, load average: 0.23, 0.19, 0.17
uptime
12:41pm up 1 day 21:49, 2 users, load average: 0.06, 0.02, 0.00
I 'fixed' this by editing hobbitclient-linux.sh to modify the output of the 'uptime' command, replacing 'day ' with 'days '.
echo "[uptime]"
uptime | perl -pe "s/^(.*) day (.*)/\1 days \2/"
It appears your uptime output does not include the 'up x days HH:MM' part at all.
Cheers,
Tom Brand
On Thursday, 14 October 2010 15:46:28 Ward, Martin wrote:
Thanks Tom, that got me sorted. It seems that Solaris relies on the BOOT_TIME record held in the /var/adm/utmpx file. This file has been rotated out of the way in order to save disk space so I got no uptime values at all. It looks like this messed with the load average data since the uptime output didn't have any uptime in it.
Like you I have hacked the hobbitclient-sunos.sh file and put in a small perl scriptlet so that if there is no uptime it adds a fake value in just to ensure that the load averages get stored properly.
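The scriptlet itself was not posted, so the following is only a sketch of the same idea using sed rather than perl. The function name add_fake_uptime and the placeholder text "up 0 day(s), 0:00," are assumptions; whether the server accepts that exact placeholder has not been verified here.

```shell
#!/bin/sh
# Hypothetical filter (not the scriptlet described above): if an uptime
# line has no " up " field, insert a fake one after the time-of-day so the
# line has the same shape as normal Solaris output.
add_fake_uptime() {
    # 1st expression: drop leading blanks.
    # 2nd expression: only on lines without " up ", insert the placeholder
    # after the first word (the clock time).
    sed -e 's/^ *//' \
        -e '/ up /!s/^\([^ ]*\) */\1 up 0 day(s), 0:00, /'
}

# Demo using the broken line quoted in this thread.
printf '%s\n' '11:25am 1 user, load average: 1.21, 0.64, 0.46' | add_fake_uptime
```

In a client script this would be used as `uptime | add_fake_uptime` directly after the `echo "[uptime]"` line. Lines that already contain an uptime field pass through unchanged.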
|\/|
From: Buchan Milne [mailto:bgmilne at staff.telkomsa.net] Sent: 15 October 2010 12:21
It would be useful if you could instead supply the "client data" for the host in the cases where this breaks, so it can be fixed in hobbitd_client instead.
Regards, Buchan
Hi Buchan,
I did provide the client data output for the relevant section; I didn't think it made sense to dump the whole output packet when only two lines are the problem. Still, here it is again with more detail on the issue. Hopefully this will help someone to code around it.
The original issue I had was that if the /var/adm/utmpx file didn't exist or didn't contain a BOOT_TIME record then the output of the uptime(1) command looked like this in the Xymon client data:
[uptime]
11:25am 1 user, load average: 1.21, 0.64, 0.46
[who]
...
For reasons unknown (because I haven't dug through the code) this stopped Xymon from logging the load average data even though it displayed it at the top of the "cpu" web page.
The Xymon code seems to require the output of uptime to look like this:
[uptime]
12:29pm up 133 day(s), 2:34, 5 users, load average: 4.90, 4.63, 4.41
[who]
...
If it helps you any, I have also seen this uptime output when the uptime is less than one day:
[uptime]
12:29pm up 2:34, 5 users, load average: 4.90, 4.63, 4.41
[who]
...
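As a rough illustration of what "coding around this" could mean, here is a sketch that pulls the 1-minute load average out of all three uptime formats above by keying only on the "load average:" label. This is not how hobbitd_client actually parses the section (I have not read that code); the function name load_avg_1min is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: print the 1-minute load average from an uptime
# line, ignoring whether an "up ..." field is present or what it looks like.
load_avg_1min() {
    # Capture the first number after "load average:" (up to the first comma).
    sed -n 's/.*load average: *\([0-9][0-9.]*\),.*/\1/p'
}

# Demo with the three formats seen in this thread.
printf '%s\n' \
    '11:25am 1 user, load average: 1.21, 0.64, 0.46' \
    '12:29pm up 133 day(s), 2:34, 5 users, load average: 4.90, 4.63, 4.41' \
    '12:29pm up 2:34, 5 users, load average: 4.90, 4.63, 4.41' |
    load_avg_1min
```

A parser written this way would not depend on the presence or exact wording of the uptime field, which is the part that varies between systems.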
I hope this helps,
|\/|