Possible Bug in Xymon Related to Purple Statuses
Hello,
I am having an issue with Xymon where instead of tests going purple when the client stops reporting, the tests are going clear.
I noticed this with a host that had all its tests go clear instead of purple. Turns out the network interface on the machine had completely died and this had happened a week ago! We never noticed because instead of going purple the tests for the machine went clear!
This seems to only be an issue with a certain group of machines. For this group of machines, we have the ping test disabled by using the 'noping' option on all of them. This is because they are all behind a firewall with private IP addresses so they are unable to be contacted by the Xymon server. But they can still send client data out to the Xymon server.
Turns out, ever since we started using the 'noping' option for all of them, none of the machines have ever gone purple...
I tested this by stopping the xymon-client service on one of the machines in question, and sure enough, after the STATUSLIFETIME time limit, all the tests for that host went clear, instead of going purple.
I looked through the different logs (I already had most set in debug mode for a different reason), and I didn't see much that would explain this (but I could have missed something).
I did notice in the xymond log file that, according to xymond, they should have been going purple and not clear.
Here's an excerpt from that log file (this is the machine which I stopped the service on):
9680 2015-12-02 09:57:48.040111 -> check_purple_status
9680 2015-12-02 09:57:48.047630 Purple log from <HOST> memory
9680 2015-12-02 09:57:48.047674 ->handle_status
9680 2015-12-02 09:57:48.047676 modifyonly = 0, changed = 0
9680 2015-12-02 09:57:48.047680 - sum: 0, synced: 0, oldcolor: 0, newcolor: 1, modifychanged: 0
9680 2015-12-02 09:57:48.047682 posting to stachg channel: host=<HOST>, test=memory
9680 2015-12-02 09:57:48.047684 -> posttochannel
9680 2015-12-02 09:57:48.047697 Posting message 14359 to 1 readers
9680 2015-12-02 09:57:48.047703 <- posttochannel
9680 2015-12-02 09:57:48.047705 posting to status channel
9680 2015-12-02 09:57:48.047706 -> posttochannel
9680 2015-12-02 09:57:48.047712 Posting message 72429 to 2 readers
9680 2015-12-02 09:57:48.047726 <- posttochannel
9680 2015-12-02 09:57:48.047727 <-handle_status
Basically this showed up for all the different tests for this machine.
And here's the event log for the same machine:
Wed Dec 2 09:57:48 2015 <HOST> cpu green -> clear <https://mon.crc.nd.edu/xymon-cgi/historylog.sh?HOST=jim.vectorbase.org&SERVICE=cpu&TIMEBUF=1449068268>
Wed Dec 2 09:57:48 2015 <HOST> disk green -> clear <https://mon.crc.nd.edu/xymon-cgi/historylog.sh?HOST=jim.vectorbase.org&SERVICE=disk&TIMEBUF=1449068268>
Wed Dec 2 09:57:48 2015 <HOST> inode green -> clear <https://mon.crc.nd.edu/xymon-cgi/historylog.sh?HOST=jim.vectorbase.org&SERVICE=inode&TIMEBUF=1449068268>
Wed Dec 2 09:57:48 2015 <HOST> memory green -> clear
Any thoughts as to what's going on? Looks like a bug to me...
Thanks!!
-- Matt Vander Werf
On 12/3/2015 9:43 AM, Matt Vander Werf wrote:
Any thoughts as to what's going on? Looks like a bug to me...
This is intentional. It's a result of the normal stale-status behavior when a box is actually down: if 'conn' is not OK, the client tests go clear instead of purple. You'll get the same behavior on a box that's "normally" conn-down (as far as xymon knows) anyway.
You'll want to add the 'noclear' tag as well to any of the systems that are not pingable if you want the client tests to actually go purple.
https://www.xymon.com/help/manpages/man5/hosts.cfg.5.html#lbAG
/noclear/
/Controls whether stale status messages go purple or clear when a
host is down. Normally, when a host is down the client statuses
("cpu", "disk", "memory" etc) will stop updating - this would
usually make them go "purple" which can trigger alerts. To avoid
that, Xymon checks if the "conn" test has failed, and if that is
true then the other tests will go "clear" instead of purple so you
only get alerts for the "conn" test. If you do want the stale
statuses to go purple, you can use the "noclear" tag to override
this behaviour./
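To make the interaction concrete, here is a small Python sketch of the decision described in the man page text above. It is an illustrative model only, not Xymon's actual code: the function names, the example addresses and the hostname are hypothetical, and treating a 'noping' host as "conn not OK" is an assumption taken from the behavior reported in this thread.

```python
# Illustrative model only; this is not Xymon source code.

def parse_tags(hosts_cfg_line):
    """Return the set of tags that follow the '#' on a hosts.cfg host line."""
    _, _, tag_part = hosts_cfg_line.partition("#")
    return set(tag_part.split())


def stale_color(tags, conn_ok):
    """Color a stale client status is expected to take, per the man page text."""
    if "noclear" in tags:
        # 'noclear' overrides the clear behavior: stale statuses go purple.
        return "purple"
    # Assumption from this thread: a 'noping' host is treated like conn-down.
    if "noping" in tags or not conn_ok:
        return "clear"
    return "purple"


# Hypothetical hosts.cfg entries (address and hostname are made up).
before = "10.0.0.15  behindfw.example.com  # noping"
after = "10.0.0.15  behindfw.example.com  # noping noclear"

print(stale_color(parse_tags(before), conn_ok=True))
print(stale_color(parse_tags(after), conn_ok=True))
```

Under those assumptions, the first entry models the hosts that silently went clear, and the second shows the same host with 'noclear' added so that a silent client shows up as purple.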
Regards, -jc
Ah, okay. That makes sense! I guess I was thinking that this only happened with other network tests and not the normal tests as well, but that appears to be a different setting. I must have misread or missed the noclear option when I was looking through the documentation....
I think maybe this should be documented better...something like if you use noping then you should also use noclear if you want purple statuses. But that's just my thinking...
Thanks as always, J.C.!!
-- Matt Vander Werf
On Thu, Dec 3, 2015 at 12:49 PM, Japheth Cleaver <cleaver at terabithia.org> wrote:
You'll want to add the 'noclear' tag as well to any of the systems that are not pingable if you want the client tests to actually go purple.
Regards, -jc
On 12/3/2015 9:59 AM, Matt Vander Werf wrote:
Ah, okay. That makes sense! I guess I was thinking that this only happened with other network tests and not the normal tests as well, but that appears to be a different setting. I must have misread or missed the noclear option when I was looking through the documentation....
I think maybe this should be documented better...something like if you use noping then you should also use noclear if you want purple statuses. But that's just my thinking...
Agreed. This should probably be documented a little better. IIRC it's automatic for the old dialup-type configs.
Longer term, there was a TODO item to make the conn/disable handling (clearing and dependency checking) something that happens centrally at the xymond level across the board.
Right now, the effects of a server being 'conn' down are split between xymonnet (dependency checking, which suppresses related 'red' TCP checks into 'clear') and xymond (stale tests, xymond_client-derived or otherwise, going 'clear' instead of purple).
It'd seem to make sense to make a 'conn' loss (as well as a 'disable <host>.*') a core host flag with central effects.
-jc