Thoughts on Usefulness/Reliability of Purple Alerts
On Mon, August 24, 2015 10:15 am, Sean MacGuire wrote:
OK, I'll chime in... I wrote Big Brother so know something about purple alerts.
The purple alert was something no other monitoring system did, and it took care of the problem of a BB (and Xymon) client dropping dead and/or machines being in a zombie state (i.e. responding to pings but otherwise hung).
They're useful as an indication of something being wrong, vs a red or yellow alert, which provides a clear and actionable problem status.
So purple alerts are awesome, unless your server has lost contact with the clients reporting in and everyone goes purple at the same time resulting in the "Massive Purple Explosion".
Background explanation - the idea was to timestamp reports into the future, and if a client doesn't report in by then, the validity of the last report is in question - it works exactly the same way as the expiration date on a carton of milk - the milk might not have gone bad yet, but you might want to check it before drinking it.
Matt Vander Werf wrote:
This is primarily for Henrik and J.C., but anyone else is free to chime in their thoughts on this as well!
*Background: We have a Xymon server (the latest Terabithia RPM on RHEL 7) in production that monitors around 1950 hosts (and consistently growing). About a week ago, we experienced some pretty bad purple alert storms in the middle of the night that were all false-positive alerts (over 300 alerts one night). Most of the tests that went purple went back to green at the next update interval. At this point, we've been unable to figure out a root cause behind this issue, but it hasn't happened again since early last week (all the easy, understandable possible causes have been ruled out: network load/bandwidth, CPU load of Xymon server and of Xymon clients affected, etc.).
We have been using purple alerts for some time now, find them fairly reliable for the most part, and think they are useful, as machines hang or something similar (causing the Xymon client on the machine to stop being able to report to the Xymon server) and we don't get any red or yellow alerts for any other tests (sometimes a machine can hang but still have a network connection that can be successfully pinged by Xymon, we have found). We haven't had any major issues with false-positive purple alerts (for the most part), or any purple alert storms, since we started using them consistently for all our machines a couple years ago.
I understand that when Xymon was first forked from Big Brother a long while back, it may have been noted that one big change from Big Brother was that you didn't need to do purple alerts (or something like that) and that it was discouraged to use purple alerts, as they were seen as widely unreliable. (I'm hearing this from a coworker of mine, who set up our original Xymon server some 5 years ago, but have been unable to find what he's referring to.) But from what I can see from the current documentation and the mailing list archives, I'm not seeing any place where the use of purple alerts is discouraged due to them being unreliable.
*Question(s): So, I wanted to see what the current thinking/view regarding purple alerts and the use of purple alerts was by both the original main maintainer, Henrik, and the more current main maintainer, J.C. (at least of the current release). Are purple alerts still considered wholly unreliable, or even somewhat unreliable (or were they ever)? Are they discouraged in any way or fashion from being used? Have they caused issues for any of you on this list? Or vice versa: Have they worked well for you? I'm fully aware that this purple alert storm issue we had could be just a one-off occurrence and that we could have no further issues in the future with purple alerts.
I understand that purple alerts are different than other alerts, like red and yellow alerts, in that it is an indication that the Xymon client has stopped working/reporting (on a per-test basis) to the Xymon server for some reason, rather than an issue from a specific test (e.g. with the CPU load, memory, etc.).
*(Possible) Feature Request: In addition, I'd be interested if there was a way that you could only get one alert for a machine if say all the tests for that machine go purple, instead of an alert for each purple test. I don't believe this is possible currently, correct? Is this something that could possibly be implemented in the future? I understand if it's not or if it wouldn't be very easy.
I appreciate your time in answering my questions and look forward to your input! (And apologies for the long-winded e-mail!)
Thanks very much in advance!!
-- Matt Vander Werf
Matt,
Generally speaking, a purple alert should be seen first and foremost as an indication of a failure in the monitoring *system*... where "system" includes the client pushing data up from the various servers you're paying attention to.
By having a calculation made on each message's receipt of how long that message is good for (receipt time + [default, or specified]), we have a "fail safe" for an unknown issue occurring that requires attention. The proximate cause of the purple is the failure to receive a message. Whether that's caused by a hang or death of the usual originator, a bug in a xymonproxy, a cut network cable, or xymond being unable to handle all of the traffic sent to it before it times out, is left somewhat as an exercise for the administrator.
Because purple alerts are generated from xymond's own view of its internal state (calculated once a minute) and are never sent IN to xymond, purple alerts should be a reliable indicator that... some other type of unreliability is going on :)
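As a rough illustration of the mechanism described above (an expiry stamped on each message at receipt, then a periodic pass over the server's own state), here is a minimal Python sketch. It is not xymond's actual code: the names `StatusRecord`, `receive_status` and `sweep_for_purple` are hypothetical, and the 30-minute default lifetime is an assumption made for the example.

```python
from dataclasses import dataclass

# Assumed default lifetime for a report, for illustration only.
DEFAULT_VALIDITY_SECONDS = 30 * 60


@dataclass
class StatusRecord:
    host: str
    test: str
    color: str
    valid_until: float  # receipt time + lifetime


def receive_status(table, host, test, color, received_at,
                   lifetime=DEFAULT_VALIDITY_SECONDS):
    """Store a report and stamp it with an expiry time in the future."""
    table[(host, test)] = StatusRecord(host, test, color, received_at + lifetime)


def sweep_for_purple(table, now):
    """Periodic pass (e.g. once a minute): expired records turn purple.

    Purple is never received from a client; it is derived here from the
    server's own table. Returns the keys that changed on this pass.
    """
    newly_purple = []
    for key, record in table.items():
        if record.color != "purple" and now > record.valid_until:
            record.color = "purple"
            newly_purple.append(key)
    return newly_purple


table = {}
receive_status(table, "web01", "cpu", "green", received_at=0)
receive_status(table, "web01", "memory", "green", received_at=0, lifetime=600)
expired = sweep_for_purple(table, now=900)  # only "memory" has outlived its stamp
print(expired)
```

A fresh report simply overwrites the record with a new expiry, which is why a test that went purple can return to green at the next update interval.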
Because of the wide possibility of different configurations, it's a little dangerous to create a one-size-fits-all strategy for purples. In a typical xymon installation with xymonnet and xymond_client running locally on the same machine, with no proxies or network segments in the middle, and with clients reporting directly in as well, you really shouldn't see any purple alerts outside of clients dying... And if the client is dying because the box is dying, by default you'll only get the 'conn' test red alert instead of the various xymond_client and xymonnet-generated ones (unless you're using the 'noclear' line in hosts.cfg).
Your suggestion to have only a single 'purple' come through would *typically* work, but you'd have to ask yourself which test would be the representative one. In our case, we found it easiest to nominate a specific xymond_client test -- "memory" -- and only send purple notifications for that out to our alert team. That takes care of xymond_client, while leaving esoteric situations caused by the failure of different sharded xymonnet instances, xymonproxy instances, or custom independent tests free to fail in their own way.
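The "representative test" approach can be sketched as a simple filter in Python. This is only an illustration of the idea, not how Xymon's alerting is configured or implemented; `should_notify` and `filter_notifications` are hypothetical names, and the rule that red and yellow always pass is an assumption made for the example.

```python
REPRESENTATIVE_TEST = "memory"  # the test nominated to stand in for the client


def should_notify(test, color, representative=REPRESENTATIVE_TEST):
    """Pass red/yellow for any test, but purple only for the nominated test."""
    if color == "purple":
        return test == representative
    return color in ("red", "yellow")


def filter_notifications(events):
    """events: iterable of (host, test, color) tuples."""
    return [event for event in events if should_notify(event[1], event[2])]


# A hung client: every client-side test on web01 goes purple at once.
events = [
    ("web01", "cpu", "purple"),
    ("web01", "disk", "purple"),
    ("web01", "memory", "purple"),
    ("db01", "disk", "red"),
]
notifications = filter_notifications(events)
print(notifications)
```

With this filter the three purple events for web01 collapse to a single notification, while the red disk alert on db01 still goes out. A purple on a test that is not produced by the client (a custom script, say) would be suppressed here, which is the trade-off to keep in mind.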
Again, it's the non-typical cases where it gets tricky. What about custom tests that aren't being generated by xymond_client that are still functioning? Perhaps you have xymonnet running on a different machine that's reporting back to xymond (or to a xymonproxy that's reporting back to xymond!) that has failed in some way. And of course, it could be that xymond is under heavy load and is unable to keep up with incoming messages generally (something we experienced in both the TCP and BFQ configs as we were scaling out).
Sort of along these lines, however, I'd been considering having a more "host-wide" way of defining certain failure states directly within xymond, which would allow some of this override logic to happen centrally (and more reliably). Imagine a 'conn' being red optionally causing *all* tests to fail-to-clear, removing the need for this calculation from the remainder of xymonnet tests. Or a true host-wide "disable" that gets applied to all tests, even new ones, as a xymond flag. A host-wide "purple-state" could be conceptualized as well.
That's just a thought, though, and it kind of depends on whether people would find such a feature useful.
Anyway, I hope that's answered some of your questions!
Regards,
-jc