Dynamic "normal" thresholds for CPU, disk, network, etc
(It's generally considered bad form to hijack a thread by changing topic mid-way. I'm posting my response with a new subject, so that the original thread can continue undiluted.)
On 3 October 2013 09:48, Adam Goryachev <mailinglists at websitemanagers.com.au> wrote:
PS, not relevant to this entire discussion, but one thing I've been battling with is trying to define a "normal" status. eg, I can set the CPU load to 5, which normally means the status is always green, but one day the cpu load might be 4 at 2pm, and that is abnormal, even though a load of 4 at 2am is normal. Does anyone use/do anything to automatically watch the current values, and learn what range is "normal" on this day/time? For me, this especially applies to counters related to network performance, disk performance, etc.
RRDs can handle this using Holt-Winters aberrant behaviour detection. I set this up once on an MRTG system, but as yet have not tried to get it working for Xymon-derived RRD files. On my list of things to do.
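To make the idea concrete, here is a rough sketch in Python of additive Holt-Winters forecasting with a deviation band, which is the general technique behind the aberrant behaviour detection. It is not RRDtool's actual code. The function name, the warm-up handling, the sample data and the parameter values (alpha, beta, gamma, delta) are all assumptions chosen for illustration.

```python
def holt_winters_flags(values, period, alpha=0.1, beta=0.0035, gamma=0.1, delta=2.0):
    """Return (index, observed, predicted, aberrant) tuples for samples after the first season.

    aberrant is None while the deviation for that time slot is still being learned.
    All parameter defaults are illustrative assumptions, not recommendations.
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full seasons of data")

    # Simplified initialisation (an assumption): baseline is the mean of the
    # first season, no trend, seasonal offsets are distances from that mean.
    first = values[:period]
    level = sum(first) / period
    trend = 0.0
    seasonal = [v - level for v in first]
    deviation = [None] * period  # per-slot smoothed absolute error

    results = []
    for t in range(period, len(values)):
        slot = t % period
        observed = values[t]
        predicted = level + trend + seasonal[slot]
        error = abs(observed - predicted)

        band = deviation[slot]
        if band is None:
            aberrant = None            # first pass: just record the error
            deviation[slot] = error
        else:
            aberrant = error > delta * band
            deviation[slot] = gamma * error + (1 - gamma) * band

        # Standard additive Holt-Winters updates.
        previous_level = level
        level = alpha * (observed - seasonal[slot]) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[slot] = gamma * (observed - level) + (1 - gamma) * seasonal[slot]

        results.append((t, observed, predicted, aberrant))
    return results


# Made-up hourly load: about 1 at night, about 4 during the day, small jitter.
PERIOD = 24
samples = []
for i in range(6 * PERIOD):
    hour = i % PERIOD
    base = 4.0 if 8 <= hour < 20 else 1.0
    jitter = 0.1 * ((i * 7) % 5 - 2)   # deterministic, between -0.2 and 0.2
    samples.append(base + jitter)

spike_index = 5 * PERIOD + 2           # 2am on the last day
samples[spike_index] = 4.0             # fine at 2pm, abnormal at 2am

results = holt_winters_flags(samples, PERIOD)
flagged = [t for t, _, _, aberrant in results if aberrant]
print("flagged sample indexes:", flagged)
```

With so little history and such a narrow learned band, ordinary samples may be flagged as well as the spike. Choosing parameters that avoid this is the tuning problem.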
I think it's fairly easy to add the 6 extra consolidation functions (HWPREDICT, SEASONAL, etc) into the RRD file using rrdtune, or add them to the rrddefinitions.cfg file and recreate the RRD files. Then you just need to adjust the graph definitions to show the expected ranges. I think the tricky part is to specify the consolidation function parameters required to produce useful predictions for "normal" based on the nature of the data being collected.
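As a starting point, here is a small Python sketch that assembles an "rrdtool create" command line with a Holt-Winters archive. It relies on my understanding of the rrdcreate documentation: the syntax is RRA:HWPREDICT:rows:alpha:beta:seasonal-period, and specifying only HWPREDICT makes rrdtool create the companion archives implicitly. The file name, data source name and all numbers are hypothetical. This creates a fresh test file and does not touch Xymon's own RRDs or rrddefinitions.cfg.

```python
def hw_create_args(rrd_path, ds_name, step=300, hw_rows=1440,
                   alpha=0.1, beta=0.0035, season_rows=288):
    """Build an argument list for 'rrdtool create' with one HWPREDICT archive.

    Defaults are illustrative assumptions: 5-minute samples, a one-day
    season (288 samples) and five days of predictions (1440 rows).
    """
    return [
        "rrdtool", "create", rrd_path,
        "--step", str(step),
        # One gauge data source with a heartbeat of twice the step.
        f"DS:{ds_name}:GAUGE:{2 * step}:U:U",
        # Plain archive holding the observed values.
        f"RRA:AVERAGE:0.5:1:{hw_rows}",
        # Holt-Winters prediction archive.
        f"RRA:HWPREDICT:{hw_rows}:{alpha}:{beta}:{season_rows}",
    ]


# Hypothetical file and data source names, for illustration only.
args = hw_create_args("hwtest.rrd", "la")
print(" ".join(args))
```

The list can be passed to subprocess.run(args, check=True) on a machine with rrdtool installed. Try it on a scratch file first, since the right alpha, beta and season length depend on the data being collected.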
The seminal paper (AFAIK) on this was written by Brutlag and presented at the Usenix "LISA 2000" sysadmin conference. It has example graphs and RRDtool definitions to do this, as well as a complete explanation (deeper than I can grok) of how it all works.
https://www.usenix.org/legacy/events/lisa00/full_papers/brutlag/brutlag_html...
Also, people have described their adventures previously on The List. For example:
http://lists.xymon.com/pipermail/xymon/2012-October/035810.html
If you can get this going to your satisfaction, perhaps you could document what you did and share it with us.
Cheers,
Jeremy