"Disable until change"
We often use "disable until ok", but it was brought to my attention that it has burned us from time to time. For example:
Host foo is yellow on disk. But that's ok. We're going to allocate some new storage for it in the next service window. The test is marked "disable until ok". But before the service window arrives, something chews up a whole bunch of disk and the now-red test continues to be blue because the test is not yet ok.
We sometimes use "acknowledge" for this function, but the non-green screen can get kind of cluttered this way.
Does anyone have a good way to fake "disable while status remains unchanged"?
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
I personally do not think using disable is a good idea for unplanned problems. For one, if you use the reporting features, you will be mixing planned and unplanned downtime together. Disable is really for times when you know exactly what is going on with the system, and alerting is not needed/someone is watching the system manually. That's my take on it anyway, and what I tell the people that work with me.
Ryan Novosielski - Senior Technologist
Rutgers Biomedical and Health Sciences (Note: UMDNJ is now Rutgers-Biomedical and Health Sciences)
OIRT/High Perf & Res Comp - MSB C630, Newark
novosirj at rutgers.edu - 973/972.0922 (2x0922)
(In reply to John Thurston, Nov 2, 2015, 18:59.)
I'd agree that disable is intended more as a human override about the alertability of a host+service combo. The acknowledge functionality is more in line with what it seems you're looking for: "It's still Yellow, still keep track of things, but don't alert downstream unless something explicitly wants to."
If the issue is with the nongreen page, I believe there should be a way to remove ack'd items from that page (but it might require running a second instance of xymongen just to spit out that page, potentially with a BOARDFILTER in there to limit it further).
"Disable until Change" would be possible, but we'd need to store the actual underlying color to compare the incoming report to, since disabling works by overriding the color that was sent and forcing it blue. "Unack on Change" works precisely because we still have a meaningful current color to compare an incoming message to.
-jc
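A minimal sketch of the bookkeeping described above, not Xymon's actual implementation: the record keeps the underlying color that was current when the test was disabled, shows blue while incoming reports match it, and lifts the disable as soon as a different color arrives. The names `TestState`, `disable_until_change` and `report` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TestState:
    """Hypothetical per-test record; not Xymon's real data structure."""
    shown: str = "green"            # color displayed on the board
    frozen: Optional[str] = None    # underlying color captured at disable time


def disable_until_change(state: TestState) -> None:
    """Force the test blue, remembering the color it had when disabled."""
    state.frozen = state.shown
    state.shown = "blue"


def report(state: TestState, incoming: str) -> None:
    """Apply an incoming status color to the record."""
    if state.frozen is None:
        state.shown = incoming      # not disabled: show what was sent
    elif incoming != state.frozen:
        state.frozen = None         # underlying color changed: lift the disable
        state.shown = incoming
    # otherwise the underlying color is unchanged and the test stays blue


disk = TestState(shown="yellow")
disable_until_change(disk)
report(disk, "yellow")
after_same = disk.shown             # still disabled
report(disk, "red")
after_change = disk.shown           # disable lifted by the change
```

The design point is the extra `frozen` field: without the stored underlying color there is nothing to compare an incoming report against once the displayed color has been forced to blue.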
(In reply to Ryan Novosielski, 11/2/2015 4:21 PM.)
I wouldn't want disable until change, FYI, so I'd recommend it be optional if at all. I use disable specifically for equipment that is out of service and will be up and down, etc.
Ryan Novosielski
On 11/3/2015 9:42 AM, Japheth Cleaver wrote:
If the issue is with the nongreen page, I believe there should be a way to remove ack'd items from that page
Ahhh. I now see NOPROPACK:[+|-]testname[,[+|-]testname] for hosts.cfg. I think this might help us get what we want.
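As an illustration of the syntax quoted above, a hosts.cfg entry for the example host might look like the line below. The address, the host name foo and the choice of the disk test are assumptions made for this example; check the hosts.cfg man page for your Xymon version to confirm exactly what NOPROPACK affects before relying on it.

```
# hosts.cfg (hypothetical entry; address and host name are placeholders)
192.0.2.10   foo   # NOPROPACK:+disk
```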
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
Yeah we've been burned by this multiple times as well. I would like to see this feature also.
Scot Kreienkamp | Senior Systems Engineer | La-Z-Boy Corporate One La-Z-Boy Drive | Monroe, Michigan 48162 | Office: 734-384-6403 | Mobile: 7349151444 | Email: Scot.Kreienkamp at la-z-boy.com
This is in line with a similar problem that I have encountered.
Consider a situation where you have a server with multiple drives (are there any with one anymore?), one of them goes to yellow or red, and you ack that alert for X amount of time.
What happens if another drive goes yellow/red during that time? It gets ignored. This applies to any test where there are multiple opportunities to have something trigger an alert. These tests are typically disk, procs, and svcs (you could argue memory also).
I created the Perl script below, which I simply run from cron. It tracks which warnings/alerts have been ack'd, saves the state of what is in an alert condition, and looks for changes during the ack period. If a change is detected, in order to drop the ack, I do a quick switch to green and the next update will go back to the appropriate alert level. This will then generate a new warning/alert.
This could be adapted to include tests that are disabled, but that is not currently included in the script. You are more than welcome to make adjustments as you see fit.
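The comparison at the heart of that approach can be sketched in a few lines. This is an illustration of the idea only, not a port of the script: `snapshot` and `should_drop_ack` are hypothetical names, and the sample status lines are invented for the example rather than taken from real Xymon output.

```python
LEVEL = {"red": 2, "yellow": 1}  # any other color counts as 0


def snapshot(status_lines):
    """Map each '&color item ...' status line to {item: level}."""
    snap = {}
    for line in status_lines:
        if not line.startswith("&"):
            continue
        parts = line[1:].split()
        if len(parts) < 2:
            continue
        color, item = parts[0], parts[1]
        snap[item] = LEVEL.get(color, 0)
    return snap


def should_drop_ack(at_ack, now):
    """True if any item is worse now than when the alert was ack'd.

    Items that were not alerting at ack time are treated as level 0,
    so a newly alerting item also drops the ack.
    """
    return any(level > at_ack.get(item, 0) for item, level in now.items())


# Hypothetical status lines: C: was red when the alert was ack'd,
# and D: turns yellow during the ack period.
acked = snapshot(["&red C: 98% used", "&green D: 40% used"])
later = snapshot(["&red C: 98% used", "&yellow D: 91% used"])
improved = snapshot(["&yellow C: 90% used", "&green D: 40% used"])
```

With these inputs, `should_drop_ack(acked, later)` reports a change because D: got worse, while an unchanged or improved state leaves the ack alone.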
#!/usr/bin/perl
# ------------------------------------------------------------------------------------------------
# Script Name: watch_ackd_alerts.pl
# Author: John Rothlisberger (john.r.rothlisberger at accenture.com)
# Created On: March 10, 2014
# VERSION="1.04092014.01";
# ------------------------------------------------------------------------------------------------
# Purpose: A script to monitor ack'd alerts and watch for changes.
# Example: The C: drive fills up and sends out a red alert. Knowing this will
# take some time to fix you ack the alert for 60 minutes. If, during that 60
# minute window the D: drive fills up you will not be notified as the 'disk' test
# has been acknowledged. This script is an attempt to short circuit the ack and
# allow for the new alert to be sent out.
# ------------------------------------------------------------------------------------------------
# Execution: Run every 5 minutes from xymon crontab:
# */5 * * * * /home/xymon/bin/watch_ackd_alerts.pl > /dev/null 2>&1
# The following directories need to be present:
# /home/xymon/server/tmp/ACK_WATCH
# Logging directory is assumed to be /home/xymon/logs.
# ------------------------------------------------------------------------------------------------
# Setup COUNT and directory where to store ack info files.
$COUNT=0;
$ACKSDIR="/home/xymon/server/tmp/ACK_WATCH";
# Log file
open(LOGFILE,">> /home/xymon/logs/ack_watch.log") || die("can't open ack_watch.log: $!");
# input for "ALERTS" from xymondboard is in the following form:
# servername|test|color|flags|lastchange|logtime|validtime|acktime|disabletime|sender|cookie|line1|ackmsg|dismsg|msg
# Open input file
open ALERTS, "/home/xymon/server/bin/xymon 0 'xymondboard color=yellow,red' |" or die "Couldn't execute: $!";
#for testing
#open ALERTS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_INIT" or die "Couldn't execute: $!";
# Parse all active alerts
while (<ALERTS>) {
  chomp;
  @LINE=split(/\|/,$_);
  # Single elements are read with $LINE[n] (the original used the slice form @LINE[n]).
  $SERVERNAME=$LINE[0];
  $TESTTYPE=$LINE[1];
  $COLOR=$LINE[2];
  $LASTCHANGE=$LINE[4];
  $LOGTIME=$LINE[5];
  $VALTIME=$LINE[6];
  $ACKTIME=$LINE[7];
  $DISTIME=$LINE[8];
  $COOKIE=$LINE[10];
  $LINE1=$LINE[11];
  $ACKMSG=$LINE[12];
  $DISMSG=$LINE[13];
  $MSG=$LINE[14];
  # Skip all alerts except disk, procs, and svcs (others are not tested)
  if ("$TESTTYPE" ne "svcs" && "$TESTTYPE" ne "disk" && "$TESTTYPE" ne "procs") { next; }
  # Alert has been ack'd if ACKTIME is > 0. This is where we watch for changes.
  if ( $ACKTIME > 0) {
    $COUNT+=1;
    $REDS=0; $YELLOWS=0; $REDS_CMP=0; $YELLOWS_CMP=0; $NEED_COMP=0;
    $now = localtime;
    $ENDTIME=localtime($ACKTIME);
    print LOGFILE "---------------------------- $now ----------------------------\n";
    print LOGFILE "Line: @LINE\n";
    print LOGFILE "SERVERNAME: $SERVERNAME\n";
    print LOGFILE "TESTTYPE: $TESTTYPE\n";
    print LOGFILE "COLOR: $COLOR\n";
    print LOGFILE "End Time: $ENDTIME\n";
    # If this is a new ack'd alert we will create a static file that holds current test state.
    # We will use this file to decide if there have been changes to what has been ack'd.
    if (! -e "${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" ) {
      # We need to get current details of alert that has been ack'd and store in DETAILS.
      open DETAILS, "/home/xymon/server/bin/xymon 0 'xymondlog ${SERVERNAME}.${TESTTYPE}' |" or die "Couldn't execute: $!";
      # for testing
      #open DETAILS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_DETAILS" or die "Couldn't execute: $!";
      # Create a new static file with current ack details.
      open OUTFILE, ">${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" or die "Couldn't execute: $!";
      while (<DETAILS>) {
        chomp;
        if ( $_ =~ /^&/ ) {
          $_ =~ s/\&//;
          @DETLINE=split(/ /,$_);
          # Change colors to numbers red=2 yellow=1 anything else = 0
          if ( "$DETLINE[0]" eq "red" ) {
            $COL_VALUE = "2";
          } elsif ( "$DETLINE[0]" eq "yellow" ) {
            $COL_VALUE = "1";
          } else {
            $COL_VALUE = "0";
          }
          # Create the static file which will be used on subsequent runs.
          print OUTFILE "${COL_VALUE}:${DETLINE[1]}\n";
          print LOGFILE "DATA: ${COL_VALUE}:${DETLINE[1]}\n";
        }
      }
      close OUTFILE;
      close DETAILS;
    # We have already recorded the initial state of the test and saved it to a file.
    # Now we will check new status output with that file to see if the alerts have changed.
    # This is where we will now look to see if changes have occurred since the alert was ack'd.
    } else {
      # Get current alert state details and use to compare to saved file
      open DETAILS, "/home/xymon/server/bin/xymon 0 'xymondlog ${SERVERNAME}.${TESTTYPE}' |" or die "Couldn't execute: $!";
      # for testing
      #open DETAILS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_DETAILS2" or die "Couldn't execute: $!";
      # Save the current alert state if needed after disabling of ack.
      $SAVED_STATUS="\nALERT or WARNING status has changed from time of ACK!\nACK IS CANCELED\n";
      # Start from an empty list so entries from a previous host/test are not compared.
      @COMP_contents=();
      while (<DETAILS>) {
        chomp;
        if ( $_ =~ /^&/ ) {
          $_ =~ s/\&//;
          @DETLINE=split(/ /,$_);
          # Change colors to numbers red=2 yellow=1 anything else = 0
          if ( "$DETLINE[0]" eq "red" ) {
            $COL_VALUE = "2";
          } elsif ( "$DETLINE[0]" eq "yellow" ) {
            $COL_VALUE = "1";
          } else {
            $COL_VALUE = "0";
          }
          push (@COMP_contents, "${COL_VALUE}:${DETLINE[1]}");
          $SAVED_STATUS.="$_\n";
        } elsif ( $_ !~ /\|\|/ ) {
          $SAVED_STATUS.="$_\n";
        }
      }
      close DETAILS;
      # Load the initial alert ack static file.
      @INITFILE_contents=();
      open INITFILE, "<${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" or die "Couldn't execute: $!";
      while (<INITFILE>) {
        chomp;
        push (@INITFILE_contents, "$_");
      }
      close INITFILE;
      # Create a hash that contains the initial ack file.
      %INITF = map { ($_ => 1) } @INITFILE_contents;
      %COMP = map { ($_ => 1) } @COMP_contents;
      foreach (@COMP_contents) {
        if ($INITF{$_}) {
          # No change to the alert - nothing to do.
          print LOGFILE "Alert hasn't changed: $_\n";
        } else {
          # Alert has changed in some form.
          print LOGFILE "Alert has changed: $_\n";
          # Limit the split to 2 fields so item names containing ':' (e.g. C:) stay intact.
          @CURRENT=split(/:/,$_,2);
          $CUR_COLOR=$CURRENT[0];
          $CUR_TEST=$CURRENT[1];
          # Find the saved entry for the same item; \Q..\E keeps the name from being read as a regex.
          @ACKD_EVENT=grep (/^\d:\Q${CUR_TEST}\E$/, @INITFILE_contents);
          @ACK_EVENT=split(/:/,$ACKD_EVENT[0],2);
          $ACK_COLOR=$ACK_EVENT[0];
          $ACK_TEST=$ACK_EVENT[1];
          # Compare the current alert color with that which was saved initially.
          if ( $CUR_COLOR < $ACK_EVENT[0] ) {
            # New color is lower than initial color - leave ack alone.
            print LOGFILE "NO ACTION NEEDED (new level lower than ack level).\n";
          } elsif ( $CUR_COLOR > $ACK_EVENT[0] ) {
            # New color is greater than initial ack color, dump ack so new alerts can be sent.
            if ( !$ACK_COLOR ) {
              # New alert not previously detected (different service, process, or disk alerting):
              # there is no saved entry for this item, or it was saved at level 0.
              print LOGFILE "ACK COLOR $ACK_COLOR\n";
              print LOGFILE "NEW ALERT - TERMINATE ACK AND SEND NEW ALERT.\n";
              # Reset the server.test status to green. Next update will reset the alert condition,
              # effectively canceling the acknowledge.
              open RESET, "/home/xymon/server/bin/xymon 0 'status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset New Alert Rcvd.' |" or die "Couldn't execute: $!";
              print LOGFILE "Set status1: status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset New Alert Rcvd.\n";
              close RESET;
              open NEWALERT, "/home/xymon/server/bin/xymon 0 'status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS' |" or die "Couldn't execute: $!";
              print LOGFILE "Set status2: status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS\n";
              close NEWALERT;
              exit 0;
            } else {
              # Level of original alert has upgraded (typically yellow->red)
              print LOGFILE "ACK COLOR $ACK_COLOR\n";
              print LOGFILE "OLD ALERT - TERMINATE ACK AND SEND NEW ALERT.\n";
              # Reset the server.test status to green. Next update will reset the alert condition,
              # effectively canceling the acknowledge.
              open RESET, "/home/xymon/server/bin/xymon 0 'status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset Alert Level Changed.' |" or die "Couldn't execute: $!";
              print LOGFILE "Set status3: status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset Alert Level Changed.\n";
              close RESET;
              open NEWALERT, "/home/xymon/server/bin/xymon 0 'status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS' |" or die "Couldn't execute: $!";
              print LOGFILE "Set status4: status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS\n";
              close NEWALERT;
              exit 0;
            }
          } else {
            # Nothing to do here.
            print "NO ACTION TAKEN\n";
            print LOGFILE "NO ACTION NEEDED (new level equals ack level).\n";
          }
        }
      }
    }
  }
}
close ALERTS;
# When there are no ack'd alerts clean out the ACK status directory.
if ( $COUNT == 0 ) { unlink glob "${ACKSDIR}/*"; }
Thanks, John
John Rothlisberger IT Strategy, Infrastructure & Security - Technology Growth Platform TGP for Business Process Outsourcing Accenture 312.693.3136 office
That's pretty cool. I'll have to look this one over.
Wouldn't it be better to enable rather than change the status to green?
-----Original Message----- From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of john.r.rothlisberger at accenture.com Sent: Tuesday, November 03, 2015 9:02 AM To: john.thurston at alaska.gov; xymon at xymon.com Subject: Re: [Xymon] "Disable until change"
This is in line with an similar problem that I have encountered.
Consider a situation where you have a server with multiple drives (are there any with one anymore) and it goes to yellow or red and you ack that alert for X amount of time.
What happens if another drive goes yellow/red during that time? It gets ignored. This applies to any test were there are multiple opportunities to have something trigger an alert. These tests are typically disk, procs, and svcs (you could argue memory also).
I created the below perl script which I simply run from cron which tracks what warnings/alerts have been ack'd, saves the state of what is in an alert condition, and looks for changes during the ack period. If a change is detected, in order to drop the ack, I do a quick switch to green and the next update will go back to the appropriate alert level. This will then generate a new warning/alert.
This could be adapted to include tests that are disabled but it is not currently included within the script. You are more than welcome to make adjustments as you see fit.
#!/usr/bin/perl
------------------------------------------------------------------------------------------------
Script Name: watch_ackd_alerts.pl
Author: John Rothlisberger (john.r.rothlisberger at accenture.com)
Created On: March 10, 2014
VERSION="1.04092014.01";
------------------------------------------------------------------------------------------------
Purpose: A script to monitor ack'd alerts and watch for changes.
Example: The C: drive fills up and sends out a red alert. Knowing this will
take some time to fix you ack the alert for 60 minutes. If, during that 60
minute window the D: drive fills up you will not be notified as the 'disk' test
has been acknowledged. This script is an attempt to short circuit the ack and
allow for the new alert to be sent out.
------------------------------------------------------------------------------------------------
Execution: Run every 5 minutes from xymon crontab:
*/5 * * * * /home/xymon/bin/watch_ackd_alerts.pl > /dev/null 2>&1
The following directories need to be present:
/home/xymon/server/tmp/ACK_WATCH
Logging directory is assumed to be /home/xymon/logs.
------------------------------------------------------------------------------------------------
Setup COUNT and directory where to store ack info files.
$COUNT=0; $ACKSDIR="/home/xymon/server/tmp/ACK_WATCH";
Log file
open(LOGFILE,">> /home/xymon/logs/ack_watch.log") || die("can't open ack_watch.log: $!");
input for "ALERTS" from xymondboard is in the following form:
servername|test|color|flags|lastchange|logtime|validtime|acktime|disabletime|sender|cookie|line1|ackmsg|dismsg|msg
Open input file
open ALERTS, "/home/xymon/server/bin/xymon 0 'xymondboard color=yellow,red' |" or die "Couldn't execute: $!"; #for testing #open ALERTS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_INIT" or die "Couldn't execute: $!";
Parse all active alerts
while (<ALERTS>) { chomp; @LINE=split(/\|/,$_); $SERVERNAME=@LINE[0]; $TESTTYPE=@LINE[1]; $COLOR=@LINE[2]; $LASTCHANGE=@LINE[4]; $LOGTIME=@LINE[5]; $VALTIME=@LINE[6]; $ACKTIME=@LINE[7]; $DISTIME=@LINE[8]; $COOKIE=@LINE[10]; $LINE1=@LINE[11]; $ACKMSG=@LINE[12]; $DISMSG=@LINE[13]; $MSG=@LINE[14];
Skip all alerts except disk, procs, and svcs (others are not tested)
if ("$TESTTYPE" ne "svcs" && "$TESTTYPE" ne "disk" && "$TESTTYPE" ne "procs") { next; }
Alert has been ack'd if ACKTIME is > 0. This is where we watch for changes.
if ( $ACKTIME > 0) { $COUNT+=1; $REDS=0; $YELLOWS=0; $REDS_CMP=0; $YELLOWS_CMP=0; $NEED_COMP=0;
$now = localtime;
$ENDTIME=localtime($ACKTIME);
print LOGFILE "---------------------------- $now ----------------------------\n";
print LOGFILE "Line: @LINE\n";
print LOGFILE "SERVERNAME: $SERVERNAME\n";
print LOGFILE "TESTTYPE: $TESTTYPE\n";
print LOGFILE "COLOR: $COLOR\n";
print LOGFILE "End Time: $ENDTIME\n";
If this is a new ack'd alert we will create a static file that holds current test state.
We will use this file to decide if there have been changes to what has been ack'd.
if (! -e "${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" ) {
We need to get current details of alert that has been ack'd and store in DETAILS.
open DETAILS, "/home/xymon/server/bin/xymon 0 'xymondlog ${SERVERNAME}.${TESTTYPE}' |" or die "Couldn't execute: $!";
for testing
#open DETAILS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_DETAILS" or die "Couldn't execute: $!";
Create a new static file with current ack details.
open OUTFILE, ">${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" or die "Couldn't execute: $!";
while (<DETAILS>) {
chomp;
if ( $_ =~ /^&/ ) {
$_ =~ s/\&//;
@DETLINE=split(/ /,$_);
Change colors to numbers red=2 yellow=1 anything else = 0
if ( "$DETLINE[0]" eq "red" ) {
$COL_VALUE = "2";
} elsif ( "$DETLINE[0]" eq "yellow" ) {
$COL_VALUE = "1";
} else {
$COL_VALUE = "0";
}
Create the static file which will be used on subsequent runs.
print OUTFILE "${COL_VALUE}:${DETLINE[1]}\n";
print LOGFILE "DATA: ${COL_VALUE}:${DETLINE[1]}\n";
}
}
close OUTFILE;
We have already recorded the initial state of the test and saved it to a file.
Now we will check new status output with that file to see if the alerts have changed.
This is where we will now look to see if changes have occurred since the alert was ack'd.
} else {
Get current alert state details and use to compare to saved file
open DETAILS, "/home/xymon/server/bin/xymon 0 'xymondlog ${SERVERNAME}.${TESTTYPE}' |" or die "Couldn't execute: $!";
for testing
#open DETAILS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_DETAILS2" or die "Couldn't execute: $!";
Save the current alert state if needed after disabling of ack.
$SAVED_STATUS="\nALERT or WARNING status has changed from time of ACK!\nACK IS CANCELED\n";;
while (<DETAILS>) {
chomp;
if ( $_ =~ /^&/ ) {
$_ =~ s/\&//;
@DETLINE=split(/ /,$_);
Change colors to numbers red=2 yellow=1 anything else = 0
if ( "$DETLINE[0]" eq "red" ) {
$COL_VALUE = "2";
} elsif ( "$DETLINE[0]" eq "yellow" ) {
$COL_VALUE = "1";
} else {
$COL_VALUE = "0";
}
push (@COMP_contents, "${COL_VALUE}:${DETLINE[1]}");
$SAVED_STATUS.="$_\n";
} elsif ( $_ !~ /\|\|/ ) {
$SAVED_STATUS.="$_\n";
}
}
Load the initial alert ack static file.
open INITFILE, "<${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" or die "Couldn't execute: $!";
while (<INITFILE>) {
chomp;
push (@INITFILE_contents, "$_");
}
close INITFILE;
Create a hash that contains the initial ack file.
%INITF = map(($_,1), at INITFILE_contents);
%COMP = map(($_,1), at COMP_contents);
foreach (@COMP_contents) {
if ($INITF{$_}) {
No change to the alert - nothing to do.
print LOGFILE "Alert hasn't changed: $_\n";
} else {
Alert has changed in some form.
print LOGFILE "Alert has changed: $_\n";
@CURRENT=split(/:/,$_);
$CUR_COLOR=$CURRENT[0];
$CUR_TEST=$CURRENT[1];
@ACKD_EVENT=grep (/:${CUR_TEST}/, @INITFILE_contents);
@ACK_EVENT=split(/:/,$ACKD_EVENT[0]);
$ACK_COLOR=$ACKD_EVENT[0];
$ACK_TEST=$ACKD_EVENT[1];
Compare the current alert color with that which was saved initially.
if ( $CUR_COLOR < $ACK_EVENT[0] ) {
New color is lower than initial color - leave ack alone.
print LOGFILE "NO ACTION NEEDED (new level lower than ack level).\n";
} elsif ( $CUR_COLOR > $ACK_EVENT[0] ) {
New color is greater than initial ack color, dump ack so new alerts can be sent.
if ( $ACK_COLOR == "" ) {
New alert not previously detected (different service, process, or disk alerting)
print LOGFILE "ACK COLOR $ACK_COLOR\n";
print LOGFILE "NEW ALERT - TERMINATE ACK AND SEND NEW ALERT.\n";
Reset the server.test status to green. Next update will reset the alert condition effectivly
canceling the acknowledge.
open RESET, "/home/xymon/server/bin/xymon 0 'status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset New Alert Rcvd.' |" or die "Couldn't execute: $!";
print LOGFILE "Set status1: status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset New Alert Rcvd.\n";
close RESET;
open NEWALERT, "/home/xymon/server/bin/xymon 0 'status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS' |" or die "Couldn't execute: $!";
print LOGFILE "Set status2: satus ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS\n";
close NEWALERT;
exit 0;
} else {
Level of original alert has upgraded (typically yellow->red)
print LOGFILE "ACK COLOR $ACK_COLOR\n";
print LOGFILE "OLD ALERT - TERMINATE ACK AND SEND NEW ALERT.\n";
Reset the server.test status to green. Next update will reset the alert condition effectivly
canceling the acknowledge.
open RESET, "/home/xymon/server/bin/xymon 0 'status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset Alert Level Changed.' |" or die "Couldn't execute: $!";
print LOGFILE "Set status3: status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset Alert Level Changed.\n";
close RESET;
open NEWALERT, "/home/xymon/server/bin/xymon 0 'status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS' |" or die "Couldn't execute: $!";
print LOGFILE "Set status4: status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS\n";
close NEWALERT;
exit 0;
}
} else {
Nothing to do here.
print "NO ACTION TAKEN\n"; print LOGFILE "NO ACTION NEEDED (new level equals ack level).\n"; } } } } } }
When there are no ack'd alerts clean out the ACK status directory.
if ( $COUNT == 0 ) { unlink glob "${ACKSDIR}/*"; }
Thanks, John Upcoming PTO:
John Rothlisberger IT Strategy, Infrastructure & Security - Technology Growth Platform TGP for Business Process Outsourcing Accenture 312.693.3136 office
Enable is for tests that have been disabled - would that work for an ack? I tried several different methods and this one was the most palatable that I could easily work with.
Thanks, John
John Rothlisberger IT Strategy, Infrastructure & Security - Technology Growth Platform TGP for Business Process Outsourcing Accenture 312.693.3136 office
-----Original Message----- From: Root, Paul T [mailto:Paul.Root at CenturyLink.com] Sent: Tuesday, November 03, 2015 10:01 AM To: Rothlisberger, John R. <john.r.rothlisberger at accenture.com>; 'john.thurston at alaska.gov' <john.thurston at alaska.gov>; 'xymon at xymon.com' <xymon at xymon.com> Subject: RE: [Xymon] "Disable until change"
That's pretty cool. I'll have to look this one over.
Wouldn't it be better to enable rather than change the status to green?
-----Original Message----- From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of john.r.rothlisberger at accenture.com Sent: Tuesday, November 03, 2015 9:02 AM To: john.thurston at alaska.gov; xymon at xymon.com Subject: Re: [Xymon] "Disable until change"
This is in line with a similar problem that I have encountered.
Consider a situation where you have a server with multiple drives (are there any with one anymore?) and it goes to yellow or red and you ack that alert for X amount of time.
What happens if another drive goes yellow/red during that time? It gets ignored. This applies to any test where there are multiple opportunities to have something trigger an alert. These tests are typically disk, procs, and svcs (you could argue memory also).
I created the Perl script below, which I simply run from cron. It tracks which warnings/alerts have been ack'd, saves the state of what is in an alert condition, and looks for changes during the ack period. If a change is detected, in order to drop the ack, I do a quick switch to green and the next update will go back to the appropriate alert level. This will then generate a new warning/alert.
This could be adapted to include tests that are disabled but it is not currently included within the script. You are more than welcome to make adjustments as you see fit.
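The comparison step described above can be sketched in isolation. This is only a minimal illustration, not the script itself: the helper name changed_items is hypothetical, and it assumes each state is a list of "severity:item" strings with severity 0 (ok), 1 (yellow) or 2 (red).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): compare a saved
# snapshot against the current state. Both lists hold "severity:item"
# strings, where severity is 0 (ok), 1 (yellow) or 2 (red).
sub changed_items {
    my ($saved, $current) = @_;

    # Index the snapshot by item name so lookups are exact matches.
    my %saved_level;
    for my $entry (@$saved) {
        my ($level, $item) = split /:/, $entry, 2;
        $saved_level{$item} = $level;
    }

    my @worse;
    for my $entry (@$current) {
        my ($level, $item) = split /:/, $entry, 2;
        # An item missing from the snapshot is treated as level 0.
        my $before = exists $saved_level{$item} ? $saved_level{$item} : 0;
        push @worse, $item if $level > $before;
    }
    return @worse;
}

# Example: C was red when the alert was ack'd; later D turns yellow as well.
my @saved   = ('2:C', '0:D');
my @current = ('2:C', '1:D');
print join(',', changed_items(\@saved, \@current)), "\n";   # prints "D"
```

Any item returned here is one that is new or has become worse since the ack, which is the condition under which the ack would be dropped.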
#!/usr/bin/perl
# ------------------------------------------------------------------------------------------------
# Script Name: watch_ackd_alerts.pl
# Author: John Rothlisberger (john.r.rothlisberger at accenture.com)
# Created On: March 10, 2014
# VERSION="1.04092014.01";
# ------------------------------------------------------------------------------------------------
# Purpose: A script to monitor ack'd alerts and watch for changes.
# Example: The C: drive fills up and sends out a red alert. Knowing this will
# take some time to fix you ack the alert for 60 minutes. If, during that 60
# minute window the D: drive fills up you will not be notified as the 'disk' test
# has been acknowledged. This script is an attempt to short circuit the ack and
# allow for the new alert to be sent out.
# ------------------------------------------------------------------------------------------------
# Execution: Run every 5 minutes from xymon crontab:
# */5 * * * * /home/xymon/bin/watch_ackd_alerts.pl > /dev/null 2>&1
#
# The following directories need to be present:
# /home/xymon/server/tmp/ACK_WATCH
# Logging directory is assumed to be /home/xymon/logs.
# ------------------------------------------------------------------------------------------------
# Set up COUNT and the directory where ack info files are stored.
$COUNT=0;
$ACKSDIR="/home/xymon/server/tmp/ACK_WATCH";
# Log file
open(LOGFILE,">> /home/xymon/logs/ack_watch.log") || die("can't open ack_watch.log: $!");
# Input for "ALERTS" from xymondboard is in the following form:
# servername|test|color|flags|lastchange|logtime|validtime|acktime|disabletime|sender|cookie|line1|ackmsg|dismsg|msg
# Open input file
open ALERTS, "/home/xymon/server/bin/xymon 0 'xymondboard color=yellow,red' |" or die "Couldn't execute: $!";
# for testing
#open ALERTS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_INIT" or die "Couldn't execute: $!";
# Parse all active alerts
while (<ALERTS>) {
chomp;
@LINE=split(/\|/,$_);
$SERVERNAME=$LINE[0];
$TESTTYPE=$LINE[1];
$COLOR=$LINE[2];
$LASTCHANGE=$LINE[4];
$LOGTIME=$LINE[5];
$VALTIME=$LINE[6];
$ACKTIME=$LINE[7];
$DISTIME=$LINE[8];
$COOKIE=$LINE[10];
$LINE1=$LINE[11];
$ACKMSG=$LINE[12];
$DISMSG=$LINE[13];
$MSG=$LINE[14];
# Skip all alerts except disk, procs, and svcs (others are not tested)
if ("$TESTTYPE" ne "svcs" && "$TESTTYPE" ne "disk" && "$TESTTYPE" ne "procs") { next; }
# Alert has been ack'd if ACKTIME is > 0. This is where we watch for changes.
if ( $ACKTIME > 0) {
$COUNT+=1; $REDS=0; $YELLOWS=0; $REDS_CMP=0; $YELLOWS_CMP=0; $NEED_COMP=0;
# Start each ack'd alert with empty comparison lists.
@COMP_contents=(); @INITFILE_contents=();
$now = localtime;
$ENDTIME=localtime($ACKTIME);
print LOGFILE "---------------------------- $now ----------------------------\n";
print LOGFILE "Line: @LINE\n";
print LOGFILE "SERVERNAME: $SERVERNAME\n";
print LOGFILE "TESTTYPE: $TESTTYPE\n";
print LOGFILE "COLOR: $COLOR\n";
print LOGFILE "End Time: $ENDTIME\n";
# If this is a new ack'd alert we will create a static file that holds current test state.
# We will use this file to decide if there have been changes to what has been ack'd.
if (! -e "${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" ) {
# We need to get current details of alert that has been ack'd and store in DETAILS.
open DETAILS, "/home/xymon/server/bin/xymon 0 'xymondlog ${SERVERNAME}.${TESTTYPE}' |" or die "Couldn't execute: $!";
# for testing
#open DETAILS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_DETAILS" or die "Couldn't execute: $!";
# Create a new static file with current ack details.
open OUTFILE, ">${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" or die "Couldn't execute: $!";
while (<DETAILS>) {
chomp;
if ( $_ =~ /^&/ ) {
$_ =~ s/\&//;
@DETLINE=split(/ /,$_);
# Change colors to numbers red=2 yellow=1 anything else = 0
if ( "$DETLINE[0]" eq "red" ) {
$COL_VALUE = "2";
} elsif ( "$DETLINE[0]" eq "yellow" ) {
$COL_VALUE = "1";
} else {
$COL_VALUE = "0";
}
# Create the static file which will be used on subsequent runs.
print OUTFILE "${COL_VALUE}:${DETLINE[1]}\n";
print LOGFILE "DATA: ${COL_VALUE}:${DETLINE[1]}\n";
}
}
close OUTFILE;
# We have already recorded the initial state of the test and saved it to a file.
# Now we will check new status output with that file to see if the alerts have changed.
# This is where we will now look to see if changes have occurred since the alert was ack'd.
} else {
# Get current alert state details and use to compare to saved file
open DETAILS, "/home/xymon/server/bin/xymon 0 'xymondlog ${SERVERNAME}.${TESTTYPE}' |" or die "Couldn't execute: $!";
# for testing
#open DETAILS, "</home/xymon/server/tmp/ACK_WATCH/ALERT_DETAILS2" or die "Couldn't execute: $!";
# Save the current alert state if needed after disabling of ack.
$SAVED_STATUS="\nALERT or WARNING status has changed from time of ACK!\nACK IS CANCELED\n";
while (<DETAILS>) {
chomp;
if ( $_ =~ /^&/ ) {
$_ =~ s/\&//;
@DETLINE=split(/ /,$_);
# Change colors to numbers red=2 yellow=1 anything else = 0
if ( "$DETLINE[0]" eq "red" ) {
$COL_VALUE = "2";
} elsif ( "$DETLINE[0]" eq "yellow" ) {
$COL_VALUE = "1";
} else {
$COL_VALUE = "0";
}
push (@COMP_contents, "${COL_VALUE}:${DETLINE[1]}");
$SAVED_STATUS.="$_\n";
} elsif ( $_ !~ /\|\|/ ) {
$SAVED_STATUS.="$_\n";
}
}
# Load the initial alert ack static file.
open INITFILE, "<${ACKSDIR}/${SERVERNAME}${TESTTYPE}${COLOR}${ACKTIME}" or die "Couldn't execute: $!";
while (<INITFILE>) {
chomp;
push (@INITFILE_contents, "$_");
}
close INITFILE;
# Create a hash that contains the initial ack file.
%INITF = map(($_,1),@INITFILE_contents);
%COMP = map(($_,1),@COMP_contents);
foreach (@COMP_contents) {
if ($INITF{$_}) {
# No change to the alert - nothing to do.
print LOGFILE "Alert hasn't changed: $_\n";
} else {
# Alert has changed in some form.
print LOGFILE "Alert has changed: $_\n";
@CURRENT=split(/:/,$_,2);
$CUR_COLOR=$CURRENT[0];
$CUR_TEST=$CURRENT[1];
# Find the saved entry for the same item by exact name rather than by pattern match.
@ACKD_EVENT=grep { (split(/:/,$_,2))[1] eq $CUR_TEST } @INITFILE_contents;
@ACK_EVENT=@ACKD_EVENT ? split(/:/,$ACKD_EVENT[0],2) : ();
$ACK_COLOR=$ACK_EVENT[0];
$ACK_TEST=$ACK_EVENT[1];
# Compare the current alert color with that which was saved initially.
if ( $CUR_COLOR < $ACK_EVENT[0] ) {
# New color is lower than initial color - leave ack alone.
print LOGFILE "NO ACTION NEEDED (new level lower than ack level).\n";
} elsif ( $CUR_COLOR > $ACK_EVENT[0] ) {
# New color is greater than initial ack color, dump ack so new alerts can be sent.
if ( $ACK_COLOR == "" ) {
# New alert not previously detected (different service, process, or disk alerting)
print LOGFILE "ACK COLOR $ACK_COLOR\n";
print LOGFILE "NEW ALERT - TERMINATE ACK AND SEND NEW ALERT.\n";
# Reset the server.test status to green. Next update will reset the alert condition, effectively
# canceling the acknowledge.
open RESET, "/home/xymon/server/bin/xymon 0 'status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset New Alert Rcvd.' |" or die "Couldn't execute: $!";
print LOGFILE "Set status1: status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset New Alert Rcvd.\n";
close RESET;
open NEWALERT, "/home/xymon/server/bin/xymon 0 'status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS' |" or die "Couldn't execute: $!";
print LOGFILE "Set status2: status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS\n";
close NEWALERT;
exit 0;
} else {
# Level of original alert has upgraded (typically yellow->red)
print LOGFILE "ACK COLOR $ACK_COLOR\n";
print LOGFILE "OLD ALERT - TERMINATE ACK AND SEND NEW ALERT.\n";
# Reset the server.test status to green. Next update will reset the alert condition, effectively
# canceling the acknowledge.
open RESET, "/home/xymon/server/bin/xymon 0 'status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset Alert Level Changed.' |" or die "Couldn't execute: $!";
print LOGFILE "Set status3: status+1 ${SERVERNAME}.${TESTTYPE} green Ack Reset Alert Level Changed.\n";
close RESET;
open NEWALERT, "/home/xymon/server/bin/xymon 0 'status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS' |" or die "Couldn't execute: $!";
print LOGFILE "Set status4: status ${SERVERNAME}.${TESTTYPE} $DETLINE[0] $SAVED_STATUS\n";
close NEWALERT;
exit 0;
}
} else {
# Nothing to do here.
print "NO ACTION TAKEN\n";
print LOGFILE "NO ACTION NEEDED (new level equals ack level).\n";
}
}
}
}
}
}
# When there are no ack'd alerts clean out the ACK status directory.
if ( $COUNT == 0 ) { unlink glob "${ACKSDIR}/*"; }
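For anyone adapting the script, the step that turns status text into snapshot records can be tried on its own. The sketch below is an illustration under assumptions: the helper name snapshot_lines is hypothetical, the sample text is invented rather than captured Xymon output, and it splits on any run of whitespace where the script splits on a single space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Severity numbers used for comparison; any other color counts as 0.
my %SEVERITY = (red => 2, yellow => 1);

# Hypothetical helper: keep only lines of the form "&color item ..." and
# return one "severity:item" record per line.
sub snapshot_lines {
    my ($status_text) = @_;
    my @records;
    for my $line (split /\n/, $status_text) {
        next unless $line =~ /^&(\S+)\s+(\S+)/;
        my ($color, $item) = ($1, $2);
        my $level = exists $SEVERITY{$color} ? $SEVERITY{$color} : 0;
        push @records, "$level:$item";
    }
    return @records;
}

# Invented sample text, shaped like the "&color item" lines the script reads.
my $sample = join "\n",
    'some summary line without a leading ampersand',
    '&red /var is too full',
    '&yellow /home is filling up',
    '&green /tmp is fine';

print "$_\n" for snapshot_lines($sample);
```

Each printed record has the same "severity:item" shape as the lines the script writes to its files under ACK_WATCH.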
participants (6)
- cleaver@terabithia.org
- john.r.rothlisberger@accenture.com
- john.thurston@alaska.gov
- novosirj@ca.rutgers.edu
- Paul.Root@CenturyLink.com
- Scot.Kreienkamp@la-z-boy.com