I typically get the alert message closely followed by the recovery message, which doesn't make sense: if the poll time is 5 min, I would expect some separation between the alert and the recovery, but they seem to have the same timestamp... I will verify that.
Yes, I am running RC5 plus the patch for the duplicate recovery messages.
I need to qualify this... It seems that this is a problem, but I can't say for certain yet since I haven't tried to manually cause high CPU load for <20 minutes. I will try that. I just wanted to see if anyone else had seen any similar symptoms.
Kevin
-----Original Message-----
From: Henrik Stoerner [mailto:henrik at hswn.dk]
Sent: Monday, March 14, 2005 5:21 PM
To: hobbit at hswn.dk
Subject: Re: [hobbit] DURATION tag
Importance: Low
On Mon, Mar 14, 2005 at 05:06:20PM -0500, Kevin.Hanrahan at novainfo.com wrote:
I'm not sure the duration tag is working correctly in the hobbit-alerts.cfg setup. I have tests like I/O and CPU that will spike for a short time and I wanted to eliminate the email notifications for those spikes. I set the DURATION tag for 10 or 20 minutes like this:
HOST=$SERVER1
    MAIL $SYSADMIN COLOR=red EXSERVICE=msgs,cpu,http,webContent REPEAT=30m RECOVERED
    MAIL $SYSADMIN COLOR=red SERVICE=cpu DURATION>20 REPEAT=30m RECOVERED
The messages you get - are they alert messages or recovery messages?
I suppose you're running RC5 plus the patch I sent you for the duplicate recovery messages ?
Regards, Henrik
On Tue, Mar 15, 2005 at 12:32:26AM -0500, Kevin Hanrahan wrote:
I typically get the alert message closely followed by the recovery message, which doesn't make sense: if the poll time is 5 min, I would expect some separation between the alert and the recovery, but they seem to have the same timestamp... I will verify that.
Yes, I am running RC5 plus the patch for the duplicate recovery messages.
I need to qualify this... It seems that this is a problem, but I can't say for certain yet since I haven't tried to manually cause high CPU load for <20 minutes. I will try that. I just wanted to see if anyone else had seen any similar symptoms.
I use the DURATION setting myself, and haven't seen any alerts where it was not observed.
I'd like you to dig into the history logs for one of these occurrences and get the timestamps for when it went red and then back to green, and then correlate that with the notifications.log file of when alert- and recovery-messages were sent.
If you'd rather not, then just send me the ~/data/hist/HOSTNAME.cpu file and the output from "grep HOSTNAME ~/data/acks/notifications.log".
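The correlation described above can be sketched in a few lines of Python. This is only an illustration, not how Hobbit itself evaluates DURATION: it assumes the status changes and the alert send times have already been extracted by hand as epoch seconds (the on-disk formats of the history file and notifications.log are not shown in this thread, so no parsing is attempted), and the function names red_intervals and premature_alerts are made up for this example. It also assumes DURATION>20 means the status must have been red for at least 20 minutes before an alert goes out.

```python
def red_intervals(changes):
    """Return (start, end) pairs for each red period.

    `changes` is a time-sorted list of (epoch_seconds, color) tuples,
    transcribed by hand from the history log (hypothetical input shape).
    `end` is None if the status is still red at the end of the list.
    """
    spans = []
    start = None
    for ts, color in changes:
        if color == "red" and start is None:
            start = ts                      # red period begins
        elif color != "red" and start is not None:
            spans.append((start, ts))       # red period ends
            start = None
    if start is not None:
        spans.append((start, None))
    return spans


def premature_alerts(changes, alert_times, min_minutes):
    """List alerts sent before the red period had lasted `min_minutes`.

    `alert_times` holds the epoch seconds at which alert messages were
    sent, transcribed from notifications.log (hypothetical input shape).
    Returns (sent_time, seconds_red_when_sent) tuples.
    """
    limit = min_minutes * 60
    spans = red_intervals(changes)
    early = []
    for sent in alert_times:
        for start, end in spans:
            if start <= sent and (end is None or sent <= end):
                if sent - start < limit:
                    early.append((sent, sent - start))
                break
    return early


# Example: a 5-minute CPU spike with an alert stamped at recovery time.
changes = [(0, "green"), (1000, "red"), (1300, "green")]
print(premature_alerts(changes, [1300], min_minutes=20))
```

Any entry in the result would point to an alert that went out although the red period was shorter than the configured duration, which is the symptom reported here.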
Regards, Henrik