Hi Buchan,
On 02/ 7/11 10:31 PM, Buchan Milne wrote:
On Monday, 7 February 2011 16:37:14 Dominique Frise wrote:
Hi Henrik,
Thanks for replying.
On 02/ 7/11 01:10 PM, Henrik Størner wrote:
In <4D4C0F83.8080204 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
What is the minimum time for the same alert status to stay up to be processed correctly by Xymon ?
I am not sure I understand the question - are you saying that Xymon does not generate the notifications you expect it to ?
Sort of...
We have SNMP trap handling configured (thanks Andy Farrior)
It is an ugly hack. We need a better solution. I didn't implement this one for my own environments, as I was not willing to settle for it (one issue being the multiple parts, snmptrapd->snmptt->sec->perl script), but I haven't finished the work I wanted to do (a perl NetSNMP::TrapReceiver running in snmptrapd that does all the tasks above) to have a better solution.
Well, Andy's work is advertized as "A very elegant method of feeding traps into Xymon" ;-) (http://www.xymon.com/xymon/help/xymon-tips.html#snmptraps) This is also the kind of approach that is used for Nagios, but there alerting is better supported by the "volatile" service (http://nagios.sourceforge.net/docs/2_0/volatileservices.html).
but are not completely happy with how it handles the alerting. When a bad trap from a given host is received, an alert status is generated for Xymon (yellow or red). So far, so good.
Actually, IMHO, no. The BB model works on monitoring a status, and generating an event when the status changes. The problem comes when you listen for events (traps), and the only way to handle them is to create a status, so you can generate an event.
I think event-based monitoring should not go via 'status' messages, but go into a separate channel, which handles events as events, and possibly alerts directly instead of via the status channel.
Agree
Then, before this status' validity expires (before it turns purple), a periodic launch of a script resets its color to green, thus generating a recovered message independently of the real status of the service reported by the trap. Furthermore, while a <host>.trap status is in alert state, other bad traps from the same host and of the same level will not generate any alerts (ignored).
This is a generic problem, and applies to some extent to other tests as well. Even if different types of traps were reported to different tests, there is the issue of no component-level ack/alert/recover/disable etc. So, for example, if a non-critical filesystem goes yellow, and this is ack'ed or disabled, then a critical filesystem goes red, there will be no new notification, and it won't appear on the critical systems view, just as a trap for a non-critical router interface will be lumped together with a critical one.
Not trivial to solve
Here follows a description of what we are trying to implement in order to improve this handling:
- a bad <host> trap is detected.
- generate a yellow/red <host>.trap status for Xymon.
- after a short delay (ideally 1 sec.), generate a clear <host>.trap status for Xymon.
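
As a rough illustration of the steps listed above, here is a minimal Python sketch that builds the two status messages (alert, then clear) for one trap. The function names, the example host and the reset text are made up for this sketch and are not part of Andy's setup; it assumes the usual "status HOST.TEST COLOR text" message form, with dots in the hostname written as commas, and a Xymon daemon listening on TCP port 1984.

```python
import socket
import time

def xymon_status(host, test, color, text):
    """Build a Xymon 'status' message (hypothetical helper)."""
    # Dots in the hostname are written as commas in the HOST.TEST field.
    target = f"{host.replace('.', ',')}.{test}"
    return f"status {target} {color} {text}"

def trap_messages(host, color, trap_text):
    """Return the alert message followed by the 'clear' reset message."""
    return [
        xymon_status(host, "trap", color, trap_text),
        xymon_status(host, "trap", "clear", "trap status reset"),
    ]

def send_to_xymon(message, server="127.0.0.1", port=1984):
    """Send one message to the Xymon daemon (server address is an assumption)."""
    with socket.create_connection((server, port), timeout=5) as conn:
        conn.sendall(message.encode("utf-8"))

def report_trap(host, color, trap_text, delay=1.0):
    """Send the alert status, wait for the short delay, then send 'clear'."""
    alert, reset = trap_messages(host, color, trap_text)
    send_to_xymon(alert)
    time.sleep(delay)
    send_to_xymon(reset)
```

Only report_trap() touches the network; the message builders can be checked on their own, e.g. trap_messages("router1.example.com", "red", "linkDown").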
So, the status page for the host is useless; the only thing you get is alerting. It would be much better (IMHO) to go:
1) snmptrapd running NetSNMP::TrapReceiver which does MIB parsing etc., pruning of duplicate traps itself, storing some trap details, and sends an 'event' message to hobbitd.
2) A hobbit worker listening on the event channel and deciding when to send page or ack messages to hobbitd for hobbitd_alert to act on. In some cases, it might be desirable for it to do something besides alert (e.g. trigger a configuration update for a network device on a device configuration save trap)
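
To make the two parts above a bit more concrete, here is a small Python sketch of the duplicate pruning and of the "decide what to do with an event" step. It is only an illustration under assumptions: the class, the ACTIONS table, the "configSaved" trap name and the 60-second window are all hypothetical, and nothing here talks to snmptrapd or hobbitd.

```python
class TrapDeduplicator:
    """Suppress repeats of the same (host, trap) pair inside a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self._last_forwarded = {}  # (host, trap_name) -> time last forwarded

    def should_forward(self, host, trap_name, now):
        key = (host, trap_name)
        last = self._last_forwarded.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: prune it
        self._last_forwarded[key] = now
        return True

# Hypothetical routing table: what the event worker does for each trap.
ACTIONS = {
    "linkDown": "page",
    "configSaved": "config_update",  # made-up trap name for the example
}

def handle_trap(dedup, host, trap_name, now):
    """Return the action for this trap, or None if it was pruned."""
    if not dedup.should_forward(host, trap_name, now):
        return None
    return ACTIONS.get(trap_name, "page")  # default to paging
```

Passing the timestamp in as "now" keeps the logic easy to test; a real receiver would presumably use the trap's arrival time.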
Solid concept indeed
All trap statuses except those in alert state are periodically set to clear. The red/yellow -> clear transition should not generate a recovered message. This should be achievable by removing "clear" from "OKCOLORS" in xymonserver.cfg, but this does not work without modifying xymond_alert.c. A good <host>.trap should generate a green message and thus a recovered message.
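
The intended behaviour can be written down as a small transition table. The Python sketch below is only a model of what is described above, not of how xymond_alert actually decides; the set names and the function are invented for the illustration, and it assumes that a green status arriving after a clear should still count as a recovery.

```python
# Colors that should trigger an alert for the trap status.
ALERT_COLORS = {"red", "yellow"}
# Colors that should trigger a recovered message; "clear" is deliberately left out.
RECOVER_COLORS = {"green"}

def notification_for(old_color, new_color):
    """Return 'alert', 'recovered' or None for a color transition."""
    if new_color == old_color:
        return None  # no change, nothing to send
    if new_color in ALERT_COLORS:
        return "alert"
    if new_color in RECOVER_COLORS:
        return "recovered"
    return None  # e.g. red -> clear: silent reset

# The sequence bad trap, reset, good trap:
for old, new in [("clear", "red"), ("red", "clear"), ("clear", "green")]:
    print(f"{old} -> {new}: {notification_for(old, new)}")
```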
This is mostly just going to result in disk churn that you don't even want to look at, just to send some mails. If you didn't have Xymon in the picture, snmptrapd and traptoemail would do most of what you get ...
The database history fed by snmptt is quite useful too
We know that a 100% handling of traps in Xymon is not possible because we are misusing a single status (trap) to report many others, but this scenario would allow:
- a better alerting of all bad traps from the same host and of same level.
Well, it is slightly better, but I don't see how traps for different reasons in different orders are going to be handled well.
Not covered at all :-(
- the recovered status is a real recover (the text of the trap explains what recovered)
This is about the only advantage, and I think there is more that could be improved with fewer disadvantages.
Eager to test your solution... Don't forget to drop us a mail when it's ready for testing!
Regards, Dominique
Regards, Buchan