[hobbit] procs test keeps paging, although green for +4 days


David.Gore＠VerizonBusiness.com

17 Aug 2006

4:05 p.m.

Interesting problem: we keep getting paged for a procs test on a host twice a day, although the procs test has been green for the last 4+ days. It would appear something is stuck? What should I look at? The test shows up in a couple of 'chk' files in ~/server/tmp, but I wouldn't know what to make of them. Any ideas for troubleshooting?

~David


henrik＠hswn.dk

17 Aug 2006

4:28 p.m.

On Thu, Aug 17, 2006 at 04:05:38PM +0000, David Gore wrote:

> Interesting problem, we keep getting paged for a procs test on a host twice a day although the procs test has been green for the last 4+ days. It would appear something is stuck? What to look at? It is in couple 'chk' files in ~/server/tmp, but I wouldn't know what to make of that. Ideas to troubleshoot?

The alert.chk.sub file contains the recipients of the alerts currently active - there is one line for each recipient. E.g.

1157011746|myserver|sslcert|mail|henrik at test.com

The alert.chk file has one line for every status that is in a potentially alerting state, i.e. it is red, yellow or purple. E.g.

myserver|sslcert|mysite/webservers|10.0.36.166|yellow|1155640943|1157011746|paging|status ...

The field that says "paging" has "norecip" if the status doesn't have any alert-recipients defined, or if the alerts are restricted, e.g. to a certain time of day.
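[Editor's sketch] The two line formats described above can be split on the "|" separator. The following is a minimal Python sketch, not part of Hobbit itself. Only the "paging"/"norecip" field is named in the message; every other field name (host, test, page_path, ip, color, time_a, time_b, method, recipient) is a guess from the two example lines and should be treated as hypothetical.

```python
from typing import NamedTuple


class AlertChkEntry(NamedTuple):
    """One line of alert.chk. Field names other than 'state' are guesses."""
    host: str
    test: str
    page_path: str   # assumed: third field looks like a page/group path
    ip: str
    color: str       # red, yellow or purple according to the message
    time_a: int      # assumed: epoch timestamp, exact meaning not stated
    time_b: int      # assumed: epoch timestamp, exact meaning not stated
    state: str       # "paging" or "norecip"
    message: str     # remainder of the line (the status text)


class AlertSubEntry(NamedTuple):
    """One line of alert.chk.sub. All field names are guesses."""
    timestamp: int
    host: str
    test: str
    method: str
    recipient: str


def parse_alert_chk_line(line: str) -> AlertChkEntry:
    # Split at most 8 times so any "|" inside the status text stays intact.
    parts = line.rstrip("\n").split("|", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    host, test, page_path, ip, color, time_a, time_b, state, message = parts
    return AlertChkEntry(host, test, page_path, ip, color,
                         int(time_a), int(time_b), state, message)


def parse_alert_sub_line(line: str) -> AlertSubEntry:
    # Split at most 4 times so the recipient field is kept whole.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    timestamp, host, test, method, recipient = parts
    return AlertSubEntry(int(timestamp), host, test, method, recipient)
```

Run against ~/server/tmp/alert.chk, a parser like this makes it easy to list every host and test whose state field is "paging" and compare that list with what the web display shows as green.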

I've never seen it happen, but there is a very small time window between the startup of the hobbitd daemon and the startup of the hobbitd_alert module where a green update is registered with hobbitd, but it doesn't make it to the hobbitd_alert module - and then you have this situation.

Restarting the hobbitd_alert module - just kill the hobbitd_alert process, it will restart automatically - should clean it up, logging a message like "Stale alert for HOSTNAME:TEST dropped" to the page.log file.
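[Editor's sketch] After the restart, one way to confirm the cleanup is to search the page.log text for the message quoted above. This Python sketch assumes only the wording "Stale alert for HOSTNAME:TEST dropped" given in the message. Anything else on the log line (timestamps, prefixes) is unknown and is ignored by the pattern, and the function name is hypothetical.

```python
import re

# Matches the message wording quoted in the thread. The rest of the log
# line layout is not documented here, so the pattern is left unanchored.
STALE_RE = re.compile(r"Stale alert for (?P<host>[^:\s]+):(?P<test>\S+) dropped")


def find_dropped_stale_alerts(log_text: str) -> list[tuple[str, str]]:
    """Return (host, test) pairs for every stale-alert message found."""
    return [(m.group("host"), m.group("test"))
            for m in STALE_RE.finditer(log_text)]
```

Passing the contents of page.log to find_dropped_stale_alerts should return the host and test of each alert that hobbitd_alert discarded on startup. An empty list would suggest the stuck alert is still present.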

Regards, Henrik


David.Gore＠verizonbusiness.com

10:14 p.m.

Henrik Stoerner wrote:

> On Thu, Aug 17, 2006 at 04:05:38PM +0000, David Gore wrote:
>
>> ...
>> Interesting problem, we keep getting paged for a procs test on a host twice a day although the procs test has been green for the last 4+ days. It would appear something is stuck? What to look at? It is in couple 'chk' files in ~/server/tmp, but I wouldn't know what to make of that. Ideas to troubleshoot?
>
> The alert.chk.sub file contains the recipients of the alerts currently active - there is one line for each recipient. E.g.
>
> 1157011746|myserver|sslcert|mail|henrik at test.com
>
> The alert.chk file has one line for every status that is in a potentially alerting state, i.e. it is red, yellow or purple. E.g.
>
> myserver|sslcert|mysite/webservers|10.0.36.166|yellow|1155640943|1157011746|paging|status ...
>
> The field that says "paging" has "norecip" if the status doesn't have any alert-recipients defined, or if the alerts are restricted, e.g. to a certain time of day.
>
> I've never seen it happen, but there is a very small time window between the startup of the hobbitd daemon and the startup of the hobbitd_alert module where a green update is registered with hobbitd, but it doesn't make it to the hobbitd_alert module - and then you have this situation.
>
> Restarting the hobbitd_alert module - just kill the hobbitd_alert process, it will restart automatically - should clean it up, logging a message like "Stale alert for HOSTNAME:TEST dropped" to the page.log file.

Killed hobbitd_alert, it restarted and dropped the bogus alert as you said it would. Thank you!

~David
