[Xymon] Always purple history after time shift on server - how to fix

10 Mar 2016 · *.*

I would like to share some hints on resolving a history reporting problem
that appeared after a big time shift (about 4 hours) on the monitoring server.
Maybe it will help someone else.
It happened some months ago, but I only found time to fix it today.

What happened:

The time on the monitoring host jumped forward by 4 hours.
As a result, all metrics reported Purple status (this is intended
functionality, but it would be nice if Xymon detected a big time shift and
adapted its reporting in some way).
It was a problem at the virtual host provider; I reported it and the time
was set back to the correct value.
To fix the current reporting I cleaned some files under xymon/logs or
acks (I really do not remember which ones right now). This reset the last
status duration information, but the current values for all metrics became
correct.
Everything became OK, except that when I checked the history of a metric
(...xymon-cgi/history.sh? ...), for some metrics Xymon always reported
Purple for the last event (since the time of the incident).

It affected only some metrics (not all), and I had a second monitoring server
with the same information (without the time shift incident), so I was able
to live with it for some months.

Solution:
Today I fixed that reporting problem with the following steps, which
should be executed for every host-metric pair that has the problem.
We have to work with 2 files:

The host history file, like
hist/HOSTNAME

Here we should find records with a negative duration value (field 4), like:
svcs 1435410898 1435426055 -15157 gr pu 1
who 1435410899 1435426055 -15156 gr pu 1
msgs 1435410899 1435426055 -15156 gr pu 1
netstat 1435410899 1435426055 -15156 gr pu 1
memory 1435411034 1435426055 -15021 ye pu 2
uptime 1435411140 1435426055 -14915 gr pu 1
procs 1435411145 1435426055 -14910 gr pu 1
disk 1435411150 1435426055 -14905 ye pu 2
cpu 1435411222 1435426055 -14833 gr pu 1
and drop them

The service history file, like
hist/HOSTNAME.svc

Again, find records with a negative duration value (the last field), like:
Sat Jun 27 20:27:35 2015 purple 1435426055 -15157
and drop the record(s). There really should be just one.
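
As a sanity check on the host history records shown above: the negative duration appears to be simply field 2 minus field 3 (for example 1435410898 - 1435426055 = -15157, a bit more than 4 hours). The meaning of those two timestamp fields is inferred from the sample data, not taken from the Xymon documentation, and the function name is hypothetical. A minimal sketch that lists only the records matching that pattern:

```shell
# List host history records whose duration (field 4) is negative and equals
# field 2 minus field 3. Field meanings are an assumption based on the samples.
negative_by_timestamps() {
    awk '($2 - $3) == $4 && $4 < 0 {
        printf "%s %d s (%.1f h)\n", $1, $4, $4 / 3600
    }' "$@"
}
```

It reads the files given as arguments (or standard input), for example `negative_by_timestamps hist/HOSTNAME`.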
In fact, to fix the reporting for just one service it is enough to drop the
negative duration records from the service history file only (tested).
But I do not see any reason to keep such records in the host history file, so
I delete them from that file too.
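
Dropping the records can be sketched with two small awk filters. This is only a sketch: the function names are hypothetical, and it assumes the duration sits in field 4 of the host history file and in field 8 of the service history file, as in the sample records above. Both functions only print the cleaned content and do not touch the original file.

```shell
# Print a host history file (hist/HOSTNAME) without negative-duration records.
# Assumes the duration is field 4, as in the sample records.
drop_negative_host() {
    awk '$4 !~ /^-[0-9]+$/' "$1"
}

# Print a service history file (hist/HOSTNAME.svc) without negative-duration
# records. Assumes the duration is field 8; lines without it are kept.
drop_negative_svc() {
    awk '$8 !~ /^-[0-9]+$/' "$1"
}
```

For example, `drop_negative_svc hist/HOSTNAME.svc > /tmp/HOSTNAME.svc.new` writes a cleaned copy that can be reviewed before it replaces the original.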

How to automate the process:
Find the affected hist files.
step 1 (host history files):
find hist/ -type f -print0 | xargs -0 grep " -" | awk '{print $1" :"$4}' | grep ":-"
output like:
...
hist/idc-oracle03.msc-sh.local:ssh :-14862
hist/idc-oracle03.msc-sh.local:dblock :-15012
hist/idc-oracle03.msc-sh.local:dbrec :-15012
hist/idc-oracle03.msc-sh.local:dbup :-15011
hist/idc-oracle03.msc-sh.local:dbext :-14989
...
step 2 (service history files):
find hist/ -type f -print0 | xargs -0 grep " -" | awk '{print $1" :"$8}' | grep ":-"
output like:
..
hist/idc-oracle03,domain.com.dbrec:Sat :-15012
hist/gdc-oracle03,domain.com.dbup:Sat :-15136
hist/idc-oracle01,domain.com.disk:Sat :-14961
hist/gdc-oracle01,domain.com.dbaud:Thu :-26793
hist/gdc-oracle01,domain.com.dbaud:Sat :-14940
..
The removal of the records can then be automated too.
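
One possible way to automate the removal is sketched below. The function name, the backup directory and the assumption that hist/ is a flat directory are all hypothetical, and it again assumes the duration is field 4 in host history files and field 8 in service history files. Every file that gets changed is first copied to the backup directory, which lives outside hist/ so that no extra files appear there. It is probably safest to run something like this while Xymon is not writing to the history files.

```shell
# Remove negative-duration records from every file in a hist directory.
# usage: clean_hist_dir [HIST_DIR] [BACKUP_DIR]   (both names are assumptions)
clean_hist_dir() {
    dir=${1:-hist}
    backup=${2:-hist-backup}
    mkdir -p "$backup" || return 1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Skip files that contain no negative duration in field 4 or field 8.
        awk '$4 ~ /^-[0-9]+$/ || $8 ~ /^-[0-9]+$/ { found = 1; exit }
             END { exit !found }' "$f" || continue
        # Keep an untouched copy, then rewrite the file from that copy.
        cp -p "$f" "$backup/" || continue
        awk '!($4 ~ /^-[0-9]+$/ || $8 ~ /^-[0-9]+$/)' \
            "$backup/$(basename "$f")" > "$f"
        echo "cleaned $f"
    done
}
```

Calling `clean_hist_dir hist /tmp/hist-backup` prints one "cleaned ..." line per modified file; files without negative durations are left alone and not copied.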
Best regards,
Andrey Chervonets
SIA CoMinder
http://www.cominder.eu/

A.Chervonets＠cominder.eu
