I would like to share some hints on resolving a history reporting problem after a big time shift (about 4 hours) on the monitoring server. Maybe it will help someone else.
It happened some months ago, but I only found time to fix it today. What happened:
- The time on the monitoring host jumped forward by 4 hours.
- As a result, all metrics reported Purple status (this is intended behaviour, but it would be nice if XyMon detected a big time shift and adapted its reporting in some way).
- It was a problem at the virtual host provider; I reported it and the time was set back to the correct value.
- To fix the current reporting I cleaned out some files under xymon/logs or acks (I really do not remember which ones right now). This reset the last-status duration information, but the current values for all metrics became correct.
- Everything became OK, except that when I checked the history ( ...xymon-cgi/history.sh? ...) for some metrics, XyMon always reported Purple for the last event (since the time of the incident).
This affected only some metrics (not all). I had a second monitoring server with the same information (which did not have the time shift incident), so I was able to live with it for some months.
Solution: today I fixed the reporting problem with the following steps, which should be executed for every host-metric pair that has the problem.
We need to work with 2 files:
- the host history file, e.g. hist/HOSTNAME
Here we should find the records with negative duration values (the 4th field), like:
    svcs 1435410898 1435426055 -15157 gr pu 1
    who 1435410899 1435426055 -15156 gr pu 1
    msgs 1435410899 1435426055 -15156 gr pu 1
    netstat 1435410899 1435426055 -15156 gr pu 1
    memory 1435411034 1435426055 -15021 ye pu 2
    uptime 1435411140 1435426055 -14915 gr pu 1
    procs 1435411145 1435426055 -14910 gr pu 1
    disk 1435411150 1435426055 -14905 ye pu 2
    cpu 1435411222 1435426055 -14833 gr pu 1
and drop them.
- the service history file, e.g. hist/HOSTNAME.svc
Again, find the records with negative duration values (the last field), like:
    Sat Jun 27 20:27:35 2015 purple 1435426055 -15157
and drop them (there should really be just one).
In fact, to fix the reporting for just one service it is enough to drop the negative duration records from the service history file only (tested). But I do not see any reason to keep such records in the host history file, so I delete them from that file too.
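The manual cleanup above can be sketched as a small non-destructive filter. This is only a sketch: the function name drop_negative is made up for illustration, and the field positions (4 for the host history file, 8 for the service history file) are taken from the sample records in this post, not from XyMon documentation.

```shell
#!/bin/sh
# drop_negative FIELD FILE   (hypothetical helper, not part of XyMon)
# Print FILE without the records whose FIELD-th whitespace-separated
# column is a negative number. The file itself is not modified.
# A missing or non-numeric field counts as 0, so such lines are kept.
drop_negative() {
    awk -v f="$1" '($f + 0) >= 0' "$2"
}
```

For example, "drop_negative 4 hist/HOSTNAME" prints the cleaned host history and "drop_negative 8 hist/HOSTNAME.svc" prints the cleaned service history. Comparing that output with the original file (for example with diff) before replacing anything shows exactly which records would go away.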
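How to automate the process: first find the hist files that contain records with negative durations.
step 1 (host history files, duration is the 4th field):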
    find hist/ -type f -print0 | xargs -0 grep " -" | awk '{print $1" :"$4}' | grep ":-"
output like:
    ...
    hist/idc-oracle03.msc-sh.local:ssh :-14862
    hist/idc-oracle03.msc-sh.local:dblock :-15012
    hist/idc-oracle03.msc-sh.local:dbrec :-15012
    hist/idc-oracle03.msc-sh.local:dbup :-15011
    hist/idc-oracle03.msc-sh.local:dbext :-14989
    ...
step 2 (service history files, duration is the 8th field):
    find hist/ -type f -print0 | xargs -0 grep " -" | awk '{print $1" :"$8}' | grep ":-"
output like:
    ...
    hist/idc-oracle03,domain.com.dbrec:Sat :-15012
    hist/gdc-oracle03,domain.com.dbup:Sat :-15136
    hist/idc-oracle01,domain.com.disk:Sat :-14961
    hist/gdc-oracle01,domain.com.dbaud:Thu :-26793
    hist/gdc-oracle01,domain.com.dbaud:Sat :-14940
    ...
The removal of these records can then be automated too.
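One possible way to automate the removal is sketched below. It is not something I ran on the production server: the function name clean_hist is made up, the two record layouts are assumed from the samples in this post, and a line is treated as a service history record simply because its first field is a weekday name. Any file under hist/ with a different layout is not handled correctly and should be checked by hand. It is probably safest to run this while XyMon is not writing to the files.

```shell
#!/bin/sh
# clean_hist DIR   (hypothetical helper, not part of XyMon)
# For every regular file under DIR that contains at least one record with a
# negative duration, keep a copy as FILE.bak and rewrite FILE without those
# records. Files without such records are left untouched.
# Assumed layouts (from the samples in this post):
#   host history:    svc start end DURATION oldcolor newcolor n    -> field 4
#   service history: Day Mon DD HH:MM:SS YYYY color start DURATION -> field 8
clean_hist() {
    tmp=$(mktemp) || return 1
    find "$1" -type f ! -name '*.bak' | while IFS= read -r file; do
        # awk exits 0 only if it dropped at least one record
        if awk '
            { f = ($1 ~ /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)$/) ? 8 : 4 }
            ($f + 0) < 0 { dropped++; next }
            { print }
            END { exit (dropped ? 0 : 1) }
        ' "$file" > "$tmp"; then
            # overwrite in place so owner and permissions stay as they were
            cp -p "$file" "$file.bak" && cat "$tmp" > "$file"
            echo "cleaned: $file"
        fi
    done
    rm -f "$tmp"
}
```

Calling "clean_hist hist/" from the XyMon data directory prints one "cleaned:" line per modified file. The .bak copies make it easy to roll back if the history pages look wrong afterwards.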
Best regards,
Andrey Chervonets
SIA CoMinder http://www.cominder.eu/