dropping/making blue checks not persistent when restarting
On 5/22/2017 1:55 AM, Sven Schuster wrote:
Sorry, I should have been a bit more precise in this regard:
- tests are disabled via enable/disable from the Administration menu for some period of time, e.g. 2 hours, without "until OK" checked. It doesn't matter whether you're blueing out a green test (e.g. for planned downtime) or a red one. The problem remains the same.
- the restart is done to make changes visible immediately, so they can be checked right after being applied
- dropped tests belong to checks (or hosts) which don't exist anymore, so there won't be any data coming in for the checks/hosts dropped

Yes, when waiting for some time before restarting after disabling or dropping a check, that change will "survive" the restart. As pointed out in Jeremy Laidman's post, this indeed seems to be due to the checkpoint interval, which is 600 seconds in the local configuration.
Kind regards,
Sven
*Sent:* Friday, 19 May 2017 at 16:02 *From:* "Root, Paul T" <Paul.Root at CenturyLink.com> *To:* "'Sven Schuster'" <Schuster.Sven at gmx.de>, "xymon at xymon.com" <xymon at xymon.com> *Subject:* RE: [Xymon] dropping/making blue checks not persistent when restarting
So, there’s a couple things here.
First, how are you disabling (bluing out) a test (you call check)? Are you checking the “until OK” or are you providing a time limit for the disable? Also, if the test is green why would you want it disabled?
Second, why are you restarting xymon after a config change? All configuration files are re-read (except local-client.cfg) every 5 minutes.
Next, you say dropped tests reappear. Well of course. If the client is providing the test to the server, the server is going to display it. If you don’t want a test in xymon, it has to be disabled at the source.
I don’t understand your second paragraph. Are you saying that you disable a test and then wait 5-10 minutes and the disabled test will remain blue after restarting xymon?
*From:*Xymon [mailto:xymon-bounces at xymon.com] *On Behalf Of *Sven Schuster *Sent:* Friday, May 19, 2017 7:55 AM *To:* xymon at xymon.com *Subject:* [Xymon] dropping/making blue checks not persistent when restarting
Hello everybody,
recently I've been seeing a strange issue on xymon server. When I make a check blue and shortly after xymon gets restarted due to configuration updates, that blue check will be green again afterwards. The same thing happens when a check is dropped and xymon gets restarted directly after that: the dropped check reappears.
If you wait some amount of time before restarting, say 5-10 minutes, the problem won't appear and everything will be fine. I also sync'ed on the server directly after making a check blue and before restarting (to avoid data not being written to disk for some strange reason), which unfortunately did not help.
Environment is xymon 4.3.27 on Debian jessie. Xymon has been updated to 4.3.28 because of this problem lately, with the problem appearing in 4.3.28, too. This server has just been upgraded from Debian wheezy to Jessie a few weeks ago. On wheezy xymon 4.3.27 was in use but didn't show this behaviour.
Did anybody notice such an odd behaviour or maybe have any thoughts regarding possible causes?
Thanks in advance,
Sven
Hi Sven,
This behavior would seem to point in the direction of the checkpoint file not being written out properly on shutdown, especially if it's working fine during the normal checkpointing process (eg, waiting 600 seconds before the restart) and could be a latent bug (or at least a missing error message).
Can you set xymond to --debug mode (or send it -USR2 signal) and then shutdown/restart the process after this change? If shutting down, you can take a quick poke at the checkpoint file to see that it's been updated at the moment of shutdown? Depending on the host in question, you can also search for the test that should "no longer be there" (it's just a simple text file format).
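The check suggested above could be sketched in shell roughly as follows. This is only an illustration: the function names are hypothetical, the checkpoint path is not stated in this thread and is therefore passed in as an argument, and the match is a heuristic that assumes nothing more than host and test name appearing on the same line of the text file.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a checkpoint file after shutdown.
# The file path is an assumption; pass in the one your installation uses.

checkpoint_age() {
    # Print the age of file $1 in whole seconds.
    file=$1
    now=$(date +%s)
    # GNU stat uses "-c %Y"; BSD stat uses "-f %m".
    mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")
    echo $((now - mtime))
}

checkpoint_mentions() {
    # Succeed if some line of file $1 contains both host $2 and test $3.
    grep -F -- "$2" "$1" | grep -qF -- "$3"
}
```

Running `checkpoint_age` on the checkpoint file right after stopping the service shows whether it was written at shutdown (a small number) or only by the last periodic checkpoint; `checkpoint_mentions` then tells whether a dropped test is still recorded.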
The same routine is called at shutdown as is called during the periodic interval checkpointing, except for the fact that we wait synchronously for it to complete -- precisely to avoid this type of concern, but that doesn't mean there isn't an issue there still.
Regards,
-jc
Hi Japheth,
> Hi Sven,
> This behavior would seem to point in the direction of the checkpoint file not being written out properly on shutdown, especially if it's working fine during the normal checkpointing process (eg, waiting 600 seconds before the restart) and could be a latent bug (or at least a missing error message).
that was exactly my thought when taking a look at the source code. The routine for writing the checkpoint file should be called at shutdown, too...
> Can you set xymond to --debug mode (or send it -USR2 signal) and then shutdown/restart the process after this change? If shutting down, you can take a quick poke at the checkpoint file to see that it's been updated at the moment of shutdown? Depending on the host in question, you can also search for the test that should "no longer be there" (it's just a simple text file format).
...and indeed, it *is* called:

    10410 2017-05-23 08:00:48.870364 -> save_checkpoint
    10410 2017-05-23 08:00:48.963874 <- save_checkpoint

These were the last lines of the logfile when stopping xymon. Note that in this case, I *stopped* the xymon service (to be able to take a look at the checkpoint file while xymon is not running). The timestamp of the checkpoint file was updated, and the test I had disabled was still disabled when I started xymon again. Strange.

So I did some further testing. It revealed that on Debian, with systemd being used for starting/stopping services, the restart option of the default SysV initscript isn't used. Instead, systemd calls the initscript with option stop (which TERMs the xymonlaunch process), waits some amount of time (which is probably given by the RestartSec or RestartUSec parameter, see systemd.service(5)), and then calls the initscript again with option start.

It seems that the time between stop and start (which is 100ms in the local environment, probably the default value) is not long enough for the old, terminating xymond process to completely write the checkpoint file (which is roughly 35 MB here, with config changes and disabling/dropping of tests happening quite often and independently). In xymond.c/save_checkpoint, the checkpoint file is first written to a temporary file with a timestamp in its filename. That temp file is renamed to the real checkpoint file later. With so little time between stopping and starting, it seems that the new xymond process, which is starting in the meantime, just reads an old version of the checkpoint file.

To solve this issue on Linux systems using systemd, one might (and of course should ;)) use a real systemd service file with RestartSec set to a sane value (e.g. 1s, like in the old SysV initscript).
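The effect described here can be reproduced with a small shell sketch of the write-to-temp-then-rename pattern. This is not xymond's code: the function names and the temp-file naming are hypothetical and only mirror the "timestamp in the filename" idea. The point is that anything reading the target between the two steps still sees the previous content.

```shell
#!/bin/sh
# Illustration only: a two-step "write temp file, then rename" save.

write_tmp() {
    # Write new content $2 to a temp file next to target $1; print its name.
    tmp="$1.$(date +%s).$$"      # hypothetical naming scheme
    printf '%s\n' "$2" > "$tmp"
    echo "$tmp"
}

publish() {
    # Rename temp file $1 over target $2. Until this runs, readers of the
    # target keep seeing the old version.
    mv -f "$1" "$2"
}
```

A process that opens the target after `write_tmp` but before `publish` loads the old state, which matches the behaviour seen when start follows stop too quickly.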
As a quick fix I added a "sleep 1" in the initscript:

--- xymon.orig 2012-06-27 21:14:29.000000000 +0200
+++ xymon 2017-05-23 10:28:51.983171661 +0200
@@ -49,6 +49,7 @@
 "stop")
 log_daemon_msg "Stopping $DESC" "$NAME"
 start-stop-daemon --exec $DAEMON --pidfile $PIDFILE --stop --retry 5
+ sleep 1
 log_end_msg $?
 ;;

That way restarting xymon works as expected for me. Yet that might leave the (small) chance of that timespan not being long enough in big installations under high load. Which in turn could just be a hypothetical problem, as that behaviour didn't occur with the old initscript (or at least no one noticed).

A clean solution would be to provide a way to do a clean shutdown of the xymon server which does not return before the old processes have really exited (however that might be implemented), so the asynchronous nature of the current stop (sending a TERM to xymonlaunch) is not a concern anymore.

That's at least an explanation, with possible ways of solving it, for the behaviour that seems to make sense based on some tests and some short looks at the source, so please correct me if I'm wrong ;)

Kind regards,
Sven
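A stop that returns only once the old process is gone, as proposed above, might look roughly like this in shell. It is a minimal sketch under stated assumptions: `stop_and_wait` is a hypothetical helper, not part of Xymon or the Debian initscript, and it watches a single PID, so it would have to be pointed at the process that actually writes the checkpoint.

```shell
#!/bin/sh
# Hypothetical helper: send TERM and block until the process has exited.

stop_and_wait() {
    # $1 = PID to terminate, $2 = maximum seconds to wait (default 30).
    pid=$1
    timeout=${2:-30}
    # If the signal cannot be delivered, the process is assumed to be gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    # "kill -0" only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Unlike a fixed `sleep 1`, this waits as long as needed (up to the timeout) and reports failure if the process is still alive. As a side note on the quick-fix diff, `log_end_msg $?` placed after `sleep 1` receives the exit status of `sleep`, not that of `start-stop-daemon`.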
participants (2)
- cleaver@terabithia.org
- Schuster.Sven@gmx.de