Hi Japheth,
Hi Sven, This behavior would seem to point in the direction of the checkpoint file not being written out properly on shutdown, especially if it's working fine during the normal checkpointing process (e.g., waiting 600 seconds before the restart). It could be a latent bug (or at least a missing error message).
that was exactly my thought when taking a look at the source code. The routine for writing the checkpoint file should be called at shutdown, too...
Can you set xymond to --debug mode (or send it a USR2 signal) and then shut down/restart the process after this change? When shutting down, you can take a quick poke at the checkpoint file to see whether it's been updated at the moment of shutdown. Depending on the host in question, you can also search for the test that should "no longer be there" (it's just a simple text file format).
...and indeed, it *is* called:

  10410 2017-05-23 08:00:48.870364 -> save_checkpoint
  10410 2017-05-23 08:00:48.963874 <- save_checkpoint

These were the last lines of the logfile when stopping xymon. Note that in this case, I *stopped* the xymon service (to be able to take a look at the checkpoint file while xymon is not running). The timestamp of the checkpoint file was updated, and the test I disabled was still disabled when I started xymon again. Strange.

So I did some further testing. It revealed that on Debian with systemd being used for starting/stopping services, the restart option of the default SysV initscript isn't used. Instead, systemd calls the initscript with option stop (which TERMs the xymonlaunch process), waits some amount of time (which is probably given by the RestartSec or RestartUSec parameter, see systemd.service(5)), then calls the initscript again with option start.

It seems that the time between stop and start (which is 100ms in the local environment, probably the default value) is not long enough for the old, terminating xymond process to completely write the checkpoint file (which is roughly 35 MB here, with config changes and disabling/dropping of tests happening quite often and independently).

In xymond.c/save_checkpoint it turns out that the checkpoint is first written to a temporary file with a timestamp in the filename. That temp file is renamed to the real checkpoint file later. With that short amount of time between stopping and starting, it seems the new xymond process, which is starting in the meantime, just reads an old version of the checkpoint file.

To solve this issue, on Linux systems using systemd one might (and of course should ;)) use a real systemd service file with RestartSec set to a sane amount (e.g. 1s like in the old SysV initscript).
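To illustrate why a reader can end up with the old checkpoint: the write-to-temp-then-rename pattern described above can be sketched in shell roughly as follows. This is only a sketch of the general pattern, not xymond's actual code; the function name save_checkpoint_sketch and the temp-file naming are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of "write to a temp file, then rename".
# save_checkpoint_sketch and the temp-file name are illustrative only;
# they are not taken from xymond.c.
save_checkpoint_sketch() {
    target=$1                       # final checkpoint path
    tmp="$target.$(date +%s).$$"    # temp name with timestamp and PID
    # Write the complete new contents (read from stdin) to the temp file.
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    # Within one filesystem, mv uses rename(2), which replaces the target
    # atomically: a reader sees the old file or the new one, never a
    # half-written one.
    mv -f "$tmp" "$target"
}
```

With this pattern, a process that opens the checkpoint file before the final rename has happened still gets the complete old contents, without any error, which would match what I observed.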
As a quick fix I added a "sleep 1" in the initscript:

--- xymon.orig 2012-06-27 21:14:29.000000000 +0200
+++ xymon 2017-05-23 10:28:51.983171661 +0200
@@ -49,6 +49,7 @@
   "stop")
     log_daemon_msg "Stopping $DESC" "$NAME"
     start-stop-daemon --exec $DAEMON --pidfile $PIDFILE --stop --retry 5
+    sleep 1
     log_end_msg $?
     ;;

That way restarting xymon works as expected for me. Yet that leaves the (small) chance of that timespan not being long enough in big installations under high load. That in turn could be just a hypothetical problem, as the behaviour didn't occur with the old initscript (or at least no one noticed).

A clean solution would be to provide a way to do a clean shutdown of the xymon server which does not return before the old processes have really exited (however that might be implemented), so that the asynchronous nature of the current stop (sending a TERM to xymonlaunch) is no longer a concern.

That's at least an explanation, and possible ways of solving the behaviour, that seem to make sense based on some tests and some short looks at the source, so please correct me if I'm wrong ;)

Kind regards,
Sven
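A rough sketch of what "not returning before the old process has really exited" could look like in an initscript, instead of a fixed sleep. The helper name wait_for_exit and its default timeout are hypothetical, and which PID to wait for (xymonlaunch's or xymond's) is left open here.

```shell
#!/bin/sh
# Hypothetical helper: block until the process with the given PID has
# exited, or give up after a timeout in seconds (default 30).
# Returns 0 once the process is gone, 1 on timeout.
wait_for_exit() {
    pid=$1
    timeout=${2:-30}
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the stop branch this could be called after start-stop-daemon with a PID read beforehand. Note also that with "sleep 1" placed directly before "log_end_msg $?", $? holds the exit status of sleep rather than that of start-stop-daemon, so the status would have to be saved in a variable first.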
The same routine is called at shutdown as is called during the periodic interval checkpointing, except for the fact that we wait synchronously for it to complete -- precisely to avoid this type of concern, but that doesn't mean there isn't an issue there still. Regards, -jc