Hi all,
I am redesigning the method we use for performing a failover to a disaster recovery installation of hobbit. I am interested in opinions on the approach and any shortcomings.
Note: This is not HA/clustering, it is for DR purposes.
We are aiming to have:
- a production hobbit deployment
- a DR hobbit deployment
Clients will be configured to send metrics to both servers, which will keep historical rrd data etc. up to date on each of them.
The production server will be configured to send out alerts. The DR server will not.
At regular intervals, rsync will be used to synchronise data from the production server to the DR server, including the checkpoint file of the in-memory state.
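As a rough sketch of the sync step, the following builds the rsync commands for one pass and runs them in order. The host name and both paths are placeholders I have assumed for illustration, not verified Hobbit defaults, and the function names are hypothetical.

```python
import subprocess

# Hypothetical placeholders: adjust to the real install.
DR_HOST = "hobbit-dr.example.com"
DATA_DIR = "/home/hobbit/data/"                          # assumed data directory (rrd etc.)
CHECKPOINT_FILE = "/home/hobbit/server/tmp/hobbitd.chk"  # assumed checkpoint path


def build_rsync_command(source, dest_host, dest_path, dry_run=False):
    """Return the rsync argument list for pushing source to dest_host:dest_path."""
    cmd = ["rsync", "-a"]            # -a: archive mode (recursive, keeps mtimes and perms)
    if dry_run:
        cmd.append("--dry-run")      # report what would be copied without copying
    cmd += [source, f"{dest_host}:{dest_path}"]
    return cmd


def build_sync_plan(dest_host=DR_HOST, dry_run=False):
    """Commands for one sync pass: the data directory first, then the checkpoint."""
    return [
        build_rsync_command(DATA_DIR, dest_host, DATA_DIR, dry_run),
        build_rsync_command(CHECKPOINT_FILE, dest_host, CHECKPOINT_FILE, dry_run),
    ]


def run_sync_plan(plan):
    """Run each command in order, stopping at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

From cron on the production server this would be called as run_sync_plan(build_sync_plan()). Copying the checkpoint last keeps it from being older than the data it describes, though the two are still not copied atomically.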
In the event of a DR failover, the DR hobbit server will be promoted to active by restarting hobbit so that it loads the checkpoint and the alert configuration.
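Since the promoted server is only as current as the last checkpoint it received, a pre-promotion check on the age of that file may be worthwhile. This is a minimal sketch; the function names and the idea of a maximum acceptable age are my own assumptions, not part of hobbit.

```python
import os
import time


def checkpoint_age_seconds(path, now=None):
    """Seconds elapsed since the checkpoint file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_checkpoint_fresh(path, max_age_seconds, now=None):
    """True if the file exists and is no older than max_age_seconds."""
    if not os.path.exists(path):
        return False
    return checkpoint_age_seconds(path, now) <= max_age_seconds
```

For example, with a sync every 10 minutes, is_checkpoint_fresh(checkpoint_path, 1200) returning False before promotion would suggest that recent disables and acknowledgements may be missing on the DR side.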
I am expecting that this will ensure that the DR server is "up to date" with production as of the last checkpoint. This includes tests that have been disabled or acknowledged.
Prior to failback to the production hobbit installation, the reverse of the above would be performed. An rsync of rrd data files would be performed to cover any windows where one of the servers was offline for a period of time.
Is there anything wrong with this approach?
Cheers
Phil
-- Tel: 0400 466 952 Fax: 0433 123 226 email: philwild AT gmail.com