On Thu, Jan 27, 2005 at 10:21:50AM -0500, Kauffman, Tom wrote:
> I currently run two BBDisplay systems, with the second serving as a BBNET and BBPAGER failover system. It also serves as a document server and our computer room X server. We recover this system at our D/R hotsite, and part of the recovery scripting in place doctors the bb-hosts file on this and all other recovered systems to set this up as the D/R BB system.
> I see failover (as such) isn't in the initial cut of hobbit. What problems am I setting myself up for if I allow the network tests to run on both systems? I can handle the redundant paging by kludging my own 'fallover' script to rename my paging script on the fallover, but it looks like network testing would be harder to fake out. (I've only got about 200 devices I'm testing)
> And I want both systems to have matching LARRD graphs :-) which is the reason I set things up as is.
As I understand your description, the two BBDISPLAY servers normally run in parallel, each with its own set of data for webpages, RRD files, history logs etc. Is that correct?
Also, is the primary BBNET server running on the same box as the primary BBDISPLAY server, or on a separate system?
If the BBDISPLAY and BBNET functions are combined on the same server both at the main site and the D/R site, then my suggestion is pretty simple: just run the two sites completely in parallel, with each of the BBNET servers reporting to "its own" BBDISPLAY server. The only downside is that the measurements at the two sites might disagree, so there is no guarantee that you'll have identical displays at the two sites.
On the D/R server you just disable the "[bbpage]" task in hobbitlaunch.cfg - that way, no alarms get sent.
You can sync the bb-hosts and hobbit-alerts.cfg files between the two servers without any problems; Hobbit doesn't use the "BBDISPLAY", "BBNET" or "BBPAGER" tags in the bb-hosts file at all.
Handling failover in this situation means somehow getting the D/R server to detect that Hobbit is down on the primary server, and then starting up the hobbitd_alert task to process alerts. You'll probably need to write a script that periodically either checks the primary server directly, or runs bb 127.0.0.1 "query primaryserver.bbd" and takes action when that status changes.
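A rough sketch of such a polling script, in Python. It assumes that the reply to the "query" command starts with the color of the status, and that red or purple should be read as "primary is down"; both are assumptions you should check against your own installation. The function names and the "start"/"stop" return values are made up for this example, and how you actually start or stop the alert task is left to you.

```python
import subprocess

PRIMARY_STATUS = "primaryserver.bbd"   # host.test name used in the text above
DOWN_COLORS = {"red", "purple"}        # assumption: these colors mean "primary is down"


def parse_color(response):
    """Return the first word of the reply, assumed to be the status color."""
    words = response.split()
    return words[0].lower() if words else None


def query_color(status=PRIMARY_STATUS, server="127.0.0.1"):
    """Run: bb 127.0.0.1 "query primaryserver.bbd" and return the color (or None)."""
    result = subprocess.run(
        ["bb", server, f"query {status}"],
        capture_output=True, text=True, timeout=30,
    )
    return parse_color(result.stdout)


def failover_action(color, alerting_enabled):
    """Decide what to do with the local alert task: 'start', 'stop' or None."""
    if color is None:
        # Status unknown: do nothing rather than guess.
        return None
    primary_down = color in DOWN_COLORS
    if primary_down and not alerting_enabled:
        return "start"
    if not primary_down and alerting_enabled:
        return "stop"
    return None
```

You would call query_color() from cron or from a loop with a sleep, pass the result to failover_action(), and remember whether alerting is currently enabled between runs.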
Another option is to set up a hobbitd worker module that picks up events from the "stachg" (status change) channel, and reacts to changes in the state of the primary server. That might be a more elegant solution - see the "hobbitd_sample.c" file in the Hobbit sources for an example of how to write a worker module.
If you only want the BBNET function running on one server at a time (e.g. because you must have identical displays at the two sites), or you have the BBNET server running on a different system at the primary site, then the situation becomes a bit more complex.
The main site where BBNET normally runs would of course be configured to send the results to both of the BBDISPLAY servers. As described above, you can handle failover of the alert function by having the hobbitd_alert module turned off normally, and enabling it if the primary BBPAGER goes down.
The problem as I see it is how the D/R server detects that the primary BBNET server has failed. One possibility is to enable the bbtest-net "--report" option; this makes bbtest-net send in a report about itself as the last status report of each cycle of network tests. Something on the D/R server could monitor when this status was last received, and if it becomes too old, fire up the network testing task on the D/R server. If the lifetime of the status report that bbtest-net generates were set to e.g. 10 minutes, then this could trigger when the status report turns purple - so you could handle it with the same module that keeps an eye on changes in the primary BBDISPLAY server status.
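The age check itself is simple. A minimal sketch in Python, assuming you have some way of recording the time (in seconds since the epoch) when the bbtest-net report was last received; the function names and return values are invented for this example, and the 10-minute limit is just the figure mentioned above.

```python
import time

REPORT_LIFETIME = 10 * 60  # seconds; the 10-minute lifetime suggested above


def report_is_stale(last_received, now=None, lifetime=REPORT_LIFETIME):
    """True if the last bbtest-net report is older than the allowed lifetime."""
    if now is None:
        now = time.time()
    return (now - last_received) > lifetime


def network_test_action(last_received, tests_running, now=None):
    """Decide what to do with the local network test task: 'start', 'stop' or None."""
    stale = report_is_stale(last_received, now)
    if stale and not tests_running:
        return "start"   # primary BBNET looks dead, take over testing
    if not stale and tests_running:
        return "stop"    # primary BBNET is reporting again, hand back
    return None
```

Whether the D/R server should hand testing back automatically when the primary returns is a policy choice; drop the "stop" branch if you prefer to do that by hand.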
I haven't yet decided how to handle failover in Hobbit. I imagine that it means implementing a special "heartbeat" message that the servers send to each other - the BBDISPLAY servers exchange one, and the BBNET servers send one to the BBDISPLAY servers to inform them that they are alive. One of the BBDISPLAY servers then acts as an arbitrator, decides which tasks must run where, and announces this to all of the servers, which then start or stop tasks as needed. If the arbitrator crashes, another BBDISPLAY server will have to take over.
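To make the idea a bit more concrete, here is one possible reading of it as a Python sketch. None of this exists in Hobbit; the 120-second timeout, the rule that the live BBDISPLAY server with the lowest name becomes arbitrator, and all of the names are assumptions made up for illustration.

```python
def live_servers(heartbeats, now, timeout=120):
    """Names of servers whose last heartbeat is recent enough, in sorted order."""
    return sorted(name for name, seen in heartbeats.items() if now - seen <= timeout)


def assign_tasks(display_heartbeats, net_heartbeats, now, timeout=120):
    """Pick an arbitrator and decide where alerting and network tests should run.

    Both heartbeat arguments map a server name to the time its last heartbeat
    was received. Returns an empty dict if no BBDISPLAY server is alive.
    """
    live_display = live_servers(display_heartbeats, now, timeout)
    if not live_display:
        return {}
    # Assumed tie-break: the first live BBDISPLAY server by name arbitrates,
    # so a crashed arbitrator is replaced by the next one automatically.
    arbitrator = live_display[0]
    live_net = live_servers(net_heartbeats, now, timeout)
    return {
        "arbitrator": arbitrator,
        "alerts": arbitrator,
        "network_tests": live_net[0] if live_net else None,
    }
```

With this rule every server that sees the same heartbeats arrives at the same answer, which avoids a separate election step, but it does nothing about the case where the servers cannot see each other while both are up.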
> (We monitor connectivity to RF antennas used for bar-code scanning applications and our motley assortment of oracle and SAP instances on a mob of AIX systems).
Sounds like a lot of custom scripts are in use :-)
Henrik