Hi Matt,
Sorry for the delay. Had some unexpected time away from the keyboard this weekend.
Responses inline.
On Fri, December 4, 2015 3:49 am, Matt Vander Werf wrote:
Hi J.C.,
Thanks for the e-mail and advice!
A couple of questions:
What's the default --lqueue value that Xymon uses? (Is there a way to see what it's using?)
What exactly is your definition of "tons of simultaneous connections" here? Can you give me a number or range that you think would warrant increasing the --lqueue value?
The default is 512, which is compiled in. This really won't need to be increased unless xymond is being bogged down with lots of *literally* simultaneous waiting connections. It can be increased, but there's probably another sort of problem happening: either slow connectivity, high CPU load, or "backpressure" from too many channel workers causing xymond itself to be unable to keep up. I'm trying to think back and I don't think I had cause to increase it until SN was regularly hitting the 2500 msgs/s range, and it was lowered back down once other performance bottlenecks and some packet loss were identified.
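One thing worth checking before raising the value: on Linux the kernel silently caps any listen backlog at net.core.somaxconn. The sketch below assumes that --lqueue ends up as the backlog argument to listen(2) (not confirmed above) and that the cap can be read from /proc. The function names are made up for illustration.

```python
from pathlib import Path

# Linux-specific location of the kernel's cap on listen() backlogs.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def effective_backlog(requested, somaxconn):
    """Return the backlog the kernel would actually use.

    Linux truncates the value passed to listen() to net.core.somaxconn,
    so asking for more than the cap has no effect.
    """
    return min(requested, somaxconn)


def read_somaxconn(path=SOMAXCONN_PATH):
    """Return the kernel cap, or None if it cannot be read (e.g. not Linux)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    cap = read_somaxconn()
    # 512 is the compiled-in default; 768 and 1024 are the values suggested here.
    for requested in (512, 768, 1024):
        if cap is None:
            print(f"requested={requested}: somaxconn unknown on this system")
        else:
            print(f"requested={requested}: effective={effective_backlog(requested, cap)}")
```

If the effective value comes out lower than what you asked for, raising --lqueue alone won't change anything until the sysctl is raised as well.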
Try stracing xymond and seeing what it's doing. If there's a lot of waiting happening for network reading, that might be a sign that increasing lqueue could help. 768 or 1024 should be more than sufficient. Needing anything more than that, except during bursts, means there's some other backlog.
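To put a number on "a lot of waiting", one option is to capture a trace with per-call timings and total them up. This is only a rough sketch: it assumes a capture along the lines of `strace -f -tt -T -p <xymond pid> -o xymond.strace`, and the file name and function name are placeholders.

```python
import re
from collections import defaultdict

# Matches completed calls such as:
#   12:00:00.000100 read(7, "...", 32768) = 10 <0.250000>
# The trailing <seconds> field is what strace -T appends.
# Split "<unfinished ...>" / "resumed" lines are normally skipped.
LINE_RE = re.compile(r"(\w+)\(.*\)\s+=\s+.*<(\d+\.\d+)>\s*$")


def time_per_syscall(lines):
    """Sum the wall-clock time spent in each syscall, keyed by syscall name."""
    totals = defaultdict(float)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py xymond.strace
    with open(sys.argv[1], errors="replace") as handle:
        totals = time_per_syscall(handle)
    for name, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{seconds:10.3f}s  {name}")
```

If read/recv on the client sockets dominates the totals, that points toward slow senders or the network rather than the listen queue itself.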
Could it be from clients/senders with longer than usual process listings? Or other clientlog statistics? (But still under the max client message value.)
It's possible, but unless you're bandwidth-restricted somewhere, senders should generally still be able to complete in the default time frame. If you *are* bandwidth-restricted, then that's definitely something to consider, especially if the machines you're having problems with have a lot of burst network activity. (Speaking of burst network activity, try commenting out the 'netstat' output in the client if you don't have any port checks against the host.)
How would I be able to tell if there are long messages being sent in if the long messages are being discarded?
Yeah, this should probably be added in. Truncated messages have their first line displayed, but it's not so much a 'discard' here as it is a network timeout first and foremost.
An strace with -s 4096 (or some other high number) might be able to catch the first bit of a read from the client, if you're lucky there...
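Picking those first bits out of a long trace by eye gets tedious, so here is a small sketch that pulls the first line of each read payload out of saved strace output. It assumes the default strace string quoting (newlines shown as a literal `\n`), and the function name is made up.

```python
import re

# Matches the fd and the quoted buffer of read/recv/recvfrom calls, e.g.
#   read(7, "first line\nsecond line"..., 32768) = 4096
READ_RE = re.compile(r'\b(?:read|recv|recvfrom)\((\d+), "((?:[^"\\]|\\.)*)"')


def first_lines(lines):
    """Yield (fd, first line of payload) for each non-empty read in strace output."""
    for line in lines:
        match = READ_RE.search(line)
        if match and match.group(2):
            # strace prints a newline as the two characters backslash + n.
            yield int(match.group(1)), match.group(2).split("\\n", 1)[0]


if __name__ == "__main__":
    import sys

    # Usage: python firstlines.py xymond.strace
    with open(sys.argv[1], errors="replace") as handle:
        for fd, first in first_lines(handle):
            print(fd, first)
```

The first line of a message is usually enough to tell which host sent it, which helps narrow down which senders are timing out partway through.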
HTH, -jc