Andy Smith wrote:
Hi,
In February, Gautier reported this issue with xymonproxy on Solaris :-
http://lists.xymon.com/pipermail/xymon/2014-February/039160.html
This week I updated an installation of 4.2.3 on Solaris 9 and encountered the exact same issue as Gautier, but this time on the latest 4.3.17 code :-
2014-05-04 13:05:36 xymonproxy version 4.3.17 starting
2014-05-04 13:20:41 Listening on 0.0.0.0:1984
2014-05-04 13:20:41 Sending to Xymon server(s) xx.xx.xx.xx:1984
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 Too many select failures, aborting
2014-05-04 13:20:46 xymonproxy version 4.3.17 starting
I do not suffer the connections in TIME_WAIT, just the constant restarting of the proxy every 15 minutes. Here is the truss as it gasps when falling over :-
poll(0xFFBFF208, 1, 1000) = 0
time() = 1399206937
poll(0xFFBFF208, 1, 1000) = 0
time() = 1399206938
poll(0xFFBFF208, 1, 1000) = 0
time() = 1399206939
poll(0xFFBFF208, 1, 1000) = 0
time() = 1399206940
poll(0xFFBFF208, 1, 1000) = 0
time() = 1399206941
poll(0xFFBFF208, 1, 1000) = 0
time() = 1399206942
poll(0xFFBFF208, 1, 1000) = 1
accept(3, 0x0003AC60, 0xFFBFF310, 1) = 4
fcntl(4, F_SETFL, 0x00000080) = 0
time() = 1399206942
poll(0xFFBFF200, 2, 1000) = 1
read(4, " s t a t u s + 4 5 c s".., 8185) = 140
time() = 1399206942
poll(0xFFBFF200, 2, 1000) = 1
read(4, 0x00038CE2, 8045) = 0
time() = 1399206942
shutdown(4, 2, 1) = 0
close(4) = 0
poll(0xFFBFF208, 1, 1000) = 1
accept(3, 0x0003ACD0, 0xFFBFF310, 1) = 4
fcntl(4, F_SETFL, 0x00000080) = 0
time() = 1399206942
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " s e l e c t ( ) f a i".., 34) = 34
time() = 1399206942
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " s e l e c t ( ) f a i".., 34) = 34
time() = 1399206942
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " s e l e c t ( ) f a i".., 34) = 34
time() = 1399206942
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " s e l e c t ( ) f a i".., 34) = 34
time() = 1399206942
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " s e l e c t ( ) f a i".., 34) = 34
time() = 1399206942
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " s e l e c t ( ) f a i".., 34) = 34
time() = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
write(2, " ", 1) = 1
write(2, " T o o m a n y s e l".., 35) = 35
_exit(1)
So, question to Gautier, are you using Solaris 9 and have you managed to resolve this?
Another question to the rest of the list: this is actually the only proxy I have on Solaris, all the others are on Red Hat. Is anyone else using xymonproxy on Solaris and, if so, what version? For the time being, I am running the old bbproxy until I get this fixed; the rest of 4.3.17 seems to be working OK.
I have done a bit more digging around. Firstly, if I regress to r#7368 (4.3.13) then xymonproxy on Solaris is stable. This just hides the problem, of course, and might be a factor in Gautier's performance issue.
If I modify the code for 4.3.17 to remove the exit after 5 select() failures and add in some further debugging, I can observe that on Solaris 9 at least :-
- every 900 seconds, select() fails
- select() continues to fail for 2 seconds, then succeeds, and the proxy continues as normal.
- during these 2 seconds, there are no further calls to poll(), but somewhere in the region of 50,000 calls to time().
- the values for the selecttmo structure and maxfd are reasonable, so the invalid argument must be one of the fdread or fdwrite structures.
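A minimal sketch of this kind of extra debugging, for anyone who wants to try the same thing. It is not the xymonproxy code: the names check_select_args, dump_fdset and debug_select are hypothetical, and the parameter names only mirror the maxfd, selecttmo, fdread and fdwrite mentioned above. POSIX documents EINVAL from select() for an out-of-range nfds or an invalid timeout, so the sketch checks both and then lists the descriptors in each set when the call fails.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not xymonproxy code): returns 0 when the arguments
 * are in the range POSIX allows, or a negative code for the first problem. */
int check_select_args(int maxfd, const struct timeval *tmo)
{
    if (maxfd < 0 || maxfd >= FD_SETSIZE)
        return -1;                      /* nfds (maxfd + 1) would be out of range */
    if (tmo != NULL) {
        if (tmo->tv_sec < 0)
            return -2;                  /* negative seconds */
        if (tmo->tv_usec < 0 || tmo->tv_usec >= 1000000)
            return -3;                  /* microseconds not normalised */
    }
    return 0;
}

/* Log every descriptor present in one fd_set and whether it is still open. */
static void dump_fdset(const char *name, fd_set *set, int maxfd)
{
    int fd;

    if (set == NULL)
        return;
    if (maxfd >= FD_SETSIZE)
        maxfd = FD_SETSIZE - 1;         /* stay inside the fd_set bitmap */
    for (fd = 0; fd <= maxfd; fd++) {
        if (FD_ISSET(fd, set))
            fprintf(stderr, "  %s: fd %d (%s)\n", name, fd,
                    fcntl(fd, F_GETFD) == -1 ? "not open" : "open");
    }
}

/* Hypothetical wrapper: call select() and, on failure, log all arguments. */
int debug_select(int maxfd, fd_set *fdread, fd_set *fdwrite, struct timeval *tmo)
{
    struct timeval saved = { 0, 0 };
    int n, err;

    if (tmo != NULL)
        saved = *tmo;                   /* some systems update the timeout in place */
    n = select(maxfd + 1, fdread, fdwrite, NULL, tmo);
    if (n < 0) {
        err = errno;                    /* fprintf may overwrite errno */
        fprintf(stderr, "select() failed: %s; maxfd=%d tmo=%ld.%06ld check=%d\n",
                strerror(err), maxfd, (long)saved.tv_sec, (long)saved.tv_usec,
                check_select_args(maxfd, tmo ? &saved : NULL));
        /* POSIX says the sets are left unmodified when select() fails. */
        dump_fdset("fdread", fdread, maxfd);
        dump_fdset("fdwrite", fdwrite, maxfd);
        errno = err;
    }
    return n;
}
```

Calling debug_select(maxfd, &fdread, &fdwrite, &selecttmo) in place of the plain select() call would log one line of arguments per failure, plus one line per descriptor in each set. Copying the timeout before the call matters because a value inspected afterwards may already have been changed by select() itself.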
I am continuing to collect information, but I am still not sure whether I am looking at a Solaris 9 issue or whether this also affects later Solaris versions.
Andy