TCP/IP stats (bits/s) limited to 100M
Hi,
I have suspicious data in the TCP/IP stats on Solaris and AIX.
The graphs show that my hosts never send or receive more than ~110 Mbit/s.
This raises an alarm about the backup server, which has a GigEth
interface and needs to absorb many backup flows simultaneously.
So I checked other GigEth-equipped hosts.
I have some Solaris hosts with GigEth. Their graphs are inconsistent
between the per-interface details and the general TCP/IP graphs. Take a look:
http://www.unikservice.com/frp/tcpip.png
http://www.unikservice.com/frp/i1.png
http://www.unikservice.com/frp/i2.png
So I suspect a problem with either the data collection or the graphs.
Could somebody tell me where I should start debugging?
Nicolas
Hi,
This looks like the TCP data running into the 32-bit counter problem... Are your counters 32-bit? Could you give us a copy of them?
olivier
-----Original Message-----
From: Nicolas Dorfsman [mailto:ndo at unikservice.com]
Sent: Wednesday, 28 June 2006 10:20
To: hobbit at hswn.dk
Subject: [hobbit] TCP/IP stats (bits/s) limited to 100M
On 28 June 2006, at 11:42, Beau Olivier wrote:
Hi,
This looks like the TCP data running into the 32-bit counter problem... Are your counters 32-bit? Could you give us a copy of them?
I'd be glad to.
Which counter is used by hobbit ?
Nicolas
On Wed, Jun 28, 2006 at 11:42:14AM +0200, Beau Olivier wrote:
Hi,
This looks like the TCP data running into the 32-bit counter problem... Are your counters 32-bit? Could you give us a copy of them?
The RRD files are created as "DERIVE" datatypes with a minimum value of 0, which should handle 32/64-bit counter overflows automatically. (See the rrdcreate man-page).
Note that Hobbit never does any calculations on these values, it passes them directly (as strings) to the RRDtool functions.
Regards, Henrik
Hi,
The ~110 Mbit/s value you get really does point to a 32-bit counter
wrap: with a 32-bit BYTE counter measured every 5 minutes, roughly 110 Mbit/s is the maximum you can count without wrapping the counter.
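The ceiling mentioned above can be checked with a little arithmetic. The following is a minimal sketch, assuming the counter holds bytes and spans exactly 2^32 values; the function name is hypothetical and only serves this illustration.

```python
# A 32-bit byte counter can advance by at most 2**32 bytes before it
# returns to a value it has already shown.
COUNTER_SPAN_BYTES = 2 ** 32


def max_unambiguous_bits_per_second(interval_seconds):
    """Highest average rate (bits/s) at which the counter advances by
    less than one full wrap between two samples."""
    bytes_per_second = COUNTER_SPAN_BYTES / interval_seconds
    return bytes_per_second * 8


if __name__ == "__main__":
    ceiling = max_unambiguous_bits_per_second(300)  # 5-minute sampling
    print(f"{ceiling / 1e6:.1f} Mbit/s")
```

For a 300-second interval this gives about 114.5 Mbit/s, which is in line with the ~110 Mbit/s plateau seen in the graphs.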
As Henrik explained, the wrap should not be caused by
Hobbit or RRD. I think you need to look at your OS counters directly and see whether they wrap in less than 5 minutes.
If you are on Solaris / SunOS, you could use something like the
command below and watch whether any of the counters wraps past the 32-bit maximum (4294967295, if I recall correctly):
(host)$> while [ 1 ]; do date; netstat -s | egrep "(tcpInInorderBytes|tcpOutDataBytes)"; sleep 60; done
Example output:
Wed Jun 28 09:28:28 EDT 2006
        tcpOutDataSegs      =5864959    tcpOutDataBytes     =4273878800
        tcpInInorderSegs    =2670997    tcpInInorderBytes   =725993348
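Spotting a wrap by eye in that output is tedious, so the comparison can be scripted. This is only a sketch: the first sample is the example output above, the second sample is invented for illustration, and the function names are hypothetical. A counter that goes down may also indicate a reset (for example after a reboot), not only a wrap.

```python
import re

# Matches "name = value" pairs such as "tcpOutDataBytes     =4273878800".
COUNTER_RE = re.compile(r"(tcp\w+)\s*=\s*(\d+)")


def parse_counters(text):
    """Return a dict of counter name -> integer value found in the text."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


def decreased_counters(previous, current):
    """Names of counters whose value went down between two samples."""
    return sorted(
        name for name in previous
        if name in current and current[name] < previous[name]
    )


# First sample: taken from the example output in this thread.
first = """
tcpOutDataSegs      =5864959    tcpOutDataBytes     =4273878800
tcpInInorderSegs    =2670997    tcpInInorderBytes   =725993348
"""

# Second sample: HYPOTHETICAL values, chosen so that one counter wraps.
second = """
tcpOutDataSegs      =5901200    tcpOutDataBytes     =31250504
tcpInInorderSegs    =2688000    tcpInInorderBytes   =731000000
"""

if __name__ == "__main__":
    print(decreased_counters(parse_counters(first), parse_counters(second)))
```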
(Caveat) With RRD it is possible to work around this OS
limitation by feeding in the data at shorter intervals, say every 2 minutes. RRD will take care of normalizing the values to the step size used to create the RRD (300 seconds in Hobbit's case).
I'm not sure whether Hobbit will be happy
receiving client data more often than every 5 minutes, but I think it should be transparent.
Hope this can give you some help.
Regards
Werner
----------------------- Original Message -----------------------
From: henrik at hswn.dk (Henrik Stoerner)
To: hobbit at hswn.dk
Date: Wed, 28 Jun 2006 13:09:13 +0200
Subject: Re: [hobbit] TCP/IP stats (bits/s) limited to 100M
On 28 June 2006, at 15:42, Werner (gmail Lists) wrote:
Hi,
The ~110 Mbit/s value you get really does point to a 32-bit counter wrap: with a 32-bit BYTE counter measured every 5 minutes, roughly 110 Mbit/s is the maximum you can count without wrapping the counter.
As Henrik explained, the wrap should not be caused by Hobbit or RRD. I think you need to look at your OS counters directly and see whether they wrap in less than 5 minutes.
If you are on Solaris / SunOS, you could use something like the command below and watch whether any of the counters wraps past the 32-bit maximum
(4294967295, if I recall correctly)
Correct. I found this document, which confirms what you're saying:
http://sunsolve.sun.com/search/document.do?assetkey=1-25-72535-1
On 28 June 2006, at 13:09, Henrik Stoerner wrote:
On Wed, Jun 28, 2006 at 11:42:14AM +0200, Beau Olivier wrote:
Hi,
This looks like the TCP data running into the 32-bit counter problem... Are your counters 32-bit? Could you give us a copy of them?
The RRD files are created as "DERIVE" datatypes with a minimum
value of 0, which should handle 32/64-bit counter overflows automatically. (See the rrdcreate man-page).
Well... the man page is not so confident:
COUNTER
is for continuous incrementing counters like the
ifInOctets counter in a router. The COUNTER data
source assumes that the counter never decreases,
except when a counter overflows. The update
function takes the overflow into account. The
counter is stored as a per-second rate. When the
counter overflows, RRDtool checks if the
overflow happened at the 32bit or 64bit border
and acts accordingly by adding an appropriate
value to the result.
DERIVE
will store the derivative of the line going from
the last to the current value of the data
source. This can be useful for gauges, for
example, to measure the rate of people entering
or leaving a room. Internally, derive works
exactly like COUNTER but without overflow
checks. So if your counter does not reset at 32
or 64 bit you might want to use DERIVE and
combine it with a MIN value of 0.
NOTE on COUNTER vs DERIVE
by Don Baarda <don.baarda at baesystems.com>
If you cannot tolerate ever mistaking the
occasional counter reset for a legitimate
counter wrap, and would prefer "Unknowns"
for all legitimate counter wraps and resets,
always use DERIVE with min=0. Otherwise,
using COUNTER with a suitable max will
return correct values for all legitimate
counter wraps, mark some counter resets as
"Unknown", but can mistake some counter
resets for a legitimate counter wrap.
For a 5 minute step and 32-bit counter, the
probability of mistaking a counter reset for
a legitimate wrap is arguably about 0.8% per
1Mbps of maximum bandwidth. Note that this
equates to 80% for 100Mbps interfaces, so
for high bandwidth interfaces and a 32bit
counter, DERIVE with min=0 is probably
preferable. If you are using a 64bit
counter, just about any max setting will
eliminate the possibility of mistaking a
reset for a counter wrap.
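One plausible way to arrive at a figure close to the note's "0.8% per 1Mbps" is sketched below. It assumes, and this is an assumption of the sketch rather than something the man page states, that after a reset the apparent increase is spread evenly over the 32-bit range, so the chance of it passing the max check is the accepted range divided by 2^32. The function name is hypothetical.

```python
COUNTER_SPAN_BYTES = 2 ** 32  # range of a 32-bit byte counter


def reset_mistaken_for_wrap_probability(max_bits_per_second, step_seconds=300):
    """Fraction of the 32-bit range that a 'max' setting still accepts.

    ASSUMPTION: after a reset, the apparent increase (computed as if the
    counter had wrapped) is uniformly distributed over the counter range.
    """
    accepted_bytes = max_bits_per_second / 8 * step_seconds
    return min(1.0, accepted_bytes / COUNTER_SPAN_BYTES)


if __name__ == "__main__":
    for mbps in (1, 100):
        p = reset_mistaken_for_wrap_probability(mbps * 1e6)
        print(f"max = {mbps} Mbps: {p:.1%}")
```

Under this assumption the result is about 0.87% for 1 Mbps and about 87% for 100 Mbps, the same order of magnitude as the rounded figures in the note.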
In my particular case (and maybe for any large GigEth flow), COUNTER
with max set to 4294967295 should be the solution.
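The difference between the two data source types quoted above can be modeled roughly as follows. This is a simplified Python sketch of the behavior the man page describes, not RRDtool's actual code; the function names are hypothetical, the first sample is the tcpOutDataBytes value posted earlier in the thread, and the second sample is invented.

```python
WRAP_32 = 2 ** 32
WRAP_64 = 2 ** 64


def counter_rate(prev, curr, step, max_rate=None):
    """Model of COUNTER: a decrease is treated as an overflow."""
    delta = curr - prev
    if delta < 0:
        delta += WRAP_32            # assume a wrap at the 32-bit border
    if delta < 0:
        delta += WRAP_64 - WRAP_32  # otherwise assume the 64-bit border
    rate = delta / step
    if max_rate is not None and rate > max_rate:
        return None                 # out of range: stored as unknown
    return rate


def derive_rate(prev, curr, step, min_rate=0.0):
    """Model of DERIVE with a minimum: no overflow handling at all."""
    rate = (curr - prev) / step
    if min_rate is not None and rate < min_rate:
        return None                 # negative rate: stored as unknown
    return rate


if __name__ == "__main__":
    prev, curr = 4273878800, 31250504  # curr is a HYPOTHETICAL sample
    print("COUNTER:", counter_rate(prev, curr, 300))
    print("DERIVE :", derive_rate(prev, curr, 300))
```

In this model DERIVE with min=0 records the interval as unknown, while COUNTER recovers a rate from a single wrap. Neither can recover the true rate if the counter wraps more than once within one step.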
On 28 June 2006, at 15:42, Werner (gmail Lists) wrote:
(Caveat) With RRD it is possible to work around this OS limitation by feeding in the data at shorter intervals, say every 2 minutes. RRD will take care of normalizing the values to the step size used to create the RRD (300 seconds in Hobbit's case).
I'm not sure whether Hobbit will be happy receiving client data more often than every 5 minutes, but I think it should be transparent.
Hmm. I'd prefer to try to fix the RRD file. It may be tricky (export,
import, etc.), but more reliable.
Hope this can give you some help.
It definitely helps, thanks!
Nicolas
On Wed, Jun 28, 2006 at 10:20:11AM +0200, Nicolas Dorfsman wrote:
The graphs show that my hosts never send or receive more than ~110 Mbit/s. This raises an alarm about the backup server, which has a GigEth interface and needs to absorb many backup flows simultaneously. So I checked other GigEth-equipped hosts. I have some Solaris hosts with GigEth. Their graphs are inconsistent
between the per-interface details and the general TCP/IP graphs. Take a look:
http://www.unikservice.com/frp/tcpip.png
http://www.unikservice.com/frp/i1.png
http://www.unikservice.com/frp/i2.png
So I suspect a problem with either the data collection or the graphs.
Could somebody tell me where I should start debugging?
Let me explain where these data come from.
The first graph ("TCP/IP statistics") is fed by data from the "netstat -s" command. This is what it looks like (from a Solaris host):
TCP     tcpRtoAlgorithm     =     4     tcpRtoMin           =   400
        <snip>
        tcpCurrEstab        =     0     tcpOutSegs          =51380214
        tcpOutDataSegs      =17936799   tcpOutDataBytes     =4114388778
        <more snip>
        tcpInSegs           =59097243
        tcpInAckSegs        =19928198   tcpInAckBytes       =4108598170
        tcpInDupAck         =9794396    tcpInAckUnsent      =     0
        tcpInInorderSegs    =34384580   tcpInInorderBytes   =1273412387
        tcpInUnorderSegs    =970394     tcpInUnorderBytes   =694993056
        tcpInDupSegs        = 70767     tcpInDupBytes       =20764736
Hobbit tracks the "tcpOutDataBytes" and "tcpInInorderBytes" counters for the first graph. These are fed into an RRD file, which computes the difference between two measurements and from that derives the average number of bytes per second over a 5-minute period. For the graph, this is then multiplied by 8 to go from bytes/second to bits/second.
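As a worked example of that calculation, here is a minimal sketch of how two byte-counter samples turn into a bits/second figure. In Hobbit the differencing is done by RRDtool and the multiplication happens at graph time; this standalone function is hypothetical, and the second sample value is invented (the first is the tcpOutDataBytes value shown above).

```python
def bits_per_second(prev_bytes, curr_bytes, seconds_elapsed):
    """Average rate between two samples of a byte counter, in bits/s."""
    bytes_per_second = (curr_bytes - prev_bytes) / seconds_elapsed
    return bytes_per_second * 8  # 8 bits per byte


if __name__ == "__main__":
    first_sample = 4114388778    # tcpOutDataBytes from the output above
    second_sample = 4151888778   # HYPOTHETICAL value 5 minutes later
    print(bits_per_second(first_sample, second_sample, 300))
```

With these two samples the counter grew by 37,500,000 bytes in 300 seconds, which is 125,000 bytes/s or 1 Mbit/s.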
What this means is that Hobbit does not count UDP traffic or other non-TCP traffic in this graph. If you have lots of streaming data which typically uses UDP, this can be a significant amount of data.
Also, it doesn't count out-of-order packets (retransmits, duplicate packets - see your OS documentation to learn exactly what goes into the "tcpInUnorderBytes" counter).
The second graph is fed by data from the Solaris "kstat" utility, or from AIX's "netstat -v" output. As far as I understand, this counts raw Ethernet packet bytes, i.e. all protocols. They are fed into RRD files just like the TCP statistics.
So - most likely the difference is in what protocols are counted for each of the graphs.
Regards, Henrik
participants (4)
- henrik@hswn.dk
- ndo@unikservice.com
- olivier.beau@telecomitalia.fr
- wxxx333@gmail.com