disk graph page limits total file systems
Another update for this topic:
I added 100 file systems to a couple of systems to see what would happen with the graphs. The target systems were different from the one that spawned this topic.
When I added 100 file systems to a Linux (RHEL 6.6) system, all file systems were reported/graphed on the disk page. When I added 100 file systems to an AIX (v7.1) system, the file systems were truncated, although at a different point: 120 of 132 file systems were represented.
I don't understand why they're so different. Comparing the df portions of the messages from each of the systems does not reveal any obvious differences.
I'd really appreciate some suggestions for debugging. What commands can I run manually that the disk page is running internally? I've looked at the code, but haven't quite figured out what's going on in there; my C skills are rubbish.
Erik D. Schminke | Associate Systems Programmer Hormel Foods Corporation | One Hormel Place | Austin, MN 55912 Phone: (507) 434-6817 edschminke at hormel.com | www.hormelfoods.com
On Mon, May 23, 2016 12:46 pm, EDSchminke at Hormel.com wrote:
Hi Erik,
This actually helps a great deal, as it implies there's a distinction in the parsing code ... and potentially not an issue on the display side at all (which I've been poring over with little success).
Can you confirm whether the RRD files themselves are being properly updated for both the AIX and Linux systems? (It might help to disable caching in xymond_rrd during this process, if your system has enough spare I/O capacity.) In theory all partitions that are coming in should have their .rrd files updated continually, but if there's a parsing issue then that might explain one aspect of the failure.
Alternatively, can you try adding and removing partition values in the client report and see if going above and below the 85-partition value reliably enables the 86th?
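One way to check whether the RRD files are being updated is to compare their modification times against the current time. The following is a minimal Python sketch, not part of Xymon. The directory argument and the "disk*.rrd" naming pattern are assumptions, so adjust both to match your installation.

```python
import glob
import os
import time


def stale_rrds(rrd_dir, max_age_seconds=600, pattern="disk*.rrd", now=None):
    """Return (filename, age_in_seconds) for RRD files not modified recently.

    rrd_dir and pattern are assumptions about where the per-host disk RRD
    files live and how they are named; they are not taken from Xymon itself.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(glob.glob(os.path.join(rrd_dir, pattern))):
        age = now - os.path.getmtime(path)
        if age > max_age_seconds:
            stale.append((os.path.basename(path), int(age)))
    return stale
```

Running this against the directory for the AIX host and for the Linux host, a few minutes apart, would show whether any partitions have stopped receiving updates.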
It might be helpful to manually edit the xymonclient-$OS.sh script to grep out (or include additional) lines of the 'df' output.
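In the client script itself this would be a grep in the 'df' pipeline. To prototype the filter against a captured 'df' output first, a rough Python equivalent could look like the sketch below. The function name is hypothetical, and it assumes the mount point is the last whitespace-separated field on each line.

```python
def filter_df_lines(df_output, excluded_mounts):
    """Drop df rows whose mount point (assumed to be the last field) is excluded.

    The first line is treated as the header and always kept.
    Blank lines in the body are dropped.
    """
    lines = df_output.splitlines()
    if not lines:
        return ""
    header, body = lines[0], lines[1:]
    kept = []
    for line in body:
        fields = line.split()
        if fields and fields[-1] not in excluded_mounts:
            kept.append(line)
    return "\n".join([header] + kept)
```

Counting the lines before and after filtering gives the number of partitions the server should see in the [df] section.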
Can you also confirm that the remainder of the client report (CPU/memory/etc.) is being handled OK, even on the AIX system?
So far I've been unable to duplicate this, but I was primarily testing on x86_64 Linux VMs.
Regards, -jc
JC,
I think I'm starting to see a pattern emerge, and a theory develop, here. Hope everyone is able to follow this... here goes:
I think there may be a disconnect between how the disk page determines how many filesystems SHOULD be graphed and the number of RRD files that are available TO BE graphed. I think the reason the trends page seems to work OK is that it just graphs all the data it has available, without condition. It seems like the disk page determines 1) the number of filesystems to graph and, 2) based on that number, the number of filesystems per image. These numbers seem to be determined BEFORE it generates the HTML that produces the link HREFs and image SRCs. It then seems to produce just enough graphs to satisfy the predetermined number, plus enough to satisfy the predetermined "multiple".
The predetermined number of filesystems seems to come from the number of filesystems reported in the previous message from the client. I believe the assumption was made that those numbers should always match, and for the most part they do. It's not every day that sysadmins remove filesystems from their systems, so until now it may not have been easy to spot.
It's a little more obvious to me, being primarily an AIX administrator. We have a daily process that creates an "alt_disk_copy" of our rootvg so that we always have a hot backup of the OS. This process causes a lot of transient filesystems to be created, and those filesystems get reported and recorded during the brief window that this process is running. On closer examination of my AIX systems, it is not just the one with 85 filesystems getting truncated... it's all of them.
When I view the HTML source of the disk page, I see that just ahead of the HTML code that displays the graph images and links, there is an HTML comment line, "<!-- linecount=x -->", where x equals the number of filesystems that were reported in the previous message from the client. (Count the number of lines, excluding the header, in the [df] section.) I went through each of my systems, Linux and AIX, and determined that to always be the case.
There must also be some threshold that determines the number "y" of filesystems to display on each graph. It seems that y=4 when x<80 and y=5 when x>=80. (If y changes to 6 at some point, I haven't done enough testing to determine where that threshold is.)
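The counting rule described above (lines in the [df] section, excluding the header) can be reproduced by hand against a saved client message. This is a minimal Python sketch of that rule as stated here, not a copy of what Xymon does internally. It assumes sections are introduced by a line of the form "[name]" and that the first non-blank line of the [df] section is the header.

```python
def df_linecount(client_message):
    """Count data lines in the [df] section, skipping its header line.

    Assumes each section starts with a "[name]" line and that the first
    non-blank line after "[df]" is the df header.
    """
    in_df = False
    seen_header = False
    count = 0
    for line in client_message.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_df = stripped == "[df]"
            seen_header = False
            continue
        if not in_df or not stripped:
            continue
        if not seen_header:
            seen_header = True  # first non-blank line is the header
            continue
        count += 1
    return count
```

Comparing this figure with the value in the "linecount=" comment on the disk page is a quick way to test the theory on any host.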
The request to showgraph.cgi includes the parameters first=z and count=y. If there are no more RRD files to graph, it stops and the graph shows fewer filesystems than the count parameter. But, if you have a situation where you have more data available than the predetermined number of filesystems, it will continue to graph them.
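To make the request pattern concrete, the sketch below enumerates the first/count pairs that a page with a given linecount would ask for. The parameter names first and count come from the observation above. The assumption that first starts at 1, and the helper name itself, are illustrative only.

```python
def graph_requests(linecount, per_graph, first_index=1):
    """List the (first, count) parameter pairs needed to cover linecount items.

    first_index=1 is an assumption; the real starting value was not checked.
    """
    requests = []
    first = first_index
    remaining = linecount
    while remaining > 0:
        requests.append({"first": first, "count": per_graph})
        first += per_graph
        remaining -= per_graph
    return requests
```

Note that the last request still asks for a full count, which is consistent with the final graph being filled from whatever RRD files happen to exist.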
On the system that previously seemed limited to 85 file systems, I modified the "hobbitclient-$os.sh" script and grepped out a certain number of file systems. After doing this, I had 77 filesystems reported. That number was reflected in the "linecount=" HTML comment, and I also began seeing 4 filesystems per graph (instead of 5 previously) and 20 graphs being displayed (instead of 17 previously), for a total of 80 filesystems being graphed. It graphed 80 because it still had enough data from RRD files to round out the last graph. Also, the filesystems that were grepped out of the message from the client were still graphed.
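The arithmetic behind those numbers can be written out as a short Python sketch. The 80-filesystem threshold and the 4/5 per-graph values are the ones observed in this thread, not values confirmed from the Xymon source.

```python
import math


def graph_layout(linecount):
    """Return (per_graph, graphs, slots) under the thresholds observed above."""
    per_graph = 4 if linecount < 80 else 5   # observed, not confirmed in code
    graphs = math.ceil(linecount / per_graph)
    return per_graph, graphs, per_graph * graphs


print(graph_layout(85))  # (5, 17, 85)
print(graph_layout(77))  # (4, 20, 80)
```

With 77 reported filesystems there are 80 slots, so three extra RRD files get graphed if they exist.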
I also went back and checked my Linux systems, the ones where I added 100 filesystems. On those systems, I created enough filesystems to push past that 85-filesystem "limit". Since those all graphed successfully, I had previously thought that it was a difference between AIX and Linux. That no longer seems to be the case. Now that I have removed all of those test file systems, and since it's only reporting 10 filesystems, only 10 filesystems are being graphed. File systems like /, /boot, and /home are graphed, but the test ones that I removed are still being graphed, and filesystems that you would expect to see at the end alphabetically (e.g. /usr, /var, /opt, /tmp, etc.) are not displayed.
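A small simulation shows how this theory would produce exactly that symptom. The Python sketch below assumes the graphs walk the existing RRD files in sorted order and stop once the predetermined number of slots is filled. Both the ordering and the helper name are assumptions made for illustration.

```python
def graphed_filesystems(rrd_filesystems, linecount, per_graph):
    """Pick which filesystems end up graphed under the theory above.

    Slots are linecount rounded up to a multiple of per_graph; the slots are
    filled from all filesystems that still have an RRD file, in sorted order
    (the ordering is an assumption).
    """
    slots = -(-linecount // per_graph) * per_graph  # ceiling to a multiple
    return sorted(rrd_filesystems)[:slots]


# Four filesystems are currently reported, but RRD files for three removed
# test filesystems still exist on the server.
have_rrd = ["/", "/boot", "/usr", "/var", "/test01", "/test02", "/test03"]
print(graphed_filesystems(have_rrd, linecount=4, per_graph=4))
# ['/', '/boot', '/test01', '/test02']
```

The stale test filesystems take up slots, and the live /usr and /var fall off the end.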
A lot of speculation, I realize that, but the theory seems to fit reality in all cases. I haven't examined the code to prove it out since, as I've said before, my C skills are rubbish. But if my theory proves to be true, the suggestion for improvement that I would offer is to make sure that at least every file system from the most recent message is represented, plus any additional file systems that might have data available in the time period requested, i.e. between "graph_start" and "graph_end".
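The proposed selection rule amounts to a set union, sketched below in Python. The input shapes are hypothetical: reported is the list of mount points from the latest client message, and rrd_data_ranges maps each filesystem with an RRD file to the (first, last) timestamps of its data.

```python
def filesystems_to_graph(reported, rrd_data_ranges, graph_start, graph_end):
    """Union of currently reported filesystems and those with data in range.

    rrd_data_ranges: dict of filesystem -> (first_timestamp, last_timestamp).
    A filesystem qualifies if its data range overlaps [graph_start, graph_end].
    """
    selected = set(reported)
    for fs, (first_ts, last_ts) in rrd_data_ranges.items():
        if first_ts <= graph_end and last_ts >= graph_start:
            selected.add(fs)
    return sorted(selected)
```

Under this rule a transient filesystem shows up only while the requested period overlaps its data, and a currently reported filesystem is never pushed out.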
Erik D. Schminke | Associate Systems Programmer Hormel Foods Corporation | One Hormel Place | Austin, MN 55912 Phone: (507) 434-6817 edschminke at hormel.com | www.hormelfoods.com
From: "J.C. Cleaver" <cleaver at terabithia.org> To: EDSchminke at Hormel.com Cc: "Xymon Mailing List" <xymon at xymon.com> Date: 05/23/2016 10:32 PM Subject: Re: [Xymon] disk graph page limits total file systems
Hi Ed,
Apologies for the delay, there've been some RL issues getting in the way here.
Thank you for the analysis below; I think you're near the issue here. Looking at lib/htmllog.c:422 et seq, there's even a comment on the possible issues with the line parsing logic. The storage-of-previous-info might be a red herring in that ... I'm not seeing a way that actually gets stored in the first place. On the other hand, the graphs *could* be being affected by something similar: the HG_WITHOUT_STALE_RRDS value.
The line counting looks like it's "reasonable enough", but I could also see complications from unusually-named or unusually-wrapped partitions confusing it about the real number.
I don't have access to an AIX system at the moment, but is there a POSIX-mode or guaranteed-no-line-wrap option for its 'df' command? If so, the lack of it in $OS.sh is a problem.
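To see whether wrapping would change the count, a captured 'df' output can be normalised and compared against the raw line count. The Python sketch below assumes the common wrap form where a long device name sits alone on one line and the remaining columns follow on the next. The expected_fields default of 6 is an assumption, and mount points containing spaces are not handled.

```python
def unwrap_df(df_output, expected_fields=6):
    """Join df rows that were wrapped across lines into single rows.

    A line with fewer than expected_fields whitespace-separated fields is
    treated as the start of a wrapped row and merged with what follows.
    """
    merged = []
    pending = ""
    for line in df_output.splitlines():
        candidate = (pending + " " + line.strip()).strip()
        if len(candidate.split()) < expected_fields:
            pending = candidate  # incomplete row, wait for the next line
            continue
        merged.append(candidate)
        pending = ""
    if pending:
        merged.append(pending)  # keep any trailing fragment visible
    return merged
```

If len(unwrap_df(text)) differs from the number of raw lines, a line-based count would be off by the same amount.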
Two other ways to test here:
Can you take an existing disk status report and reinject it, including the HTML comment <!-- linecount=XX --> with the proper number in XX? Per line 431, that value should be used instead of a figure calculated at display time. (This seems like something xymond_client.c might/should include at status-generation time, since we're already going through the values anyway, but it's not at the moment. Probably should be added.)
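Preparing the text for that test is plain string editing. The Python sketch below replaces an existing linecount comment or appends one at the end of the status text. Where exactly the comment needs to sit for Xymon to pick it up is not confirmed here, so the append position is an assumption, and the helper name is hypothetical.

```python
import re

LINECOUNT_RE = re.compile(r"<!--\s*linecount=\d+\s*-->")


def set_linecount(status_text, linecount):
    """Return status_text carrying a '<!-- linecount=N -->' comment.

    An existing comment is replaced; otherwise one is appended on its own
    line at the end (the position is an assumption).
    """
    comment = "<!-- linecount=%d -->" % linecount
    if LINECOUNT_RE.search(status_text):
        return LINECOUNT_RE.sub(comment, status_text, count=1)
    return status_text.rstrip("\n") + "\n" + comment + "\n"
```

The resulting text can then be sent back as the body of the disk status for the host under test.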
Secondly, can you add '&nostale' to the RRD graph page loads? That should ensure that partitions are *always* displayed even if the underlying RRD file hasn't been updated recently.
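For testing several graph URLs at once, the flag can be appended with a small helper. This Python sketch only edits the URL string. The example URL is hypothetical, and fragments (#...) are not handled.

```python
def add_nostale(url):
    """Append the bare 'nostale' flag to a URL's query string if missing."""
    query = url.split("?", 1)[1] if "?" in url else ""
    if "nostale" in query.split("&"):
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + "nostale"


# Hypothetical graph URL; only first/count come from this thread.
print(add_nostale("http://xymon.example.com/showgraph.cgi?first=1&count=5"))
# http://xymon.example.com/showgraph.cgi?first=1&count=5&nostale
```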
HTH, -jc
On Wed, May 25, 2016 11:40 am, EDSchminke at Hormel.com wrote:
participants (2)
- cleaver@terabithia.org
- EDSchminke@Hormel.com