On Sun, Oct 14, 2007 at 10:19:22PM -0400, Scott Walters wrote:
One of the most common requests about the trending of data is "How do I make the charts graph data samples which are smaller than 300 seconds?" And the answer has been: you have the source, have fun.
The original design decision that Henrik inherited was that larrd should only be for capacity planning and NOT real-time performance analysis. Do one thing and do it well.
I had a thought the other day, and I think we could possibly get the "best of both worlds."
Instead of
$ vmstat 300 2 (resulting in one 300 second sample)
why not
$ vmstat 5 61 (resulting in sixty 5 second samples)
The data would still only be transported every five minutes, but contain more granular samples.
Scott, You have obviously been on the receiving end of LARRD related questions for a long time, so I guess you know what the users have asked for.
I haven't had a lot of requests for more granular data to begin with; most of the requests have been for the fine-grained (5-minute) data to be maintained for a longer period of time than the current 48 hours. In the next version (or the current snapshot), you can define RRAs individually for each type of RRD file. So you can configure the vmstat RRDs to maintain the fine-grained data for a longer time. That should take care of this issue.
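To make the retention arithmetic concrete: with one row per 300-second step, 48 hours of fine-grained data is 576 rows, and longer retention scales linearly. A minimal Python sketch, assuming the RRA string follows the RRDtool "create" syntax (RRA:CF:xff:steps:rows); the function name rra_for_retention is hypothetical, and this says nothing about how the Hobbit configuration itself expresses it.

```python
def rra_for_retention(step_seconds, retention_seconds,
                      steps_per_row=1, cf="AVERAGE", xff=0.5):
    """Build an RRA definition that keeps `retention_seconds` of data.

    Hypothetical helper for illustration only. Each row covers
    step_seconds * steps_per_row seconds of data.
    """
    row_span = step_seconds * steps_per_row
    # Ceiling division so the archive never covers less than requested.
    rows = -(-retention_seconds // row_span)
    return f"RRA:{cf}:{xff}:{steps_per_row}:{rows}"


# Current behaviour: 5-minute data kept for 48 hours.
print(rra_for_retention(300, 48 * 3600))
# Example of a longer retention: 5-minute data kept for 30 days.
print(rra_for_retention(300, 30 * 86400))
```

Note that the row count grows with the retention period, so keeping fine-grained data much longer makes each RRD file correspondingly larger.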
I think your idea is worth looking into.
This could not be done for all metrics, but for many. It would also require the RRAs of all the RRDs to be re-made (export, re-create, import). But that's been on my mind anyway, because the original RRA structure was based on screen sizes for 800x600 instead of business requirements.
If I understand your suggestion correctly, you would change the client to run "vmstat 5 61" (for instance), collect all 60 samples, and then send them off to Hobbit every 5 minutes. So we would essentially be caching data for 5 minutes on the client, then send it off to the Hobbit server and do a single multi-update of the RRD data when it arrives.
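A rough sketch of what "a single multi-update" could look like, assuming every cached sample already carries an absolute timestamp. The rrdtool "update" command accepts several timestamp:value[:value...] arguments in one call, in increasing time order. The function name build_rrd_update_args and the file name vmstat.rrd are hypothetical, and this builds a command line purely for illustration; it is not a description of how the Hobbit server updates its RRD files.

```python
def build_rrd_update_args(rrd_path, samples):
    """Return an argv list for one `rrdtool update` call covering all samples.

    `samples` is a list of (timestamp, [value, ...]) pairs. Hypothetical
    helper: the values must be in the order of the data sources in the RRD.
    """
    # rrdtool rejects updates that are not newer than the previous one,
    # so sort the batch by timestamp before building the arguments.
    ordered = sorted(samples, key=lambda s: s[0])
    args = ["rrdtool", "update", rrd_path]
    for ts, values in ordered:
        args.append(":".join([str(int(ts))] + [str(v) for v in values]))
    return args


batch = [(1192414765, [3, 97]), (1192414760, [5, 95])]
print(build_rrd_update_args("vmstat.rrd", batch))
```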
One complication with this is that Hobbit needs to determine the timestamps for each of the samples, because RRDtool needs each measurement timestamped. In the current setup, Hobbit just uses the time that the data arrives from the client - this will be "close enough" to the time the measurement was done to work. But if the client caches the data for some amount of time, we have to find a way of generating the correct timestamps. Just having the client timestamp it with its own local time won't work - there are too many hosts where the clocks are way off. I guess this could be done by having the client timestamp the data, but then use these as relative timestamps (so we can see sample 10 was done 236 seconds before the last sample) and then work out the exact timestamps over on the Hobbit server, like we do today.
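The relative-timestamp idea can be sketched as follows. The client's clock is used only to measure how long before the last sample each measurement was taken; the server then anchors the last sample at the arrival time, as it does today for the single sample. The function name absolute_timestamps is hypothetical, and the sketch assumes the report is sent right after the last sample is taken.

```python
def absolute_timestamps(client_times, server_arrival_time):
    """Convert client-local sample times into server-side timestamps.

    Only the differences between client times are trusted, so a client
    clock that is hours off still yields correct spacing between samples.
    Assumption: the newest sample was taken just before the data arrived.
    """
    if not client_times:
        return []
    newest = max(client_times)
    return [server_arrival_time - (newest - t) for t in client_times]


# A client whose clock is wildly wrong: samples at local times 764 and 1000,
# i.e. the first sample was taken 236 seconds before the last one.
print(absolute_timestamps([764, 1000], 1192414800))
```

This only corrects for a wrong clock offset; a client clock that runs fast or slow within the 5-minute window would still distort the spacing slightly.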
This could be done - it would require a bit of change to the clients, but I'm not really happy with the current way the vmstat data collection works (it usually leaves a vmstat process hanging around when the client is stopped), so I wouldn't mind having to do some code for this. I'd probably write a small tool to run "vmstat 60" so it runs forever, and then the tool would pick up the data, timestamp it and then regularly feed it into the client report.
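As a sketch of the kind of small tool described above: something that reads vmstat output line by line, skips the header lines, and tags each data line with the local time it was read. The function name timestamp_samples is hypothetical, and the header check assumes data lines start with a number while header lines do not, which matches common vmstat output but may not hold on every platform.

```python
import time


def timestamp_samples(lines, clock=time.time):
    """Yield (local_timestamp, fields) for each vmstat data line.

    `lines` is any iterable of text lines, for instance the stdout of a
    long-running vmstat process. Header and blank lines are skipped.
    """
    for line in lines:
        fields = line.split()
        # Assumption: data lines begin with a number, headers do not.
        if not fields or not fields[0].isdigit():
            continue
        yield int(clock()), fields


sample_output = [
    "procs -----------memory----------",
    " r  b   swpd   free",
    " 1  0      0 123456",
]
for stamp, fields in timestamp_samples(sample_output):
    print(stamp, fields)
```

In a real tool the lines would come from the standard output of the running vmstat process, and the tool would also have to make sure that process is stopped when the client is stopped.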
And of course the server-end would need changing to accommodate the new data format and the multiple updates. It's certainly doable, without a whole lot of re-designing.
But I think we should consider which datasets one might want to have these frequent updates for. vmstat is obvious; but what about memory utilisation? Disk utilisation rarely changes rapidly - or perhaps it does? Process counts? Network test response times? Once we start doing it for vmstat, I'd expect everyone to come forward and ask for it for lots of other datasets - so instead of doing a quick hack just for vmstat, we should consider what would be the "right" way of doing it for all/most of the data.
Henrik, do you follow my thinking? It's kinda hard for me to believe it's taken me over five years to think of this!
Things take time - and you often don't get it right until the third try.
My biggest concern is not the technical details of the collectors and RRD/RRA restructuring, but inflicting resource usage on servers measuring themselves.
$ vmstat 1 301 would definitely be a bad idea.
Agreed - but I don't think that should be something Hobbit decides. I can easily imagine a scenario where you would do that for some troubleshooting situation, and if that is what is needed then Hobbit should let you do it. No reason to set up arbitrary restrictions. (This is in line with Unix thinking - "if you insist on shooting your foot off, it's your decision to do so". Just as "rm -rf /" is not recommended, but still possible.)
Regards, Henrik