Hi,
I wanted to move from a 5 minute interval on all my clients to a 1 minute interval.
I went to my first AIX host to test and edited /usr/local/hobbit/client/etc/clientlaunch.cfg, changing the interval to 1m (I assume this is correct).
Well, sure enough, hobbit launches stuff every minute, but the problem is that I see a ton of "vmstat 300 2" processes running. Looking at /usr/local/hobbit/client/bin/hobbitclient-aix.sh, I see that "vmstat 300 2" is hardcoded into the script. So do I need to update that to reflect 1 minute as well (i.e. vmstat 60 2)? Or is this by design? Are there other settings that might need to change that I don't know about? Is the way I am going about this wrong?
Thanks, Jeff
On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:
I wanted to move from a 5 minute interval on all my clients to a 1 minute interval.
I went to my first AIX host to test, and changed /usr/local/hobbit/client/etc/clientlaunch.cfg and changed the interval to 1m (I assume this is correct)
Yep.
Well, sure enough, hobbit launches stuff every minute, but the problem is that I see "vmstat 300 2" running a ton. So looking at /usr/local/hobbit/client/bin/hobbitclient-aix.sh, I see that hardcoded into the script is "vmstat 300 2" So do I need to update that to reflect 1 minute as well (i.e. vmstat 60 2)? Or is this by design? Are there others that might need to change that I don't know about? Is the way I am going about this wrong?
That's an interesting question :-)
The graph DBs that vmstat feeds data into (the RRD files) are constructed in such a way that a 5-minute interval is what makes sense. So running them with anything else is really just a waste of resources.
(I do have a patch here from a user that would allow you to configure the RRD files for different data-collection frequencies, but that has not been merged yet - primarily due to me being overloaded).
So no - you shouldn't change that vmstat command. But it is bad design on my part to assume that the client polling period would always be 5 minutes - it's perfectly valid to run the client checks differently.
I'll think about what's the most sensible solution. It probably would be to only start the vmstat command if one isn't running; that does assume that you will run the client scripts *at least once* every 5 minutes.
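A minimal sketch of the "only start vmstat if one isn't running" idea, assuming the client records the PID of its background vmstat in a file. The pidfile convention and both function names are illustrative assumptions, not how hobbitclient-aix.sh actually works; the sketch is in Python for brevity and assumes a POSIX host.

```python
import os


def collector_is_running(pidfile):
    """Return True if the PID recorded in pidfile belongs to a live process.

    Assumption: the client writes the PID of its background "vmstat 300 2"
    into pidfile whenever it starts one (hypothetical convention).
    """
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or garbled pidfile: treat as "not running".
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_start_vmstat(pidfile):
    """Start a new long-running vmstat only when none is already collecting."""
    return not collector_is_running(pidfile)
```

With a 1m launch interval, most runs would find the previous 300-second vmstat still alive and skip starting another. As noted above, this only works if the client scripts run at least once every 5 minutes.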
Henrik
On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:
I wanted to move from a 5 minute interval on all my clients to a 1
minute interval.
In all my years of Systems Administration, things that run every
minute all the time usually end up being a "Bad Idea".
How will a smaller sampling period improve the service you provide?
the script is "vmstat 300 2" So do I need to update that to
reflect 1 minute as well (i.e. vmstat 60 2)? Or is this by design? Are there others that might need to change
that I don't know about? Is the way I am going about this wrong?
That's an interesting question :-)
My job requires data be useful, not just interesting. That is not to
say there aren't jobs where useful is good enough.
The graph DBs that vmstat feeds data into (the RRD files) are constructed in such a way that a 5-minute interval is what makes sense. So running them with anything else is really just a waste of resources.
With the stock larrd/hobbit RRD definitions you are correct. He'll
only use one of the five, and whine about the timestamp of the other
four.
(I do have a patch here from a user that would allow you to configure the RRD files for different data-collection frequencies, but that has not been merged yet - primarily due to me being overloaded).
The design goal of larrd, (I can't speak for Henrik and hobbit/RRD)
was capacity planning and trending. 5m samples are more than
adequate for that activity.
IMO, sampling at a high frequency implies real-time performance
analysis, and I've always felt that outside the scope of capacity
planning and trending. EG. We don't run sendmail in debug all the
time . . . .
All that being said, those long term trends are very helpful for
problem resolution. One can compare a single 5m sample against an
aggregate of 5m samples and determine if things are 'normal'. But
the art of comparing all the activity within a single 5m sample for
normal is very very difficult.
So no - you shouldn't change that vmstat command. But it is bad design on my part to assume that the client polling period would always be 5 minutes - it's perfectly valid to run the client checks differently.
That's my design you inherited and because of the complexity of the
parts, I think it is a very solid design. To become flexible enough
to handle different sampling rates, the server would need to know the
frequency of the tests. And then changing the RRD in the future is
'almost' impossible (very difficult at the least). And I've never
seen what happens to 1.5 years of data when you start messing with
the RRD.
In the end, I think you'd get the worst of both worlds.
I'll think about what's the most sensible solution. It probably
would be to only start the vmstat command if one isn't running; that does
assume that you will run the client scripts *at least once* every 5 minutes.
I disagree. If real-time performance analysis is needed, I would
pick other tools -- "vmstat 5" works for me;) Or construct/fork
the client agent specifically designed for such a task, and run it on
an as-needed basis.
Then try and decide for real time perf analysis if the sampling rate
should be 5s or 1m ;)
scott
In message <139E0D6D-28B0-4576-A033-3525AD2970CA at PacketPushers.com>, Scott Walters writes:
}
}> On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:
}>>
}>> I wanted to move from a 5 minute interval on all my clients to a 1
}>> minute
}>> interval.
}>>
}
}In all my years of Systems Administration, things that run every
}minute all the time usually end up being a "Bad Idea".
}
}How will a smaller sampling period improve the service you provide?
We run pretty much all of our big brother tests every minute. On our new hobbit servers, we're running them at the default intervals.
BB shows us that our primary name server is going out for less than a minute, about every 62 minutes. Hobbit is missing most of those outages, although the longer "xxxx events received in the last xxx minutes" is what helped us spot the problem, as a whole bunch of machines' services don't respond well when our primary name server is out, and having a mass of servers go yellow then green, in unison, is sort of eye catching.
Tracy J. Di Marco White Information Technology Services Iowa State University
We run pretty much all of our big brother tests every minute. On our new hobbit servers, we're running them at the default intervals.
BB shows us that our primary name server is going out for less than a minute, about every 62 minutes. Hobbit is missing most of those outages, although the longer "xxxx events received in the last xxx minutes" is what helped us spot the problem, as a whole bunch of machines' services don't respond well when our primary name server is out, and having a mass of servers go yellow then green, in unison, is sort of eye catching.
So hobbit with the xxx events (running every 5m) did provide enough
information to indicate an intermittent problem with DNS?
Things running every 5m will collide with a problem that happens for
a minute frequently enough to 'show up on the radar'
But every site has different requirements. It's just been my
experience that sampling more frequently than 5m hits the knee-bend
of diminishing returns. It also increases the potential for state
changes, which chews up the filesystem with the history info.
ymmv
scott
In message <F098822C-19A2-42CC-B6BA-2AB4E71D18BA at PacketPushers.com>, Scott Walters writes:
}>
}> We run pretty much all of our big brother tests every minute. On
}> our new hobbit servers, we're running them at the default intervals.
}>
}> BB shows us that our primary name server is going out for less than
}> a minute, about every 62 minutes.
}> Hobbit is missing most of those
}> outages, although the longer "xxxx events received in the last xxx
}> minutes" is what helped us spot the problem, as a whole bunch of
}> machines' services don't respond well when our primary name server
}> is out, and having a mass of servers go yellow then green, in
}> unison, is sort of eye catching.
}
}So hobbit with the xxx events (running every 5m) did provide enough
}information to indicate an intermittent problem with DNS?
Hobbit's non-green page, with last xxx events, gave us a large enough view that we could see all the machine services going yellow at the same time dns went red. We're monitoring a bit over 260 machines with a whole lot of different services, so there's often something going red or yellow. With BB's older default of the last 25 events, there wasn't ever that much on screen to notice a group of swings to yellow, then back to green.
}Things running every 5m will collide with a problem that happens for
}a minute frequently enough to 'show up on the radar'
Sure, but we'd see up to 13 hours between dns 'red', when BB would get several in that period.
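A rough way to reason about how often a fixed polling interval lands inside a short, periodic outage. This is an idealized model with no scheduling jitter; the 50-second outage length, the 24-hour horizon, and the function name are assumptions for illustration, not measurements from this site.

```python
def detected_outages(poll_interval, outage_period, outage_length, horizon, phase=0):
    """Count outages that at least one poll instant falls inside.

    All arguments are in seconds. Polls happen at phase, phase + poll_interval,
    and so on. Outage k occupies [k * outage_period, k * outage_period + outage_length).
    Returns (outages_seen, outages_total).
    """
    hits = 0
    total = 0
    for start in range(0, horizon, outage_period):
        total += 1
        # Number of poll intervals needed to reach the start of this outage
        # (ceiling division), never fewer than zero.
        polls_before = -(-(start - phase) // poll_interval)
        first_poll = phase + max(polls_before, 0) * poll_interval
        if first_poll < start + outage_length:
            hits += 1
    return hits, total


if __name__ == "__main__":
    day = 24 * 3600
    for poll in (300, 60):
        seen, total = detected_outages(poll, 62 * 60, 50, day)
        print(f"poll every {poll}s: saw {seen} of {total} outages")
```

In this model, with phase 0, a 5-minute poll coincides with the outage only once every 310 minutes, while a 1-minute poll sees every one. Real schedulers drift, so actual gaps between detections can be longer or shorter than the model suggests.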
I haven't changed hobbit yet to 1 minute checks. I've even made an explicit explanation that I wasn't planning to shorten it to 1 minute checks when we officially switched over, and that was agreed to. However, with the fact that the 1 minute checks did actually make a difference in tracking down and solving the problem with DNS, I may yet have to work on that change. We'll see what kind of feedback I get after today. Even then, the only thing I'd really be willing to shorten to that frequency of checks are the remote checks, over the network.
}But every site has different requirements. It's just been my
}experience that sampling more frequently than 5m hits the knee-bend
}of diminishing returns. It also increases the potential for state
}changes, which chews up the filesystem with the history info.
I thought it was unnecessary when I originally brought BB into production years ago, but it was one of the requirements I ended up with to sell switching to BB. Some things can't be checked every minute, I have raid checks that can take more than a minute to run.
Tracy J. Di Marco White Information Technology Services Iowa State University
Sure, but we'd see up to 13 hours between dns 'red', when BB would get several in that period.
I haven't changed hobbit yet to 1 minute checks. I've even made an explicit explanation that I wasn't planning to shorten it to 1 minute checks when we officially switched over, and that was agreed to. However, with the fact that the 1 minute checks did actually make a difference in tracking down and solving the problem with DNS, I may yet have to work on that change.
It sounds like your shop is so tidy that you would have found it and
fixed it anyway. It was just a little brighter with the shorter
interval.
We'll see what kind of feedback I get after today. Even then, the only thing I'd really be willing to shorten to that frequency of checks are the remote checks, over the network.
I believe hobbit has great 're-test' logic. So if it is down, it
will test more frequently . . . .
I thought it was unnecessary when I originally brought BB into production years ago, but it was one of the requirements I ended up with to sell switching to BB. Some things can't be checked every minute, I have raid checks that can take more than a minute to run.
On the client, 1m samples are opening Pandora's Box . . . .
scott
In message <7921F8E0-E87F-4A95-895C-A38ABB1D7831 at PacketPushers.com>, Scott Walters writes:
}>
}> Sure, but we'd see up to 13 hours between dns 'red', when BB would
}> get several in that period.
}>
}> I haven't changed hobbit yet to 1 minute checks. I've even made
}> an explicit explanation that I wasn't planning to shorten it to
}> 1 minute checks when we officially switched over, and that was
}> agreed to. However, with the fact that the 1 minute checks did
}> actually make a difference in tracking down and solving the problem
}> with DNS, I may yet have to work on that change.
}
}It sounds like your shop is so tidy that you would have found it and
}fixed it anyway. It was just a little brighter with the shorter
}interval.
I agree we would have found it. I'm amused at the thought of our shop being tidy, but thanks.
}> We'll see what
}> kind of feedback I get after today. Even then, the only thing I'd
}> really be willing to shorten to that frequency of checks are the
}> remote checks, over the network.
}
}I believe hobbit has great 're-test' logic. So if it is down, it
}will test more frequently . . . .
And that's how I sold the 5 minute testing interval. And how I think I'll not have to shorten the interval for hobbit now.
}> I thought it was unnecessary when I originally brought BB into
}> production years ago, but it was one of the requirements I ended
}> up with to sell switching to BB. Some things can't be checked
}> every minute, I have raid checks that can take more than a
}> minute to run.
}
}On the client, 1m samples are opening Pandora's Box . . . .
Everyone really needs to consider what all the effects are of the frequency of the monitoring.
Tracy J. Di Marco White Information Technology Services Iowa State University
On Thu, 15 Dec 2005, Tracy J. Di Marco White wrote:
Everyone really needs to consider what all the effects are of the frequency of the monitoring.
I understand a more frequent sampling period is an easy sell, but I don't think it is a valid one when the rubber meets the road.
Plus, I try and make sure all technical decisions have a business reason.
I dislike technology and its advocates that try and drive the business. I guess I've gotten old and 'kewl' is no longer good enough ;)
Businessmen don't think in terms of technology. It's our job as professionals to make technology help the business. If we cannot clearly articulate how technology (or architecture changes) can help the business, it probably won't.
Unfortunately, mailing lists are not the best forum for these discussions.
-- Scott Walters -PacketPusher
Scott,
I wanted to respond to you regarding technical reasons on a decreased interval.
I agree that in most cases where people would want an increase in frequency it would be for real-time performance analysis, whereas hobbit/bb are more for capacity planning/trending.
In my business, we deal with receiving all financial data and pushing that data around servers. A graph would have little data until the stock market opens, then the floodgates open :-) The graph then fluctuates, with another surge at market close.
The interval being at 1 minute for specifically CPU and network is important to us for capacity planning purposes because during, say, market open, there are huge peaks that a 5m interval doesn't catch. We need to plan capacity based around those spikes, as those are indicative of future market trends in stock volume. It's not that the 5m interval does nothing, indeed it is helpful, but from a business perspective, a 1m interval allows us to plan capacity because it helps us catch the spikes that we want to see.
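A small illustration of why a 5m sample can hide a spike, using made-up CPU percentages rather than real measurements. It assumes the 5m value is simply the average over the interval, which is what a "vmstat 300 2" style report gives.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Five hypothetical 1-minute CPU% samples around market open.
one_minute_cpu = [12, 15, 96, 18, 14]

five_minute_average = mean(one_minute_cpu)  # what a single 5m sample would record
one_minute_peak = max(one_minute_cpu)       # what 1m sampling would reveal

if __name__ == "__main__":
    print(f"5m average: {five_minute_average:.1f}%")
    print(f"1m peak:    {one_minute_peak}%")
```

Here the 5m figure is 31%, which looks comfortable, while one of the minutes inside it ran at 96%.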
So something like a low-interval cpu/network column would be beneficial. Those tests could use separate rrd files etc...
I recently integrated mrtg into hobbit. I assume that the 5m interval "issue" (not really an issue I know) exists with it as well since it utilizes the same rrd structure? Or can I set the interval of mrtg to be 1 minute? That would solve my networking interval problem.
Anyway, I hope I have explained the business reason well enough, feel free to ask any questions. I feel that while not all circumstances are ideal for a 1m polling sample, there are some situations where this is ideal.
-Jeff
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
Sorry, one more thing (don't mean to add to message volume)
I discovered that if I updated the hobbitlaunch.cfg to have mrtg start at 1m intervals, AND specified Interval: 1 in the mrtg.cfg file, hobbit handles it just fine (draws the graphs with the correct 1m delineations and updates accordingly).
-Jeff
This helps *immensely*. Now we'll be able to justify shiny new gear
to management to reliably provide an IT infrastructure capable of
meeting the long term growth of trade volumes.
On Dec 23, 2005, at 10:19 AM, Jeff Newman wrote:
servers. a graph would have little data until the stock market
opens, then the floodgates open :-) The graph then fluctuates with another surge at market close.
gotcha
The interval being at 1 minute for specifically CPU and network is
important to us for capacity planning purposes because during, say, market open,
there are huge peaks that a 5m interval doesn't catch. We need to plan capacity based
around those spikes, as those are indicative of future market
trends in stock volume. It's not that the 5m interval does nothing,
indeed it is helpful, but from a business perspective, a 1m
interval allows us to plan capacity because it helps us catch the
spikes that we want to see.
Absolutely. I am glad to hear you are aware you must plan for the peaks.
Busy doesn't mean slow.
The server stats are generally only 1/2 the equation. They are the
impact on the machine. Ideally, for these types of situations, you
are also able to measure the load E.G. trade volumes and their
average execution times.
Knowing the RPMs of your motor doesn't tell you your MPH. If you
could see/prove that when CPU is 100% execution times can grow
outside of SLAs, it's easier to convince management you need a bigger/
better environment and/or testing/QA/integration.
I hear there's a few nickels on Wall Street ;)
So something like a low-interval cpu/network column would be
beneficial. Those tests could use separate rrd files etc...
I am still going to argue this isn't the right way to measure the
data in your environment to provide the information you are looking for.
RRD makes the presumption the older data gets, the less important
it is. In your case that is *not true*. Each 'peak' is a set of
data where the granularity needs to be preserved. So even if the RRA
gets configured to keep 1m samples, which might help 'see' the last
2 days or so of peaks, it won't help when you want to review the
data set of Black Monday. Those peaks will have been averaged down.
One cannot assume causation of UNIX statistics and performance in
the business environment. If you need to know your servers will
handle 5 million trades in 5 minutes, you need to throw 5 million trades
at the boxes and see what happens. "If it ain't tested, it doesn't
work."
When environments reach bottlenecks, it's impossible to say what
the real peak is. If your CPU is at 100%, one cannot know (without
testing) what the real demand for CPU is . . . .
It's always the code/SQL/CICS anyway ;)
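A toy model of the "averaged down" effect. It is not rrdtool itself, just plain Python collapsing made-up 1m samples into 5m buckets. It also shows a MAX-style consolidation, which rrdtool offers alongside AVERAGE; whether the stock larrd/hobbit definitions include a MAX archive is not something this sketch assumes.

```python
def consolidate(samples, per_bucket, func):
    """Collapse consecutive groups of per_bucket samples into one value each."""
    return [
        func(samples[i:i + per_bucket])
        for i in range(0, len(samples), per_bucket)
    ]


def average(values):
    return sum(values) / len(values)


# Ten hypothetical 1-minute CPU% samples containing two short peaks.
one_minute = [20, 22, 98, 25, 20, 21, 19, 23, 97, 20]

averaged = consolidate(one_minute, 5, average)  # what an AVERAGE archive keeps
peaks = consolidate(one_minute, 5, max)         # what a MAX archive would keep

if __name__ == "__main__":
    print("5m AVERAGE:", averaged)
    print("5m MAX:    ", peaks)
```

The averaged buckets come out at 37 and 36, so both peaks disappear. Keeping the maximum preserves their height but still loses where in the bucket they fell and how long they lasted, which is the granularity concern raised above.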
I recently integrated mrtg into hobbit. I assume that the 5m
interval "issue" (not really an issue I know) exists with it as
well since it utilizes the same rrd structure? Or can I set the
interval of mrtg to be 1 minute? That would solve my networking
interval problem.
But that is only one sample per minute. For your application, you
need something *much* more granular.
Anyway, I hope I have explained the business reason well enough,
feel free to ask any questions. I feel that while not all
circumstances are ideal for a 1m polling sample, there are some situations where this is ideal.
You have legitimate business needs for sure, and an idea for a
feature which would be very *useful*.
A high interval/sampling for 'stress testing' impact with the data
being preserved.
That would be a great addition to hobbit. I am not sure if RRD is the
right backend, but it might work if the solution is clever.
I'll let it rattle around . . . .
scott
On 12/13/05, Scott Walters <scott at packetpushers.com> wrote:
In all my years of Systems Administration, things that run every minute all the time usually end up being a "Bad Idea".
How will a smaller sampling period improve the service you provide?
It can be a bad idea sometimes, others not (for example, the reply from the person catching intermittent problems with BB running every minute).
A smaller sampling period can show things at a finer granularity. For example, a process kicks off and 5 minutes later you see 100 errors (I'm keeping things generic for illustrative purposes). Were those 100 errors in the first minute? The last? Constantly throughout the 5 minutes?
I'm not saying you're wrong, simply pointing out that it's not as black and white as you're making it.
My job requires data be useful, not just interesting. That is not to
say there aren't jobs where useful is good enough.
Something being just interesting initially can sometimes uncover problems that you didn't see before.
The graph DBs that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes sense. So running them with anything else is really just a waste of resources.
With the stock larrd/hobbit RRD definitions you are correct. He'll only use one of the five, and whine about the timestamp of the other four.
Firstly, can you explain your comment in more detail? Secondly, I'm confused as to why you would state that I would "whine" about anything when you have no basis for a conclusion to that effect. It seems to be a rather pointed comment in a discussion that hasn't involved the use of language that would dictate a response like that.
The design goal of larrd, (I can't speak for Henrik and hobbit/RRD)
was capacity planning and trending. 5m samples are more than adequate for that activity.
IMO, sampling at a high frequency implies real-time performance analysis, and I've always felt that outside the scope of capacity planning and trending. EG. We don't run sendmail in debug all the time . . . .
All that being said, those long term trends are very helpful for problem resolution. One can compare a single 5m sample against an aggregate of 5m samples and determine if things are 'normal'. But the art of comparing all the activity within a single 5m sample for normal is very very difficult.
That is a very good point you make. There is a difference between real-time analysis and capacity planning/trending. I don't, however, think that it is that far outside of hobbit's scope to try and leverage it for a more pointed analysis. My goal isn't to take every machine in my environment and make them into 1 minute sampling period machines. To have the ability to do so on a machine-by-machine basis could be useful.
That's my design you inherited and because of the complexity of the parts, I think it is a very solid design.
I don't think anyone is really questioning that.
To become flexible enough
to handle different sampling rates, the server would need to know the frequency of the tests. And then changing the RRD in the future is 'almost' impossible (very difficult at the least). And I've never seen what happens to 1.5 years of data when you start messing with the RRD.
In the end, I think you'd get the worst of both worlds.
Honestly, I don't claim to know anything about the way larrd and hobbit are coded in the slightest. There are difficulties to be sure, but part of having a community such as this is to foster ideas and innovation. Just because you don't think it's useful or that it's hard doesn't mean the same is true for everyone out there. What if you could add a high-frequency tag to a server and it generates a separate high-frequency graph for that, as well as updating the normal trend graph for whatever resource you wanted? That way you could choose to look at a graph for resource x every minute for a day, then turn it off. There are lots of ideas and I don't know if mine would even work, but you shouldn't just kill the idea.
I disagree. If real-time performance analysis is needed, I would
pick other tools -- "vmstat 5" works for me;) Or construct/fork the client agent specifically designed for such a task, and run it on an as-needed basis.
There are other tools, yes. I am trying to leverage hobbit. If it's not possible and nobody wants to do it, then yes, I'll look into other tools. By the same token, I don't want to kill my performance by running lots of different monitoring on the server. Hobbit is extraordinarily lightweight on the client (as opposed to other solutions out there), so I think something like this is possible without overloading a client.
Just my two cents.
On Wed, 2005-12-14 at 17:31 -0600, Jeff Newman wrote:
On 12/13/05, Scott Walters <scott at packetpushers.com> wrote:
>> The graph DBs that vmstat feeds data into (the RRD files) are
>> constructed in such a way that a 5-minute interval is what makes
>> sense. So running them with anything else is really just a waste of
>> resources.
> With the stock larrd/hobbit RRD definitions you are correct. He'll
> only use one of the five, and whine about the timestamp of the other
> four.
Firstly, can you explain your comment in more detail? Secondly, I'm confused as to why you would state that I would "whine" about anything when you have no basis for a conclusion to that effect. It seems to be a rather pointed comment in a discussion that hasn't involved the use of language that would dictate a response like that.
I think he was referring to the error message that would be generated because of the extra data compared to the interval configured in the rrd files.
> To become flexible enough
> to handle different sampling rates, the server would need to know the
> frequency of the tests. And then changing the RRD in the future is
> 'almost' impossible (very difficult at the least). And I've never
> seen what happens to 1.5 years of data when you start messing with
> the RRD.
> In the end, I think you'd get the worst of both worlds.
Well, we could take this the other way, and say that we only wanted to run our tests once every 10 minutes, because it was causing too much overhead to run the tests every 5 minutes. How would we deal with that?
I think there are benefits, on the client side, the client should pass the frequency which it is calling the tests at, to them, so that for example, the vmstat test can adjust how long it will run for to either 600 seconds, or 60 seconds, or whatever else is needed.
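A minimal sketch of the client side of this idea: derive the vmstat sampling period from the launch interval instead of hardcoding 300. The thread only shows intervals written like "1m"; the "s" and "h" suffixes, the bare-number form, and the function names are assumptions for illustration.

```python
# Seconds per unit suffix. Only "m" appears in the thread; "s" and "h"
# are assumed here for completeness.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def interval_to_seconds(text):
    """Convert an interval such as "1m" or "300" into a number of seconds."""
    text = text.strip()
    if text[-1:] in _UNIT_SECONDS:
        return int(text[:-1]) * _UNIT_SECONDS[text[-1]]
    return int(text)  # a bare number is taken to be seconds


def vmstat_command(launch_interval):
    """Build the vmstat argument list matching the client launch interval.

    The count of 2 follows the hardcoded "vmstat 300 2": the second report
    covers the interval that has just elapsed.
    """
    seconds = interval_to_seconds(launch_interval)
    return ["vmstat", str(seconds), "2"]


if __name__ == "__main__":
    for interval in ("1m", "5m", "10m"):
        print(interval, "->", " ".join(vmstat_command(interval)))
```

This covers only the client half. The server-side RRD files would still need to be created with a matching step for the extra data points to be of any use.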
Further to that, there is some additional work (which I feel is the real place that all the work is involved, the client side stuff would be quite simple, or so it sounds). On the server, we either need to re-create/adjust the rrd file so that we can insert data more frequently, or else we need to somehow summarise the data before insertion (which means there was no benefit in collecting the data more frequently in the first place).
So, the question becomes, how difficult is it to convert an rrd file, which was initially created to store data-points every 300 seconds, such that we can now store data-points every X seconds?
The second part to this question is how does hobbit know how frequently you want to send your reports? I.e., it can't be based on 'however often they are received', because that value would change very frequently: the reports are done every 300 seconds, but by the time the report is submitted/processed by hobbit, it might be 1 or 2 seconds late/early compared to last time..... Could hobbit server 'learn' the frequency from the client (which is where this is configured anyway), because the client would report that value to the server as a part of the vmstat output?
Of course, even once both of those questions are satisfactorily answered, you (yes, you) need to convince somebody to take the effort to actually do it. Simply seeing the methods, and the interest some people have taken in the possibility might be enough for someone with the coding skills to do it, or, sometimes you might need to provide other incentives (even paid). Let me state clearly right now before I go on, no, don't pay me, I don't have that level of coding skills in C to do it for you, nor the knowledge of RRD. No, I don't know who you might pay, or anything else, I have no interest in any of it, I'm just stating a simple fact of life (you want something done, you can't do it, so find someone who can do it, and motivate them to the point where they will do it whether they want to or not)...
> I disagree. If real-time performance analysis is needed, I would
> pick other tools -- "vmstat 5" works for me;) Or construct/fork
> the client agent specifically designed for such a task, and run it on
> an as-needed basis.
There are other tools, yes. I am trying to leverage hobbit. If it's not possible and nobody wants to do it, then yes, I'll look into other tools. By the same token, I don't want to kill my performance by running lots of different monitoring on the server. Hobbit is extraordinarily lightweight on the client (as opposed to other solutions out there), so I think something like this is possible without overloading a client.
Agreed, it would be nice to not have to run hobbit plus something else when they are both collecting the same data (just a different frequency).
Just my two cents.
Here's a couple of mine also :)
Regards, Adam
Well, we could take this the other way, and say that we only wanted to run our tests once every 10 minutes, because it was causing too much overhead to run the tests every 5 minutes. How would we deal with that?
Same issue, the server needs to know the sampling rate when the RRD
is created.
I think there are benefits, on the client side, the client should pass the frequency which it is calling the tests at, to them, so that for example, the vmstat test can adjust how long it will run for to either 600 seconds, or 60 seconds, or whatever else is needed.
As long as those never need to change, that wouldn't be too bad. But
then you run into the display logic needing help depending on the
granularity of the data/RRAs in the RRDs.
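A minimal sketch of the client-side idea discussed above: derive the vmstat sampling period from the configured client interval instead of hardcoding 300. The helper names (interval_to_seconds, vmstat_command) are hypothetical and are not part of the Hobbit client, and the "1m"/"5m" suffix format is assumed from the clientlaunch.cfg values mentioned earlier in the thread.

```python
import re

# Hypothetical helpers for illustration; not part of the Hobbit client.
_UNITS = {"": 1, "s": 1, "m": 60, "h": 3600, "d": 86400}


def interval_to_seconds(text):
    """Convert an interval such as '1m', '5m', '300' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised interval: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def vmstat_command(interval, reports=2):
    """Build a vmstat argv whose sampling period matches the client interval."""
    seconds = interval_to_seconds(interval)
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return ["vmstat", str(seconds), str(reports)]
```

With this, a 1m client interval would yield "vmstat 60 2" and the stock 5m interval would yield "vmstat 300 2". As noted above, this only solves the client half; the RRD files on the server still expect one value per 300 seconds.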
Further to that, there is some additional work (which I feel is the
real place that all the work is involved, the client side stuff would be quite simple, or so it sounds).
Yes.
So, the question becomes, how difficult is it to convert an rrd file, which was initially created to store data-points every 300 seconds,
such that we can now store data-points every X seconds?
I am not aware of a way to change the granularity of RRAs (the
things inside the RRDs) once they are created. You'd have to rrdtool
export; create a new rrd with different RRA's, then rrdtool import.
Basically export/import the database. You can't even add an RRA to
an existing RRD.
The second part to this question, is how does hobbit know how frequently you want to send your reports? i.e., it can't be based on 'however often they are received', because that value would change very frequently, i.e., the reports are done every 300 seconds, but by the time the report is submitted/processed by hobbit, it might be 1 or 2 seconds late/early compared to last time..... Could hobbit server 'learn' the frequency from the client (which is where this is configured anyway), because the client would report that value to the server as a part of the vmstat output?
Yes, but that is not what makes all this really hard, it's the server
logic. I can think of ways to do it, but it would involve a lot of
changes to the server side parsing, many small client changes,
restructuring/redefining existing rrds, and some potentially hairy
presentation logic to make the server smarter about what to show
based on what is in the RRD. I wrote larrd with Christian, and I
can tell you, this would not be a weekend hack.
Time Series Data (telemetry data) is all about data on regular
intervals. Changing that regular interval is a very significant thing.
or anything else, I have no interest in any of it, I'm just stating a simple fact of life (you want something done, you can't do it, so find someone who can do it, and motivate them to the point where they
will do it whether they want to or not)...
The real trick there is convincing them they want to do it. Forcing
someone to do something might work, but is no good over the long term.
Agreed, it would be nice to not have to run hobbit plus something else when they are both collecting the same data (just a different frequency).
I'm tellin' ya: vmstat 5
scott
First off, I know I can come off terse in e-mail, but they are not
personal attacks.
It can be a bad idea sometimes, others not (for example, the reply
from the person catching intermittent problems with BB running every
minute)
Who ended up stating the anomaly *was* detected in 5m intervals, but
only once every 13h instead of every hour. But I still don't
understand how it will help *you*.
A smaller sampling period can show things in a more granular
aspect. For example, a process kicks off and 5 minutes later you
see 100 errors (I'm keeping things generic for illustrative
purposes). Were those 100 errors in the first minute? The last?
Constantly throughout the 5 minutes?
The 5m averages over a week would be quite low compared to a single
5m plot. From that, one could infer that in the last 5m things have
not been 'normal'.
I'm not saying you're wrong, simply pointing out that it's not as
black and white as you're making it.
And I am disagreeing with you ;) I've been watching the data in
these graphs for many many years now, and I have yet to come across a
situation where having a 1m sampling/graphing period would have
helped me fix/improve something . . .
It's like a story problem with too much information, it makes coming
up with the real answer harder in the end. Most people don't have
time/energy/brains to be able to sift all the data correctly. And if
they do, the 5m samples are good enough.
Most people (including really smart people that are forgetful) can't
deal with an auto-scaling y-axis.
Something being just interesting initially can sometimes uncover
problems that you didn't see before.
Like I said, if you have a job where "interesting" is worthwhile,
wonderful. In my experience, most folks that are running the BB/
hobbit tools are involved in the operational aspects of
infrastructure, not R&D.
With the stock larrd/hobbit RRD definitions you are correct. He'll only use one of the five, and whine about the timestamp of the other four.
Firstly, can you explain your comment in more detail?
RRD interpolates Time Series Data to put a value at a fixed
interval. That is why you hardly ever see integers in the data. If
your sample comes in at 299s, RRD interpolates that value to what it
would have been at 300s. How this is done can be tuned. The default
settings with the RRAs expect data to happen every 300s. RRD will
only insert data one time within that interval.
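To make the fixed-interval behaviour described above concrete, here is a simplified model of it. This is a sketch, not rrdtool's actual code: it assumes a GAUGE-style data source where each sample describes the whole period since the previous update, and it time-weights those samples into fixed 300s steps. The function name consolidate is made up for illustration.

```python
def consolidate(start, samples, step=300):
    """Return {step_end_time: time-weighted average} for each complete step.

    Simplified model: each (timestamp, value) sample is taken to describe
    the whole period since the previous update.
    """
    if start % step != 0:
        raise ValueError("start must fall on a step boundary in this sketch")
    weighted = {}  # step start time -> sum of value * seconds covered
    prev = start
    for t, value in samples:
        if t <= prev:
            # Comparable to rrdtool refusing an update that is not newer
            # than the last one it has seen.
            raise ValueError(f"illegal update time {t}, last update was {prev}")
        cur = prev
        while cur < t:
            bin_start = (cur // step) * step
            seg_end = min(t, bin_start + step)
            weighted[bin_start] = weighted.get(bin_start, 0.0) + value * (seg_end - cur)
            cur = seg_end
        prev = t
    # Only report steps that have been completely covered by updates.
    return {b + step: total / step
            for b, total in sorted(weighted.items())
            if b + step <= prev}
```

For example, consolidate(0, [(299, 10.0), (601, 20.0)]) stores roughly 10.03 for the step ending at 300s rather than the integer 10, because the last second of that step is covered by the later sample. Five one-minute samples inside one 300s step would still produce a single stored value.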
Secondly, I'm confused as to why you would state that I would "whine" about
anything when you have no basis for a conclusion to that effect. It seems to
be a rather pointed comment in a discussion that hasn't involved the use of
language that would dictate a response like that.
"He'll whine" meant rrdtool, not you:
ERROR: illegal attempt to update using time 1042731000 when last update time is 1043099100 (minimum one second step)
That's whining in my book. Sorry you thought I was speaking about you.
That is a very good point you make. There is a difference between real-time analysis and capacity planning/trending. I don't however
think that it is that far outside of hobbit's scope to try and leverage
it for a more pointed analysis.
From a software development standpoint there is a lot to be said
for: "Do one thing and do it well". If architecting the RRD
framework for RTA breaks trending, bad idea.
My goal isn't to take every machine in my environment and make them into 1 minute sampling period machines. To have the
ability to do so on a machine-by-machine basis could be useful.
Which is why I proposed another client collector for this activity.
That's my design you inherited and because of the complexity of the parts, I think it is a very solid design.
I don't think anyone is really questioning that.
You are questioning that. And that is fine. I don't take it
personally you think there may be a better way. I know my way may
not be the best, but I sure know exactly *why* I chose it.
Honestly, I don't claim to know anything about the way larrd and
hobbit are coded in the slightest. There are difficulties to be sure, but
part of having a community such as this is to foster ideas and innovation. Just
because you don't think it's useful or that it's hard doesn't mean the same is
true for everyone out there.
Ahhhhh, to the heart of the matter. Don't suggest ideas in a public
forum if you are not prepared to defend them. Fostering ideas comes
from intelligent discussions. I merely wanted to understand why you
felt you needed a higher sampling rate from a business perspective.
scott
On Wed, Dec 14, 2005 at 05:31:28PM -0600, Jeff Newman wrote:
On 12/13/05, Scott Walters <scott at packetpushers.com> wrote:
The graph DB's that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes sense. So running them with anything else is really just a waste of resources.
With the stock larrd/hobbit RRD definitions you are correct. He'll only use one of the five, and whine about the timestamp of the other four.
Firstly, can you explain your comment in more detail? Secondly, I'm confused as to why you would state that I would "whine" about anything when you have no basis for a conclusion to that effect. It seems to be a rather pointed comment in a discussion that hasn't involved the use of language that would dictate a response like that.
Jeff, I think you misunderstood what Scott wrote. The "he" that is doing the whining is the rrdtool library; if you feed data into an rrd file more often than the minimum interval between updates, it will complain about this in the logs and just ignore the extra updates.
The design goal of larrd, (I can't speak for Henrik and hobbit/RRD) was capacity planning and trending. 5m samples are more than adequate for that activity.
[interesting discussion about using hobbit for capacity-planning vs. real-time analysis snipped] The Hobbit design - as far as the graphing and trending is concerned - was really just to re-implement the LARRD features. So in that respect I have adopted Scott's design goals - even though I wasn't aware what they were. However, that doesn't mean hobbit cannot be leveraged to support other uses for the data we collect about our systems. As Jeff writes:
part of having a community such as this is to foster ideas and innovation. [snip] What if you could add a high-frequency tag to a server and it generates a separate high-frequency graph for that
I've picked up quite a few ideas from the discussions that have occurred here, and Hobbit wouldn't be as good as it is without it. So please - keep those ideas coming, even though they might seem to be "off-topic" or just plain weird. There's no guarantee that I'll use any of it, but it is still interesting to discuss.
I actually think that Hobbit could support both uses, e.g. in the way that Jeff suggests with a special high-frequency graph, in addition to the normal ones. Hobbit does have several building-blocks that you could use to implement this:
- a method for the client to send data to the Hobbit server for processing without affecting the status display (the "data" messagetype)
- a plugin-mechanism where hobbit "worker modules" can pick up these data and process them
- a simple unix-pipe can be used to feed data into the normal graph handling module
E.g. one way Jeff could get his real-time graph goes like this:
- On the client, run a job every minute to grab the data and send it to the Hobbit server using $BB $BBDISP "data $MACHINE.xdata ...
- On the Hobbit server, write a Hobbit module that grabs messages off the Hobbit "data" channel. This really just means reading messages from stdin - each message begins with a "@@data" line, and ends with a "@@" line. You can easily then pick out those messages that are "xdata" sent by the once-a-minute job.
- This module then feeds the message into an RRD file. If it's one of the standard tests (e.g. disk), you can just change the "xdata" into "disk" and feed it into a child process running the normal hobbitd_rrd program. Start hobbitd_rrd for this purpose with a different BBRRDS setting, so the RRD files go into a separate directory (perhaps on a RAM disk, if you are really updating once a minute).
What's missing then is to get the RRD file created in a way so that it will accept such frequent updates, and perhaps only store the last 1 or 2 hours of data.
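The server-side module in the steps above could start out as something like the following sketch. It relies only on the framing stated above (a message begins with a "@@data" line and ends with a "@@" line). Everything else is an assumption made for illustration: the function names are invented, and the filter assumes the first body line is the "data HOST.xdata" line exactly as the client sent it. Rewriting the message for hobbitd_rrd is not shown.

```python
import io


def read_messages(stream):
    """Yield (header, body_lines) for each '@@data' ... '@@' framed message."""
    header, body = None, []
    for raw in stream:
        line = raw.rstrip("\r\n")
        if header is None:
            # Skip anything until the next message start.
            if line.startswith("@@data"):
                header, body = line, []
        elif line == "@@":
            yield header, body
            header = None
        else:
            body.append(line)


def is_test(body, test="xdata"):
    """Assumption: the first body line looks like 'data HOST.<test> ...'."""
    words = body[0].split() if body else []
    return len(words) >= 2 and words[0] == "data" and words[1].endswith("." + test)


def select(stream, test="xdata"):
    """Return only the messages sent for the given test name."""
    return [(h, b) for h, b in read_messages(stream) if is_test(b, test)]


# Made-up sample input; real header lines carry more fields than shown here.
sample = io.StringIO(
    "@@data first\n"
    "data www,example,com.xdata\n"
    "value 1\n"
    "@@\n"
    "@@data second\n"
    "data www,example,com.disk\n"
    "@@\n"
)
picked = select(sample)
```

In a real module the stream would be sys.stdin instead of the sample text, and each selected message would be handed on to whatever writes the RRD files.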
So you'll have to dig into the "rrdtool create" command to get the RRD file set up correctly, before you start feeding data into it.
Regards, Henrik
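As a starting point for the "rrdtool create" step, this sketch builds the argument list for an RRD that accepts one update per minute and keeps only about two hours of data. The file path, data-source name, heartbeat of twice the step, and the single AVERAGE archive are all assumptions chosen for illustration, not Hobbit's stock definitions. The sketch only builds the arguments and does not run rrdtool.

```python
def rrd_create_args(path, ds_name, step=60, keep_hours=2, ds_type="GAUGE"):
    """Build an 'rrdtool create' argv for a short-lived, high-frequency RRD."""
    if step <= 0 or keep_hours <= 0:
        raise ValueError("step and keep_hours must be positive")
    rows = keep_hours * 3600 // step   # one row per step, e.g. 120 rows for 2h at 60s
    heartbeat = 2 * step               # assumed tolerance for late updates
    return [
        "rrdtool", "create", path,
        "--step", str(step),
        # DS:name:type:heartbeat:min:max (U = unknown/unbounded)
        f"DS:{ds_name}:{ds_type}:{heartbeat}:U:U",
        # RRA:CF:xff:steps:rows -> keep every step, unconsolidated
        f"RRA:AVERAGE:0.5:1:{rows}",
    ]


# Hypothetical file and data-source names.
args = rrd_create_args("xdata/vmstat.rrd", "cpu_idl")
```

The resulting list can be passed to subprocess.run(args, check=True) on a machine that has rrdtool installed. Keeping only one small archive also fits the suggestion of putting these files on a RAM disk.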
participants (5)
- gendalia@iastate.edu
- henrik@hswn.dk
- jeffnewman75@gmail.com
- mailinglists@websitemanagers.com.au
- scott@PacketPushers.com