I try to identify filesystem "space hogs" via custom scripts I wrote a long time ago when using BB. 99% of my custom stuff is done in PERL.
I use 'du -k' to get the size of all directories in the filesystem. I then cut those results down to only the first and second level directories (but you could go as deep as you want). I store the size of each subdirectory in a small "database". I did this ages ago, and my code uses PERL's "Storable" module to store the accumulated data in a file (my "database"). These days I'd just use Hobbit's easily accessed RRD files. I then use the least_squares_fit() method from PERL's Statistics::Descriptive module to calculate the slope and linear correlation coefficient of the "best fit line" through the size-versus-time samples. The slope tells me how fast each subdirectory is growing/shrinking, and the correlation coefficient tells me how linear that growth/reduction is. I trigger yellow/red conditions based on rate of growth and predicted fill time at the current growth rate, in addition to the standard "95% full = red" test.
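A minimal sketch of the trend calculation described above, in Python rather than the PERL the original scripts use. The function names (subdir_sizes, least_squares_fit, days_until_full) and the sample layout are illustrative assumptions, not the original code; the fit is the ordinary least-squares line, which should match what Statistics::Descriptive computes.

```python
import math
import posixpath


def subdir_sizes(du_output, root, max_depth=2):
    """Parse `du -k` text; keep directories 1..max_depth levels below root."""
    sizes = {}
    for line in du_output.splitlines():
        parts = line.split(None, 1)          # "<kilobytes> <path>"
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        kb, path = int(parts[0]), parts[1]
        rel = posixpath.relpath(path, root)
        if rel in (".", "..") or rel.startswith("../"):
            continue                          # root itself, or outside root
        if rel.count("/") + 1 <= max_depth:
            sizes[path] = kb
    return sizes


def least_squares_fit(xs, ys):
    """Return (slope, intercept, r) of the best-fit line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more (x, y) samples")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r is the linear correlation coefficient; 0.0 if the sizes never change
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r


def days_until_full(free_kb, slope_kb_per_day):
    """Predicted days until the filesystem fills; None if not growing."""
    if slope_kb_per_day <= 0:
        return None
    return free_kb / slope_kb_per_day
```

Here xs would be sample times in days and ys the matching sizes in KB for one subdirectory, so the slope comes out in KB per day. A yellow/red threshold can then be set on the slope, on days_until_full(), or both.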
The above makes it fairly easy to identify which subdirectory is your problem, which is often enough to identify the file/process that is killing you. When it's not, I have a separate test that tries to identify problem files a different way. BB/Hobbit uses 'top' to identify cpu-hogging processes. Files hogging space are often directly tied to processes hogging cpu (runaway process = runaway file in many cases). 'top' identifies the process(es), then "lsof -p <pid>" is used to identify the files that the suspect process has open. Finding a cpu-hogger that has a filespace-hogger open is usually the holy grail you seek.
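A rough sketch of the lsof step, again in Python and not the original script. It assumes lsof's field output mode ("-F" with the f, s and n field letters for descriptor, size and name) instead of the default column layout, because the field output is easier to parse reliably. The helper names are made up for illustration, and the pid is whatever 'top' flagged.

```python
import subprocess


def parse_lsof_fields(text):
    """Parse `lsof -F fsn` output into a list of per-file dicts."""
    files, pid, current = [], None, None
    for line in text.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":                       # start of a process set
            pid, current = int(value), None
        elif tag == "f":                     # start of a file set
            current = {"pid": pid, "fd": value, "size": None, "name": None}
            files.append(current)
        elif current is not None:
            if tag == "s" and value.isdigit():
                current["size"] = int(value)
            elif tag == "n":
                current["name"] = value
    return files


def largest_open_files(text, top=5):
    """Return (name, size) pairs for the biggest open files, largest first."""
    sized = [f for f in parse_lsof_fields(text)
             if f["size"] is not None and f["name"]]
    sized.sort(key=lambda f: f["size"], reverse=True)
    return [(f["name"], f["size"]) for f in sized[:top]]


def open_files_for_pid(pid):
    """Run lsof for one process (assumes lsof is installed and on PATH)."""
    result = subprocess.run(
        ["lsof", "-p", str(pid), "-F", "fsn"],
        capture_output=True, text=True, check=False,
    )
    return largest_open_files(result.stdout)
```

Calling open_files_for_pid() on each cpu-hogging pid and comparing the names against the fast-growing subdirectories is one way to line up the two tests.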
As a "repair" action for Hobbit, I squirreled away 2 GB of disk space in 100 MB chunks for critical filesystems. "dd if=/dev/zero of=/filesystem/DiskSpaceReserve/reserve01 bs=1024 count=102400", then "cp reserve01 reserve02", etc. to build up the reserve. A separate Hobbit "notification script" simply deletes files from this reserve under dire circumstances, after normal email/pager notifications have failed to trigger action by developers/production support people.
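The release side of that reserve might look something like the sketch below. This is Python and an assumption about the logic, not the actual notification script: the release_reserve name, the "reserve" filename prefix check and the bytes_needed parameter are all illustrative.

```python
import os


def release_reserve(reserve_dir, bytes_needed):
    """Delete reserve chunks until at least bytes_needed bytes are freed.

    Only regular files whose names start with "reserve" are touched.
    Returns (bytes_freed, list_of_removed_paths).
    """
    freed, removed = 0, []
    for name in sorted(os.listdir(reserve_dir)):
        if freed >= bytes_needed:
            break
        path = os.path.join(reserve_dir, name)
        if not name.startswith("reserve") or not os.path.isfile(path):
            continue                          # never delete anything else
        size = os.path.getsize(path)
        os.remove(path)
        freed += size
        removed.append(path)
    return freed, removed
```

Releasing one or two chunks at a time, rather than the whole reserve at once, leaves something in hand if the filesystem keeps growing before anyone responds.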
My BB/Hobbit custom scripts tend to get quite involved. Probably too much so, but they're fun for me to write!
From: Gary Baluha [mailto:gumby3203 at gmail.com]
Sent: Monday, August 06, 2007 7:29 AM
To: hobbit at hswn.dk
Subject: Re: [hobbit] Highlights of the 4.3.0 version
< ... snip ... >
One example is disk space. A full filesystem would shut many things down. Apps should not fill a filesystem, but sometimes they do. So my custom Hobbit scripts first scream and scream about low disk space, even analysing things down to specific subdirectories and fast growing files and doing trend analysis. But if their call is not answered, they start freeing up space from a "private reserve" I have set aside to deal with emergencies. So if we experience a sudden unexpected blowup in a filesystem at 3am, Hobbit keeps things running in production until the appropriate people can look into and diagnose the problem. This may not be Utopian behavior, but it sure is practical at 3am!
What sort of trend analysis do your scripts perform? We have a few boxes that are notorious for filling up their disk space, and I haven't yet come up with an idea of how to neatly track exactly what it is that keeps filling up the disk.
< ... snip ...>