Peeps
I have both Solaris and Linux servers where a large or long-running process causes PROC matching to fail. Here are some examples:
7701 1 root Feb 28 S 24 0.0 00:00:00 0.0 572 2692 /sbin/agetty -L 9600 ttyS0 vt102
7702 1 root Feb 28 S 23 0.0 00:00:00 0.0 576 2692 /sbin/agetty -L 9600 ttyS1 vt102
7704 1 named Feb 28 S 18 2.4 1-08:59:39 4.4 270500 412784 /usr/sbin/named -u named -f
26498 3293 root 12:47:46 S 14 0.0 00:00:00 0.0 468 2676 sleep 180
This is on Linux. Note the longer-than-a-day TIME column that pushes the columns after it to the right.
The following is on Solaris 9:
11201 11199 n101649 12:38:54 S 59 0.0 0:00 0.0 1000 1144 vmstat 300 2
11202 1 n101649 12:38:54 S 59 0.0 0:00 0.0 968 1104 sh -c iostat -dxsrP 300 2 1>/tmp/xymon_iostatdisk.redacted
3244 2965 root Feb_16 S 59 0.0 5:20 0.1 7104 18736 /opt/OV/lbin/eaagt/opcle -std
3245 2965 root Feb_16 S 59 0.0 1:18 0.1 6376 20960 /opt/OV/lbin/eaagt/opcmona
3253 1 root Feb_16 S 59 0.9 1-10:46:45 0.8 58168 59632 /usr/local/sbin/named -f
Solaris "ps" output allows more characters for TIME than Linux. However, in this case the memory columns (RSS and VSZ) are wider than expected, pushing a couple of digits over into the process name area.
It seems that Xymon is parsing these based on fixed column sizes, defined for each OS. The result of these particular examples is that Xymon fails to match on the process name. Instead, I need to use match strings like so:
PROC "%^(\d* |^)/usr/local/sbin/named(\s*$|\s)" 1 1 "TEXT=/usr/local/sbin/named"
or
PROC "%^(\d* |^)/usr/sbin/named(\s*$|\s)" 1 1 "TEXT=/usr/sbin/named"
I guess this email is part "am I doing something wrong", part "does anyone have a better idea", and part feature request (for more awk-like positional matching).
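For what it's worth, the awk-like positional matching mentioned above could look roughly like the sketch below. It is not anything Xymon does today. The helper name `split_ps_line` and the field count of 12 are assumptions based on the twelve-column listings in this message. It also assumes that no field before the command contains whitespace, which holds for the Solaris "Feb_16" start time but not for the Linux "Feb 28" one shown earlier.

```python
from typing import List


def split_ps_line(line: str, n_fields: int = 12) -> List[str]:
    """Split a ps line on runs of whitespace, keeping the command intact.

    Hypothetical helper: n_fields is the number of columns in the listing,
    with the command (args) assumed to be the last one.
    """
    # Limit the number of splits so that spaces inside the command
    # do not break it into extra fields.
    return line.split(None, n_fields - 1)


# Solaris 9 line from this message; the wide TIME value does not matter here
# because fields are counted, not measured by character position.
line = "3253 1 root Feb_16 S 59 0.9 1-10:46:45 0.8 58168 59632 /usr/local/sbin/named -f"
fields = split_ps_line(line)
command = fields[-1]
print(command)
```

Counting fields instead of character positions sidesteps the column drift, at the cost of breaking on any field that legitimately contains a space.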
Cheers,
Jeremy
Close... AFAIK, it's actually looking for the command column's name in the header (the first line of a given listing), and then keying off of that position for every row. When a row's columns don't line up with the header, xymond_client examines the wrong substring. So it's dynamic and static :)
see: xymon-4.3.7/xymond/client/solaris.c:66
unix_procs_report(hostname, clienttype, os, hinfo, fromline, timestr,
"CMD", "COMMAND", psstr);
and xymond_client.c:958 onward
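To make the failure mode concrete, here is a rough Python illustration of header-offset lookup. It is a simplification for discussion, not a port of the xymond_client code. The function name `command_by_header_offset` and the column widths in `FMT` are made up for the example; only the "CMD"/"COMMAND" header names and the sample values come from this thread.

```python
import re


def command_by_header_offset(header, line, names=("CMD", "COMMAND")):
    """Return the part of `line` starting at the header's command column."""
    for name in names:
        offset = header.find(name)
        if offset != -1:
            return line[offset:]
    raise ValueError("no command column found in header")


# Assumed fixed widths: the header and rows line up only while every value fits.
FMT = "{:>5} {:>5} {:<8} {:>8} {:>8} {}"
header = FMT.format("PID", "PPID", "USER", "STIME", "TIME", "COMMAND")
short = FMT.format("3245", "2965", "root", "Feb_16", "1:18",
                   "/opt/OV/lbin/eaagt/opcmona")
# "1-10:46:45" is two characters wider than the TIME column allows,
# so everything after it shifts two characters to the right.
wide = FMT.format("3253", "1", "root", "Feb_16", "1-10:46:45",
                  "/usr/local/sbin/named -f")

aligned_cmd = command_by_header_offset(header, short)
shifted_cmd = command_by_header_offset(header, wide)
print(repr(aligned_cmd))
print(repr(shifted_cmd))

# A plain anchored match fails on the shifted row, while the workaround
# pattern from the original message tolerates the stray leading digits.
plain = re.search(r"^/usr/local/sbin/named(\s*$|\s)", shifted_cmd)
workaround = re.search(r"^(\d* |^)/usr/local/sbin/named(\s*$|\s)", shifted_cmd)
print(plain is None, workaround is not None)
```

In this sketch the shifted row starts with the tail end of the TIME value, which is the same kind of leading debris the `(\d* |^)` prefix in the PROC rules is working around.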
On the SunOS box I've got, the ps command (xymonclient-sunos.sh) is providing the following field list, and below that is some of the output it gets wrong.
I suppose one quick fix if you're getting this a lot might be to manually change the order of the fields in the client script to "ps -A -o args,pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz"
I'm not sure if other processing is going on, but the only drawback might be slightly odd-looking ps output in your client logs.
HTH,
-jc
-bash-3.2$ ps -A -o pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz,args | head -1
PID PPID USER STIME S PRI %CPU TIME %MEM RSS VSZ COMMAND
-bash-3.2$ ps -A -o pid,ppid,user,stime,s,pri,pcpu,time,pmem,rss,vsz,args | sort -k8 -r | head
693 666 root Mar_28 S 59 0.1 1-12:12:45 0.0 2720 3424 dovecot-auth -w
160 1 root Mar_28 S 59 0.2 1-04:06:09 0.1 51552 78848 /usr/sbin/nscd
3 0 root Mar_28 S 60 0.1 15:09:32 0.0 0 0 fsflush
482 1 root Mar_28 S 59 0.0 10:10:31 0.1 94008 101432 /opt/local/bin/python /usr/local/sbin/denyhosts.py --daemon --config=/usr/share
6 0 root Mar_28 S 0 0.1 08:18:46 0.0 0 0 vmtasks
461 1 root Mar_28 S 60 0.0 04:19:34 0.0 1744 3048 /usr/lib/nfs/nfsd
95 0 root Mar_28 S 99 0.0 04:02:11 0.0 0 0 zpool-pool
746 666 dovecot Mar_28 S 59 0.0 02:33:29 0.0 11376 12584 pop3-login
1183 666 root Mar_28 S 59 0.0 02:25:22 0.0 2736 3424 dovecot-auth -w
666 1 root Mar_28 S 59 0.0 02:05:38 0.0 2304 3464 /usr/local/sbin/dovecot