Acknowledge issue continues with xymon 4.3.2
I have xymon 4.3.2 installed now
Every 4 days, almost exactly, I start losing the ability to acknowledge some alerts. As time progresses, it gets worse and worse: at first it's random, some can be acknowledged and some can't. Then more and more cannot be acknowledged.
New alerts or existing alerts that were already acknowledged, it doesn't matter.
This is a fairly impacting issue, and others on the list have said they have this same problem
All I have is that find_cookie in lib/rbt.c is not finding the cookie, despite it being visible in the hobbitdboard
2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
2011-04-05 05:23:09 Cookie 47469 not found, dropping ack
2011-04-05 06:38:55 Cookie 86204 not found, dropping ack
2011-04-05 06:41:37 Cookie 86204 not found, dropping ack
This is what my logs start filling up with.
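To see how many distinct cookies are affected and how often each ack is retried, the log lines above can be tallied with a short script. This is only a sketch: the regular expression is written against the exact message text quoted above, and the function name `dropped_cookies` is made up for illustration.

```python
import re
from collections import Counter

# Matches the xymond log message quoted above; the wording is taken
# from the log excerpt, not from the xymond source.
DROPPED_ACK = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Cookie (\d+) not found, dropping ack"
)

def dropped_cookies(log_text):
    """Return a Counter mapping cookie number -> number of dropped acks."""
    return Counter(int(cookie) for _, cookie in DROPPED_ACK.findall(log_text))

sample = """\
2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
2011-04-05 05:23:09 Cookie 47469 not found, dropping ack
2011-04-05 06:38:55 Cookie 86204 not found, dropping ack
2011-04-05 06:41:37 Cookie 86204 not found, dropping ack
"""

counts = dropped_cookies(sample)
for cookie, n in sorted(counts.items()):
    print(f"cookie {cookie}: {n} dropped ack(s)")
print(f"{len(counts)} distinct cookies, {sum(counts.values())} dropped acks")
```

Running it periodically against the real log would show whether the set of unfindable cookies keeps growing between restarts.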
Can anyone on this list point me to at least some starting point to try and solve this? It's seriously impacting my xymon implementation
--
This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.
It's definitely some sort of repeatable "data in memory" corruption. I've noticed that if I restart when the problem first occurs, xymond logs this message while loading the saved chk file:
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
2011-04-05 08:55:32 Too few fields in record - found 6, expected 17
This matches up with the number of Cookies it couldn't find - I am guessing it's missing the cookies in those records
And I get more and more of those messages the longer I wait to restart (i.e. as the acknowledge problem gets worse and worse).
If I restart when I am not showing signs of it not finding cookies, I do not get that message in the xymonlaunch.log - it just works fine and exactly as I expect
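A quick way to check whether the number of rejected checkpoint records really tracks the number of lost cookies is to count the "Too few fields" messages after each restart. The sketch below only parses the message text quoted above; the function name `short_record_summary` is hypothetical, and it makes no assumption about the layout of the chk file itself.

```python
import re
from collections import Counter

# Message wording copied from the xymonlaunch.log excerpt above.
TOO_FEW = re.compile(r"Too few fields in record - found (\d+), expected (\d+)")

def short_record_summary(log_text):
    """Count rejected records, keyed by (fields found, fields expected)."""
    return Counter((int(f), int(e)) for f, e in TOO_FEW.findall(log_text))

# Eight identical lines, as in the excerpt above.
sample = (
    "2011-04-05 08:55:32 Too few fields in record - found 6, expected 17\n" * 8
)

summary = short_record_summary(sample)
for (found, expected), n in sorted(summary.items()):
    print(f"{n} record(s) with {found} of {expected} fields")
```

Comparing this total with the number of distinct cookies reported as "not found" before the restart would confirm or refute the guess that the same records are involved.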
Is there some sort of memory limit that I am hitting? My xymond process takes up 524 MB of memory right now.
Just looking for any steps to take next
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
I'm not expecting some sort of magic patch to fix this tomorrow, I am just looking for some direction to take
So far, I haven't even had an acknowledgement that anyone's read this, other than people who have the same problem as me, whose prescribed options are "fix the problem you can't ack faster."
I'll list things that I have changed from the stock xymon settings in the hopes that Henrik or someone else can say "if you change that, you need to change this or you will most likely have your shm and chk files corrupted"
In xymonserver.cfg:
MAXMSG_STATUS="1036118"
MAXMSG_CLIENT="1036118"
MAXMSG_DATA="1036118"
MAXMSG_NOTES="1036118"
MAXLINE="1036118"
In tasks.cfg:
History disabled
Xymongen disabled
[all others are in their 'default' state, i.e. proxy disabled, xymond enabled]
I have 78 rules in alerts.cfg spread across 8,565 hosts. I've added 14 graphing items in graphs.cfg
It's compiled for i386 Linux.
Previously the binaries were stripped because I installed them via the spec file from the developer's list.
I built the binaries and did a make install instead so they are no longer stripped
I do not get a core file for failing to acknowledge. Eventually no events can be acknowledged at all, and if it gets to that point, the only way to restart xymon is to remove the .chk files [it seems to tolerate 6-20 corrupted items, but with hundreds it will fail to start].
I am just looking for guidance, or something to try. Please let me know.
On Thu, 7 Apr 2011 09:16:01 -0400, "Clark, Sean" <sean.clark at twcable.com> wrote:
> I'm not expecting some sort of magic patch to fix this tomorrow, I am just looking for some direction to take
> So far, I haven't even had an acknowledgement that anyone's read this, other than people who have the same problem as me, whose prescribed options are "fix the problem you can't ack faster."
I've seen your messages, but haven't had a chance to dig into the code to see where the problem is.
> I'll list things that I have changed from the stock xymon settings in the hopes that Henrik or someone else can say "if you change that, you need to change this or you will most likely have your shm and chk files corrupted"
I don't see any config changes you've made that would explain this.
> Every 4 days, almost exactly, I start losing the ability to acknowledge some alerts. As time progresses, it gets worse and worse: at first it's random, some can be acknowledged and some can't. Then more and more cannot be acknowledged.
> 2011-04-05 05:23:09 Cookie 115771 not found, dropping ack
> 2011-04-05 05:23:09 Cookie 54483 not found, dropping ack
How do you ack an event? Are you using the "Acknowledge alert" webpage with the "--no-pin" option (default), or is it ack via email or ...?
Regards, Henrik
But I can say, using the webpage default method produces the same error messages "Cookie not found" -- so I didn't think it would be my method of acknowledging
--
Sean Clark Sr. Engineer, Software ATG Network Operations & Planning Integrated Regional OSS <http://www.twcable.com/DepartmentOverview/AdvancedTechnologyGroup/ATG/NOP/OSS/Network.aspx> sean.clark at twcable.com devaudio <aim://devaudio> Office: (315) 362-3973 cell: (315) 415-2816
On 4/7/11 9:43 AM, "Clark, Sean" <sean.clark at twcable.com> wrote:
$BB $BBPAGE \"xymondack $NUMBER $DELAY $MESSAGE\"";
From a script
Where $BB is
/sw/libexec/hobbit/client/bin]./bb --version
Hobbit version 4.2.0
$BBPAGE is my xymond display running 4.2.3
$NUMBER is the cookie, obtained by using that same hobbit client to run "hobbitdboard fields=hostname,testname,color,acktime,disabletime,cookie,ackmsg,dismsg,lastchange"
$DELAY is typically 120, but settable
$MESSAGE is just text
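For anyone scripting acks the same way, here is a rough sketch of how the command described above could be assembled without going through a shell, which avoids the nested quoting. The function name `build_ack_command` and the display host name are hypothetical; the message layout "xymondack NUMBER DELAY MESSAGE" is the one quoted above. The sketch only builds the argument list and does not send anything.

```python
def build_ack_command(bb_path, display_host, cookie, delay, message):
    """Build an argv list equivalent to: $BB $BBPAGE "xymondack N D MSG"."""
    cookie = int(cookie)   # raises ValueError if the cookie is not numeric
    delay = int(delay)
    if cookie <= 0 or delay <= 0:
        raise ValueError("cookie and delay must be positive integers")
    # Collapse newlines and repeated whitespace so the ack stays on one line.
    text = " ".join(str(message).split())
    return [bb_path, display_host, f"xymondack {cookie} {delay} {text}"]

cmd = build_ack_command(
    "/sw/libexec/hobbit/client/bin/bb",  # client path from the message above
    "xymon.example.com",                 # hypothetical display server
    86204,
    120,
    "Investigating",
)
print(cmd)
```

The resulting list can be handed to `subprocess.run(cmd)` as-is, so the message text never needs shell escaping.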
On 07-04-2011 15:45, Clark, Sean wrote:
> But I can say, using the webpage default method produces the same error messages "Cookie not found" -- so I didn't think it would be my method of acknowledging
Ok, that would have been my next question :-)
It is quite possible that it's a bug in the xymond code. I don't know why it hits you so much, but that is kind of irrelevant.
Inside xymond, the cookies are stored in a data structure called a "red-black tree" ("rbtree" for short). This uses some code that I picked up from someone else - it is used in lots of places, e.g. all of the hosts.cfg configuration is also stored in a similar data structure.
However, the cookie handling is special because cookies are frequently deleted (hosts being removed happens much less frequently). I have had some crashes that I could never really explain when hosts were removed, and I really do suspect the particular bit of code that deletes an entry from the rbtree to be buggy. Therefore, it could very well be that there is a real problem here.
I've come up with a version of xymond.c that eliminates the rbtree code for the cookies. It uses a much less efficient way of looking up the cookies - basically, it will scan through all of the status-log entries that xymond has in memory - but since this only happens when a cookie needs to be renewed, or when xymond receives an ack, it should not put too much extra load on your system. It would be very interesting to hear if this patch on top of 4.3.2 solves the issue; if it does, then I surely know that there is a bug in the rbtree "delete node" code.
Regards, Henrik
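To illustrate the idea of looking cookies up by scanning the status-log entries instead of going through a separate index, here is a minimal sketch. It is not the actual patch to xymond.c: the `StatusLog` class, its field names, and the "expiry" check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class StatusLog:
    # Hypothetical stand-in for one in-memory status-log entry.
    hostname: str
    testname: str
    cookie: int          # assumed: the cookie currently assigned to this status
    cookie_expires: int  # assumed: epoch seconds after which the cookie is stale

def find_cookie(logs: Iterable[StatusLog], cookie: int, now: int) -> Optional[StatusLog]:
    """Linear scan: O(n) per lookup, but there is no separate index to corrupt."""
    for log in logs:
        if log.cookie == cookie and log.cookie_expires > now:
            return log
    return None

logs = [
    StatusLog("web01", "http", 115771, 2000),
    StatusLog("db01", "disk", 86204, 2000),
    StatusLog("db02", "cpu", 47469, 500),   # already expired at now=1000
]

hit = find_cookie(logs, 86204, now=1000)
print(hit.hostname if hit else "not found")
print(find_cookie(logs, 47469, now=1000))
```

Because the scan reads the entries themselves, a cookie that is visible on the board is always findable; the cost is that every lookup touches every entry, which matters only if lookups are frequent.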
Thank you, I will install this posthaste.
Hope your surgery goes well, try not to look at bright lights for a while :-D
Just so anyone else following this thread is aware: the diff is against the trunk version, not the 4.3.2 release, although you could probably adapt it to 4.3.2 if you were so inclined.
participants (2)
- henrik@hswn.dk
- sean.clark@twcable.com