Spurious purple messages
Hi all
Since Friday September 4, I've started receiving "stopped reporting (PURPLE)" messages for all tests on all hosts from one of our Xymon servers.
The host status, as shown in the Main View, is green for all hosts and tests. No purple at all.
The "stopped reporting (PURPLE)" messages are being sent at the same time every day, 1:45PM.
Any advice on how I should track this down?
Thanks
Sounds like network interface overload, perhaps caused by a backup or a massive file transfer.
-----Original Message----- From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Colin Coe Sent: 08 September 2015 02:40 To: xymon at xymon.com Subject: [Xymon] Spurious purple messages
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
Hi Colin
What do the client hosts have in common? In the past I have seen a client overload their storage system, overflowing buffers and exceeding the storage array's ability to process I/O requests. This caused general disk latency, which slowed things down to the point of a purple flood. There was no simple solution to that one except buying more storage, which they did.
Also, check the "serial numbers" on the messages. Are they repeats of an older message, in which case something fishy might be going on in Xymon, or are they new messages every day, meaning Xymon really thinks there is a problem?
Xymon only updates pages every 2 or 5 minutes, depending on the page you are looking at, meaning you could wait up to 7 minutes for the real status to appear. A purple takes 30 minutes to trigger. With some unfortunate and highly improbable timing in whatever is triggering these events, it's possible you might not see the purple. Have you pulled up a "snapshot report" for the exact time of the messages?
Something else unlikely, but possible, is the network. The conn test uses ping, which is ICMP, while the Xymon agent sends its reports over TCP. Is there anything interesting happening on the network at the time?
Regards Vernon
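One way to run the serial-number check described above is to note the date and serial number of each alert and look for serials that turn up on more than one day. The sketch below is only an illustration: the `repeated_serials` helper and the sample values are hypothetical, and how the serial numbers are pulled out of the alert mails is left open.

```python
from collections import defaultdict

def repeated_serials(alerts):
    """Return {serial: [dates]} for serials seen on more than one date.

    alerts is an iterable of (date_string, serial) pairs.
    """
    seen_on = defaultdict(set)
    for day, serial in alerts:
        seen_on[serial].add(day)
    return {serial: sorted(days) for serial, days in seen_on.items() if len(days) > 1}

# Hypothetical sample data, not taken from a real Xymon server.
sample = [
    ("2015-09-04", "1001"),
    ("2015-09-05", "1002"),
    ("2015-09-06", "1001"),
]
print(repeated_serials(sample))
```

An empty result would mean every alert carries a fresh serial, which points to Xymon raising new alerts each day rather than resending old ones.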
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
Hi Vernon
Thanks for the really good info. The message serial numbers are different every day but the messages are sent at the same time (13:45) daily for all tests on all hosts.
The network is not congested nor is the SAN under any kind of pressure.
Interestingly, trying to do the snapshot report gave me "Cannot create output directory".
Thanks again
CC
That might be a permissions thing.
Almost...
Turned out to be SELinux, my old nemesis. :)
Good to know it's not just me that fights with SELinux. :-)
Now that it works, what does the snapshot report reveal at the time the purple alerts go out?
Purples require a "no report" for 30 minutes to trigger. You might want to check all your logs at around 30-35 minutes before the emails.
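The arithmetic behind that suggestion can be sketched as follows. This is a rough illustration, assuming the 30-minute purple delay and the 13:45 alert time mentioned in this thread; the `log_window` helper and the 5-minute slack are made-up names and values for the example.

```python
from datetime import datetime, timedelta

def log_window(alert_time, purple_delay_min=30, slack_min=5):
    """Return (start, end) of the period when the last good report likely arrived."""
    end = alert_time - timedelta(minutes=purple_delay_min)
    start = end - timedelta(minutes=slack_min)
    return start, end

# Alert time taken from the thread (13:45); the date is only illustrative.
alert = datetime(2015, 9, 8, 13, 45)
start, end = log_window(alert)
print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))  # prints 13:10 to 13:15
```

So for alerts sent at 13:45, the log entries of interest sit roughly between 13:10 and 13:15.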
OK, looking at this again. The main view looks fine, but the 'conn' test on every host is a yellow circle with a question mark (unknown) in the snapshot report view since September 4, 2015 at 13:32:42.
September 4, 2015 at 13:32:41 and earlier look fine.
Thanks
That's interesting. No idea what it means, or where to go from here, but it's certainly interesting.
Does it happen at the exact same time every day? Have you tried a ping from the Xymon host to the client at or around the time of the issue, to see if there are any oddities?
Is there anything in the logs?
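A timestamped ping log covering the period around the alerts makes such oddities easier to spot afterwards. The following is a minimal sketch, assuming a Linux host whose `ping` accepts the iputils flags `-c` (count) and `-W` (reply timeout in seconds); the function names are invented for the example and the host names are placeholders.

```python
import subprocess
from datetime import datetime

def ping_once(host, run=subprocess.run):
    """Send one echo request with the system ping; return (timestamp, success)."""
    # -c 1 sends a single echo request; -W 2 waits up to 2 seconds for the reply.
    result = run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    stamp = datetime.now().isoformat(timespec="seconds")
    return stamp, result.returncode == 0

def format_line(host, stamp, ok):
    """Build one log line such as '2015-09-16T13:40:00 host-a alive'."""
    return f"{stamp} {host} {'alive' if ok else 'NO REPLY'}"

def check_hosts(hosts, run=subprocess.run):
    """Ping every host once and return the formatted log lines."""
    return [format_line(h, *ping_once(h, run)) for h in hosts]
```

Calling `check_hosts(["client1", "client2"])` every few seconds between, say, 13:00 and 14:00 and appending the lines to a file gives a record that can be compared against the alert times.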
Hi Vernon,
Yep, very interesting. The purple messages come through every day at about the same time, give or take a minute or so.
Yep, pings work, and the normal "main view" and "all non-green view" work fine.
The logs look fine. I'd really like to get to the bottom of this...
Thanks
CC
Could it be something with the clock on the xymon server? Maybe some cron process to synchronize to a time server?
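A quick way to follow up on the cron idea is to scan the crontab text for entries that fire at a suspicious time, for example 13:15, which is 30 minutes before the 13:45 alerts given the purple delay mentioned earlier in the thread. The sketch below assumes the standard five-field crontab layout (minute, hour, day of month, month, day of week); the helper names and sample entries are hypothetical. It only understands `*`, single numbers and comma lists, and deliberately treats anything else (ranges, steps) as a match so that nothing is silently skipped.

```python
def field_matches(field, value):
    """True if a crontab field matches value; unparsed syntax counts as a match."""
    if field == "*":
        return True
    try:
        return value in {int(part) for part in field.split(",")}
    except ValueError:
        return True  # ranges such as 1-5 or steps such as */5: flag for manual review

def jobs_firing_at(crontab_text, hour, minute):
    """Return the crontab lines whose minute and hour fields match the given time."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # environment assignment, @daily shortcut or malformed line
        if field_matches(parts[0], minute) and field_matches(parts[1], hour):
            hits.append(line)
    return hits

# Hypothetical crontab content for illustration only.
sample = """
MAILTO=root
15 13 * * * /usr/local/bin/time-sync
0 2 * * * /usr/local/bin/backup
"""
print(jobs_firing_at(sample, 13, 15))
```

The same function can be run over each user's crontab and the files under the system cron directories, trying both the alert time and the time 30 minutes before it.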
Hi all
The date/time is set correctly:
timedatectl
      Local time: Wed 2015-09-16 14:23:45 AWST
  Universal time: Wed 2015-09-16 06:23:45 UTC
        RTC time: Wed 2015-09-16 06:23:42
        Timezone: Australia/Perth (AWST, +0800)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a
fping responds with "host is alive", and ping gives its normal successful output.
Does anyone else have any ideas on this? I really don't want to have to blow this server away and start again...
Thanks
On Tue, Sep 15, 2015 at 11:44 PM, Ribeiro, Glauber <glauber.ribeiro at experian.com> wrote:
Could it be something with the clock on the xymon server? Maybe some cron process to synchronize to a time server?
-----Original Message----- From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Colin Coe Sent: Monday, September 14, 2015 22:29 To: Vernon Everett Cc: xymon at xymon.com Subject: Re: [Xymon] Spurious purple messages
Hi Vernon,
Yep, very interesting. The purple messages come through every day at about the same time, give or take a minute or so.
Yep, pings work and the normal "main view" and "all non-green view" work fine.
The logs look fine. I'd really like to get to the bottom of this...
Thanks
CC
On Tue, Sep 15, 2015 at 10:06 AM, Vernon Everett <everett.vernon at gmail.com> wrote:
That's interesting. No idea what it means, or where to go from here, but it's certainly interesting.
Does it happen at the exact same time every day? Have you tried a ping from the Xymon host to the client at or around the time of the issue, to see if there are any oddities?
Is there anything in the logs?
On 14 September 2015 at 15:17, Colin Coe <colin.coe at gmail.com> wrote:
OK, looking at this again. The main view looks fine, but the 'conn' test on every host is a yellow circle with a question mark (unknown) in the snapshot report view since September 4, 2015 at 13:32:42.
September 4, 2015 at 13:32:41 and earlier look fine.
Thanks
On Sat, Sep 12, 2015 at 5:48 PM, Vernon Everett <everett.vernon at gmail.com> wrote:
Good to know it's not just me that fights with SELinux. :-)
Now that it works, what does the snapshot report reveal at the time the purple alerts go out?
Purples require a "no report" for 30 minutes to trigger. You might want to check all your logs at around 30-35 minutes before the emails.
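The arithmetic behind that suggestion can be written out as a small worked example. This is only a sketch: the 13:45 alert time and the 30 to 35 minute look-back come from this thread, while the date and the helper name `log_window` are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Assumption: the alert e-mails arrive at about 13:45 local time (per the thread).
# The date is illustrative only.
ALERT_TIME = datetime(2015, 9, 4, 13, 45)


def log_window(alert_time,
               earliest=timedelta(minutes=35),
               latest=timedelta(minutes=30)):
    """Return (start, end) of the period worth searching in the logs.

    A purple needs roughly 30 minutes without a report, so whatever stopped
    the reports should show up 30 to 35 minutes before the alert.
    """
    return alert_time - earliest, alert_time - latest


start, end = log_window(ALERT_TIME)
print(f"Check logs between {start:%H:%M} and {end:%H:%M}")
```

For a 13:45 alert this gives a window of 13:10 to 13:15.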
On 11 September 2015 at 18:13, Colin Coe <colin.coe at gmail.com> wrote:
Almost...
Turned out to be SELinux, my old nemesis. :)
On Tue, Sep 8, 2015 at 5:37 PM, Vernon Everett <everett.vernon at gmail.com> wrote:
That might be a permissions thing.
On 8 September 2015 at 19:15, Colin Coe <colin.coe at gmail.com> wrote:
Hi Vernon
Thanks for the really good info. The message serial numbers are different every day but the messages are sent at the same time (13:45) daily for all tests on all hosts.
The network is not congested nor is the SAN under any kind of pressure.
Interestingly, trying to do the snapshot report gave me "Cannot create output directory".
Thanks again
CC
So within the 30 mins prior to the purple state, is the conn test time incrementing with normal test times or is it stuck at 1:15PM and not updating? In other words is it an actual purple event or a false positive? If it is a false positive, no idea....
If it is an actual purple event:
Have you run the pings during the 'purple time' to see that comms actually work then? Try TCP-type connections, e.g. script a wget or whatever to run every few seconds. How about a tcpdump through that period? How about running a ps listing every 15 seconds, or vmstat, through the period to see if anything is amiss?
Have you tried eliminating tests/hosts? For example, does it happen with one host and just the conn test? All hosts with only the conn test?
Are there any tests that take a long time (i.e. look at the xymongen and xymonnet stats for the Xymon server) or tests that are blocking, e.g. NFS hard mounts?
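One way to script the "TCP connection every few seconds" idea is sketched below. It is not a Xymon tool: the function names `tcp_probe` and `watch`, the host name and the port are all hypothetical, and it only checks that a TCP connection can be opened, not that the Xymon agent's reports arrive.

```python
import socket
import time


def tcp_probe(host, port, timeout=3.0):
    """Try one TCP connection; return (succeeded, seconds_taken)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        ok = False
    return ok, time.monotonic() - started


def watch(host, port, interval=5.0, rounds=3):
    """Probe repeatedly and print one timestamped line per attempt."""
    for _ in range(rounds):
        ok, took = tcp_probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}:{port} {'ok' if ok else 'FAILED'} {took:.3f}s")
        time.sleep(interval)


# Hypothetical usage: replace with a real client host and an open TCP port.
# watch("client.example.com", 22, interval=5.0, rounds=720)
```

Left running across the period before the alerts, the timestamped lines can be lined up against the tcpdump, ps or vmstat output to see whether anything really drops out.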
Sorry, I wasn't clear. I was wondering if there could be some process set up in cron to adjust the time, which could be causing this (bumping the server time once a day). Just hypothetical, unlikely.
g
Glauber, I can confirm there are no cron jobs or similar that alter the time.
Phil, I can confirm that it is a false positive.
I figure there must be some stale data somewhere but I've not found it. What process sends the notifications? Where does this process get its data?
Thanks all
Hi all
I ended up resolving this by stopping the Xymon service, removing all files in $XYMONTMP and then starting xymon again.
Thanks all for the suggestions
CC
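For anyone repeating this fix, the "remove all files in $XYMONTMP" step could be sketched as below. This is a hedged illustration, not the exact procedure used here: the helper name `clear_files` is hypothetical, it only touches regular files directly inside the directory (an assumption about what "all files" means), and stopping and starting Xymon is left out because the thread does not say how the service is controlled on this server. It defaults to a dry run.

```python
import os
from pathlib import Path


def clear_files(directory, dry_run=True):
    """Delete regular files directly inside `directory`; return their paths.

    With dry_run=True nothing is deleted, the candidates are only listed.
    Subdirectories are left alone (an assumption, not something the thread states).
    """
    targets = sorted(p for p in Path(directory).iterdir() if p.is_file())
    for path in targets:
        if not dry_run:
            path.unlink()
    return targets


def main():
    # XYMONTMP is the variable named in the thread; it must be set in the
    # environment this runs in, and Xymon should already be stopped.
    tmp = os.environ.get("XYMONTMP")
    if not tmp:
        raise SystemExit("XYMONTMP is not set; run this in the Xymon environment")
    for path in clear_files(tmp, dry_run=True):
        print("would remove", path)


# Call main() by hand once Xymon is stopped; switch dry_run to False
# only after checking the list it prints.
```

Listing first and deleting second makes it easier to spot which stale file was responsible, which the thread never established.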
participants (5)
- colin.coe@gmail.com
- everett.vernon@gmail.com
- glauber.ribeiro@experian.com
- Phil.Crooker@orix.com.au
- Tony.Clark@uk.mizuho-sc.com