Discussion:
False Process Down Alerts
Chris Naude
2010-01-16 03:59:20 UTC
Permalink
I've run into a strange problem with my Xymon server. I noticed today that
I'm receiving random false alerts for processes being down. When I look at
the process list output in the alert, it looks as if the data coming from the
clients isn't correct. Here is an example. Has anyone seen anything like
this?

9613 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
10389 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
9794 1 oracle 10:55:57 S 154 0.00 00:00:0
217600]oracleTEST (LOCAL=NO)
1592 1 oracle Jan 11 S 154 0.00 00:00:11 217136 ora_mman_TEST
12751 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
8965 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c


11819 1 oracle Jan 12 S 154 0.00 00:00:07 217280 ora_j015_TEST
2711 1 roo
]ec 4 S 120 0.04 00:02:16 868 /usr/sbin/xntpd
3547 1 xymon Dec 4 S 168 0.00 00:00:43 268
/opt/xymon/client/bin/hobbitlaunch
--config=/opt/xymon/client/etc/clientlaunch.cfg
--log=/opt/xymon/client/logs/clientlaunch.log
--pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
3728 1 root Dec 4 R 152 0.00 00:00:37 4208
/usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor



Xymon version: 4.3.0-0.beta2
Xymon server: CentOS 5.4 32 bit

Client: HP-UX 11.31 Itanium
--
Chris Naude
Lars Ebeling
2010-01-16 14:56:04 UTC
Permalink
It looks like two instances of the client are writing to the file at the same time or almost ;)

Lars
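
A rough way to check for a second client instance is to count the launcher processes in the client's ps listing. The sketch below is only an illustration: the helper name `find_launchers` is made up, and the sample lines are shortened from the listing in the original post.

```python
def find_launchers(ps_lines, needle="hobbitlaunch"):
    """Return the ps lines whose command mentions the client launcher."""
    return [line for line in ps_lines if needle in line]


# Sample lines shortened from the process list in the original post.
sample = [
    "3547 1 xymon Dec 4 S 168 0.00 00:00:43 268 /opt/xymon/client/bin/hobbitlaunch",
    "2711 1 root Dec 4 S 120 0.04 00:02:16 868 /usr/sbin/xntpd",
]

launchers = find_launchers(sample)
if len(launchers) > 1:
    print("More than one client launcher appears to be running:")
    for line in launchers:
        print("  " + line)
else:
    print("Launcher instances found:", len(launchers))
```

If more than one launcher shows up for the same client, the startup scripts are the first place to look.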
Chris Naude
2010-01-16 17:44:31 UTC
Permalink
That makes a lot of sense. I did have some issues with the startup scripts
on HP-UX. I'll check it out later tonight. Hopefully I can get it fixed
before it goes live tonight. Thanks!
--
Chris Naude
Chris Naude
2010-01-17 23:11:44 UTC
Permalink
The problem has suddenly become much, much worse. I verified with tcpdump
that the data coming from the client is 100% correct. It seems something on
the Xymon server side is not handling the client data correctly. Does anyone
have any other ideas?

[image: red] 89% /testdb3 (37771472% used) has reached the PANIC level (95%)

Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/vgtestdb1/lvol1 107844344 70901816 36942528 66% /testdb1
/dev/vgtestdb2/lvol1 35962064 25453128 10508936 71% /testdb2
/dev/vgtestdb4/lvol1 970909400 825006344 145903056 85% /testdb4
/dev/vgtestdb3/lv
l1 ] 338788224 301016752 37771472 89% /testdb3
/dev/vgtestdb5/lvol1 179789048 150553912 29235136 84% /testdb5
/dev/vg00/lvol8 24580711 74501 24506210 1% /home
/dev/vg00/lvol4 10226680 6339283 3887397 62% /opt
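
The odd "37771472% used" figure is what you would get if the capacity were read by column position from the broken line. The sketch below assumes such a positional parser, which may not be how Xymon actually reads df output. The `intact` line is a hypothetical reconstruction patterned on the neighbouring rows, and `capacity_field` is a made-up helper.

```python
def capacity_field(df_line, index=4):
    """Return the whitespace-separated field at `index`, or None if missing."""
    fields = df_line.split()
    return fields[index] if len(fields) > index else None


# Hypothetical intact row, patterned on the other rows in the listing.
intact = "/dev/vgtestdb3/lvol1 338788224 301016752 37771472 89% /testdb3"
# Second half of the broken row exactly as it appears in the alert.
broken = "l1 ] 338788224 301016752 37771472 89% /testdb3"

print(capacity_field(intact))  # 89%
print(capacity_field(broken))  # 37771472
```

The stray "l1 ]" adds one extra token, so every later column shifts by one and the Available value lands where Capacity is expected.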
--
Chris Naude
Josh Luthman
2010-01-17 23:21:15 UTC
Permalink
Is there only one client sending data as this name? I don't think you
answered Lars' email.

What does the alert read and what does the data say? Missing process? Too
high of a load?

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

"The secret to creativity is knowing how to hide your sources."
--- Albert Einstein
Chris Naude
2010-01-18 00:08:28 UTC
Permalink
I have 7 clients running. Each client has a different name. They are all
sending data to the primary Xymon server. The alerts report missing
processes, full file systems, and msgs errors. Here is another sample of an
unusual error. You can see the process list has a funky break in it.

Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok

[image: yellow] Expected string COMMAND not found in ps output header

PID PPID USER
STIM] S PRI %CPU TIME VSZ COMMAND
0 0 root Dec 14 S 127 0.16 00:40:00 0 swapper
1 0 root Dec 14 R 152 0.09 00:01:21 2064 init
48 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
45 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
42 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
31 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
30 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
29 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
28 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
26 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
5 0 root Dec 14 R 152 0.00 00:00:02 0 signald
6 0 root Dec 14 R 152 0.00 00:00:03 0 kmemdaemon
17 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
16 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
15 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
14 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
13 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
12 0 root Dec 14 S 152 0.00 00:00:00 0 usbhubd
11 0 root Dec 14 R 152 0.00 00:01:11 0 escsid
10 0 root Dec 14 S -32 0.00 00:00:00 0 ttisr
9 0 root Dec 14 R 152 0.00 00:01:27 0 ksyncer_daemon

7 0]root Dec 14 R 152
0.00 00:]0:00 0 kai_daemon
50 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
47 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
44 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
41 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
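
The yellow alert follows directly from the split header: if only the first line is treated as the header, COMMAND is not on it. A minimal sketch of that check, assuming the header is simply the first line of the ps section (the function name and the intact header are illustrative guesses, not taken from Xymon):

```python
def header_has_column(ps_output, column="COMMAND"):
    """True if the first line of the ps output contains the given column name."""
    lines = ps_output.splitlines()
    return bool(lines) and column in lines[0].split()


# Header as it arrived in the alert, split across two lines.
broken = "PID PPID USER\nSTIM] S PRI %CPU TIME VSZ COMMAND\n"
# Hypothetical intact header for comparison.
intact = "PID PPID USER STIME S PRI %CPU TIME VSZ COMMAND\n"

print(header_has_column(broken))  # False
print(header_has_column(intact))  # True
```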


--
Chris Naude
Chris Naude
2010-01-18 19:20:43 UTC
Permalink
I've managed to stop the flood of false alerts. I removed all of my non-prod
clients from the bb-hosts and shut off their client processes. The problem
seems to be somehow related to the amount of data the Xymon server is trying
to process.
--
Chris Naude
Odinn
2010-01-18 20:03:37 UTC
Permalink
My Xymon server monitors over 1500 clients with no issues. When I see false alerts, it has always been a configuration error on my part, where I have 2 servers in my bb-hosts file using the same name on different IPs.
--


Jim Sloan


Just remember, today is the day you thought tomorrow was going to be yesterday.
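
A quick way to look for that kind of mistake is to scan bb-hosts for a hostname listed under more than one IP. This is only a sketch: it assumes host entries have the form "IP hostname # tags", skips everything else, and the sample entries and helper name are invented.

```python
import re
from collections import defaultdict

# Matches lines that start with an IPv4 address followed by a hostname.
HOST_LINE = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)")


def duplicate_hostnames(bb_hosts_text):
    """Map each hostname that appears under several IPs to the sorted IP list."""
    ips_by_name = defaultdict(set)
    for line in bb_hosts_text.splitlines():
        match = HOST_LINE.match(line)
        if match:
            ip, name = match.groups()
            ips_by_name[name].add(ip)
    return {name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1}


# Hypothetical bb-hosts content for illustration only.
sample = """\
# test systems
10.0.0.1 db01.example.com #
10.0.0.2 db02.example.com #
10.0.0.9 db01.example.com #
"""

print(duplicate_hostnames(sample))
```

Any name it prints is being fed by two different hosts, which can produce exactly this sort of flapping status.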




Williams, Doug (Consultant-RIC)
2010-01-18 19:41:23 UTC
Permalink
It seems to me that your clients' data is being truncated. Try modifying
these settings in your hobbitserver.cfg. You may want to set them to a size
appropriate for your Xymon server. I have Xymon running on pretty beefy
servers, so I set these incredibly high. They may exceed what Xymon
actually allows, but that is not hurting me. Restart the hobbit server
after changing hobbitserver.cfg.



MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
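
To pick a sensible value rather than a huge one, it can help to compare the size of a real client message with the configured limit. The sketch below assumes the MAXMSG_* values are given in kB (check the hobbitserver.cfg documentation for your version); the helper names and the sample numbers are made up.

```python
def read_maxmsg(cfg_text):
    """Collect MAXMSG_* settings from KEY=VALUE lines, ignoring comments."""
    limits = {}
    for line in cfg_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("MAXMSG_") and "=" in line:
            key, value = line.split("=", 1)
            limits[key.strip()] = int(value.strip().strip('"'))
    return limits


def fits(message, limit_kb):
    """True if the message is within the limit (assumed to be in kB)."""
    return len(message) <= limit_kb * 1024


# Hypothetical config fragment and message size for illustration.
cfg = 'MAXMSG_STATUS=256\nMAXMSG_CLIENT="512"  # client data\n'
limits = read_maxmsg(cfg)
client_message = b"x" * (600 * 1024)  # stand-in for a captured client message

print(limits)
print(fits(client_message, limits["MAXMSG_CLIENT"]))  # False
```

A client message larger than the limit would be cut off, which matches the broken process and df listings earlier in the thread.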


Chris Naude
2010-01-19 00:46:54 UTC
Permalink
I never received any alerts about messages being truncated. After disabling
the non-prod clients I started receiving alerts about the messages being
truncated. I adjusted these values as specified below and they are good now.
Tomorrow I'll enable the non-prod servers again and see if this is what the
original culprit was. Thanks!
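
For anyone wanting to see whether a captured client message is near the configured limit before raising it, here is a rough sketch. It assumes the MAXMSG_* values are interpreted as kilobytes, which should be confirmed against the hobbitserver.cfg man page for the version in use. The function name, the limits, and the sample payload are made up for illustration; none of this is part of Xymon itself.

```python
def would_truncate(message: bytes, limit_kb: int) -> bool:
    """Return True if the message is larger than limit_kb kilobytes."""
    # Assumption: the server limit is expressed in kB (1 kB = 1024 bytes here).
    return len(message) > limit_kb * 1024


# Hypothetical client report: a short header padded to roughly 600 kB.
sample_report = b"client host1.hpux\n[ps]\n" + b"x" * (600 * 1024)

# Hypothetical limits for comparison, not recommended values.
for limit_kb in (512, 1024):
    print(limit_kb, would_truncate(sample_report, limit_kb))
```

Feeding it the payload saved from a tcpdump capture, instead of the padded sample, would show how much headroom a given limit leaves.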



On Mon, Jan 18, 2010 at 12:41 PM, Williams, Doug (Consultant-RIC) wrote:
It seems to me your clients' data is being truncated. Try modifying these
settings in your hobbitserver.cfg. You may want to set them to a size
appropriate for your Xymon server. I have Xymon running on pretty beefy
servers, so I set these incredibly high; they may exceed what Xymon
actually allows, but that is not hurting me. Restart the hobbit server
after changing hobbitserver.cfg.
MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
-----Original Message-----
Sent: Monday, January 18, 2010 2:21 PM
Subject: Re: [hobbit] False Process Down Alerts
I've managed to stop the flood of false alerts. I removed all of my
non-prod clients from the bb-hosts and shut off their client processes.
The problem seems to be somehow related to the amount of data the Xymon
server is trying to process.
On Sun, Jan 17, 2010 at 5:08 PM, Chris Naude wrote:
I have 7 clients running. Each client has a different name. They
are all sending data to the primary Xymon server. The alerts are reading
missing processes, full file systems, and msgs errors. Here is another
sample of an unusual error. You can see the process list has a funky
break in it.
Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok
yellow<http://unixadmin.bestwestern.com/xymon/gifs/yellow.gif>
Expected string COMMAND not found in ps output header
PID PPID USER
STIM] S PRI %CPU TIME VSZ COMMAND
0 0 root Dec 14 S 127 0.16 00:40:00 0 swapper
1 0 root Dec 14 R 152 0.09 00:01:21 2064 init
48 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
45 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
42 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
31 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
30 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
29 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
28 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
26 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
5 0 root Dec 14 R 152 0.00 00:00:02 0 signald
6 0 root Dec 14 R 152 0.00 00:00:03 0 kmemdaemon
17 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
16 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
15 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
14 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
13 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
12 0 root Dec 14 S 152 0.00 00:00:00 0 usbhubd
11 0 root Dec 14 R 152 0.00 00:01:11 0 escsid
10 0 root Dec 14 S -32 0.00 00:00:00 0 ttisr
9 0 root Dec 14 R 152 0.00 00:01:27 0 ksyncer_daemon
7 0]root Dec 14 R 152
0.00 00:]0:00 0 kai_daemon
50 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
47 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
44 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
41 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
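
The damage in the listing above (a split header, stray `]` characters, rows broken in two) can be spotted mechanically in a saved copy of the output. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the rule that every row starts with a numeric PID and PPID is taken from this particular listing, and it is not how Xymon itself validates ps data.

```python
def check_ps_output(ps_text: str, expected_header_word: str = "COMMAND") -> list[str]:
    """Return a list of problems found in a block of ps output."""
    lines = ps_text.splitlines()
    problems = []
    # The first line should be the header and contain the expected column name.
    if not lines or expected_header_word not in lines[0].split():
        problems.append(f"header line does not contain {expected_header_word}")
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        # Assumption: a healthy row starts with a numeric PID and PPID.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            problems.append(f"line {number} does not start with PID and PPID: {line!r}")
    return problems


# A few lines copied from the alert quoted above.
sample = "\n".join([
    "PID PPID USER",
    "STIM] S PRI %CPU TIME VSZ COMMAND",
    "0 0 root Dec 14 S 127 0.16 00:40:00 0 swapper",
    "7 0]root Dec 14 R 152",
    "0.00 00:]0:00 0 kai_daemon",
])

for problem in check_ps_output(sample):
    print(problem)
```

Run against the same listing captured on the client side, it gives a quick way to tell whether the data was already damaged before it left the client.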
On Sun, Jan 17, 2010 at 4:21 PM, Josh Luthman wrote:
Is there only one client sending data as this name? I
don't think you answered Lars' email.
What does the alert read and what does the data say?
Missing process? Too high of a load?
Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373
"The secret to creativity is knowing how to hide your sources."
--- Albert Einstein
On Sun, Jan 17, 2010 at 6:11 PM, Chris Naude wrote:
The problem has suddenly become much much worse.
I verified with tcpdump that the data coming from the client is 100%
correct. It seems something on the Xymon server side is not handling the
client data correctly. Anyone have any other ideas?
red 89% /testdb3 (37771472% used) has reached the PANIC level (95%)
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/vgtestdb1/lvol1 107844344 70901816 36942528 66% /testdb1
/dev/vgtestdb2/lvol1 35962064 25453128 10508936 71% /testdb2
/dev/vgtestdb4/lvol1 970909400 825006344 145903056 85% /testdb4
/dev/vgtestdb3/lv
l1 ] 338788224 301016752 37771472 89% /testdb3
/dev/vgtestdb5/lvol1 179789048 150553912 29235136 84% /testdb5
/dev/vg00/lvol8 24580711 74501 24506210 1% /home
/dev/vg00/lvol4 10226680 6339283 3887397 62% /opt
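
One possible reading of the "37771472% used" figure in the alert is that the broken /testdb3 row shifted the columns by one, so a parser that picks the capacity by position lands on the Available value instead. The sketch below only illustrates that idea. It is a guess at the mechanism, not Xymon's actual parsing code; the function name is hypothetical, and the intact row is a reconstruction of what the undamaged line would presumably look like.

```python
def naive_capacity(row: str) -> str:
    """Return the fifth whitespace-separated field with any trailing % removed."""
    return row.split()[4].rstrip("%")


# Assumed shape of the undamaged row (reconstructed, not taken from the alert).
intact_row = "/dev/vgtestdb3/lvol1 338788224 301016752 37771472 89% /testdb3"
# The second half of the damaged row as it appears in the alert.
damaged_row = "l1 ] 338788224 301016752 37771472 89%"

print(naive_capacity(intact_row))   # 89
print(naive_capacity(damaged_row))  # 37771472
```

The stray `]` counts as an extra field, which is enough to push the Available column into the position where the capacity is expected.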
On Sat, Jan 16, 2010 at 10:44 AM, Chris Naude wrote:
That makes a lot of sense. I did have
some issues with the startup scripts on HP-UX. I'll check it out later
tonight. Hopefully I can get it fixed before it goes live tonight.
Thanks!
On Sat, Jan 16, 2010 at 7:56 AM, Lars Ebeling wrote:
It looks like two instances of
the client are writing to the file at the same time or almost ;)
Lars
----- Original Message -----
From: Chris Naude
Sent: Saturday, January 16, 2010 4:59 AM
Subject: [hobbit] False Process Down Alerts
I've run into a strange problem with my Xymon server. I noticed today
that I'm receiving random false alerts for processes being down. When I
look at the process list output in the alert, it looks as if the data
coming from the clients isn't correct. Here is an example. Has anyone
seen anything like this?
9613 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
10389 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
9794 1 oracle 10:55:57 S 154 0.00 00:00:0
217600]oracleTEST (LOCAL=NO)
1592 1 oracle Jan 11 S 154 0.00 00:00:11 217136 ora_mman_TEST
12751 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
8965 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
11819 1 oracle Jan 12 S 154 0.00 00:00:07 217280 ora_j015_TEST
2711 1 roo
]ec 4 S 120 0.04 00:02:16 868 /usr/sbin/xntpd
3547 1 xymon Dec 4 S 168 0.00 00:00:43 268 /opt/xymon/client/bin/hobbitlaunch
--config=/opt/xymon/client/etc/clientlaunch.cfg
--log=/opt/xymon/client/logs/clientlaunch.log
--pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
3728 1 root Dec 4 R 152 0.00 00:00:37 4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor
Xymon version: 4.3.0-0.beta2
Xymon server: CentOS 5.4 32 bit
Client: HP-UX 11.31 Itanium
--
Chris Naude
Stewart, Tom L.
2010-01-19 04:27:25 UTC
Permalink
I had this problem and then made the adjustment. Since then, I get a
5-minute hole in load average and a couple of other trends, even though
on the Solaris systems I can use the multi-cpu and zone process without
any problems. Most of the time when the hole shows up, other 5-minute
stats go missing exactly one hour after the first one, and this repeats
two or three times. I have tried disabling the caching, but it made no
difference. The 4.3.0-2 beta seems to be very broken and no one knows
why. Right now, I am trying to determine whether I am better off with
another product, since issues do not seem to be a priority with anyone.



Tom



________________________________

From: Chris Naude [mailto:chris.naude.0-***@public.gmane.org]
Sent: Monday, January 18, 2010 6:47 PM
To: hobbit-pDmt/***@public.gmane.org
Subject: Re: [hobbit] False Process Down Alerts



I never received any alerts about messages being truncated. After
disabling the non-prod clients I started receiving alerts about the
messages being truncated. I adjusted these values as specified below and
they are good now. Tomorrow I'll enable the non-prod servers again and
see if this is what the original culprit was. Thanks!





On Mon, Jan 18, 2010 at 12:41 PM, Williams, Doug (Consultant-RIC)
<Doug.Williams-***@public.gmane.org> wrote:

It seems to me your clients' data is being truncated. Try modifying these
settings in your hobbitserver.cfg. You may want to set them to a size
appropriate for your Xymon server. I have Xymon running on pretty beefy
servers, so I set these incredibly high; they may exceed what Xymon
actually allows, but that is not hurting me. Restart the hobbit server
after changing hobbitserver.cfg.



MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000



-----Original Message-----
From: Chris Naude [mailto:chris.naude.0-***@public.gmane.org]
Sent: Monday, January 18, 2010 2:21 PM
To: hobbit-pDmt/***@public.gmane.org
Subject: Re: [hobbit] False Process Down Alerts

I've managed to stop the flood of false alerts. I removed all of my
non-prod clients from the bb-hosts and shut off their client processes.
The problem seems to be somehow related to the amount of data the Xymon
server is trying to process.


On Sun, Jan 17, 2010 at 5:08 PM, Chris Naude <chris.naude.0-***@public.gmane.org>
wrote:


I have 7 clients running. Each client has a different name. They
are all sending data to the primary Xymon server. The alerts are reading
missing processes, full file systems, and msgs errors. Here is another
sample of an unusual error. You can see the process list has a funky
break in it.


Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok

yellow<http://unixadmin.bestwestern.com/xymon/gifs/yellow.gif>

Expected string COMMAND not found in ps output header

PID PPID USER
STIM] S PRI %CPU TIME VSZ COMMAND
0 0 root Dec 14 S 127 0.16 00:40:00 0 swapper
1 0 root Dec 14 R 152 0.09 00:01:21 2064 init
48 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
45 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
42 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
31 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
30 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
29 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
28 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
26 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
5 0 root Dec 14 R 152 0.00 00:00:02 0 signald
6 0 root Dec 14 R 152 0.00 00:00:03 0 kmemdaemon
17 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
16 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
15 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
14 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
13 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
12 0 root Dec 14 S 152 0.00 00:00:00 0 usbhubd
11 0 root Dec 14 R 152 0.00 00:01:11 0 escsid
10 0 root Dec 14 S -32 0.00 00:00:00 0 ttisr
9 0 root Dec 14 R 152 0.00 00:01:27 0 ksyncer_daemon
7 0]root Dec 14 R 152
0.00 00:]0:00 0 kai_daemon
50 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
47 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
44 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
41 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached

On Sun, Jan 17, 2010 at 4:21 PM, Josh Luthman
<josh-hPvxz4IaMr62QFlZL/IlyMI/UQi/***@public.gmane.org> wrote:


Is there only one client sending data as this name? I
don't think you answered Lars' email.

What does the alert read and what does the data say?
Missing process? Too high of a load?

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

"The secret to creativity is knowing how to hide your
sources."
--- Albert Einstein



On Sun, Jan 17, 2010 at 6:11 PM, Chris Naude
<chris.naude.0-***@public.gmane.org> wrote:


The problem has suddenly become much much worse.
I verified with tcpdump that the data coming from the client is 100%
correct. It seems something on the Xymon server side is not handling the
client data correctly. Anyone have any other ideas?

red 89% /testdb3 (37771472% used) has reached the PANIC level (95%)

Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/vgtestdb1/lvol1 107844344 70901816 36942528 66% /testdb1
/dev/vgtestdb2/lvol1 35962064 25453128 10508936 71% /testdb2
/dev/vgtestdb4/lvol1 970909400 825006344 145903056 85% /testdb4
/dev/vgtestdb3/lv
l1 ] 338788224 301016752 37771472 89% /testdb3
/dev/vgtestdb5/lvol1 179789048 150553912 29235136 84% /testdb5
/dev/vg00/lvol8 24580711 74501 24506210 1% /home
/dev/vg00/lvol4 10226680 6339283 3887397 62% /opt


On Sat, Jan 16, 2010 at 10:44 AM, Chris Naude
<chris.naude.0-***@public.gmane.org> wrote:


That makes a lot of sense. I did have
some issues with the startup scripts on HP-UX. I'll check it out later
tonight. Hopefully I can get it fixed before it goes live tonight.
Thanks!


On Sat, Jan 16, 2010 at 7:56 AM, Lars
Ebeling <lars.ebeling-***@public.gmane.org> wrote:


It looks like two instances of
the client are writing to the file at the same time or almost ;)


Lars

----- Original Message -----
From: Chris Naude <mailto:chris.naude.0-***@public.gmane.org>
To: hobbit-pDmt/***@public.gmane.org
Sent: Saturday, January 16, 2010 4:59 AM
Subject: [hobbit] False Process Down Alerts

I've run into a strange problem with my Xymon server. I noticed today
that I'm receiving random false alerts for processes being down. When I
look at the process list output in the alert, it looks as if the data
coming from the clients isn't correct. Here is an example. Has anyone
seen anything like this?

9613 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
10389 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
9794 1 oracle 10:55:57 S 154 0.00 00:00:0
217600]oracleTEST (LOCAL=NO)
1592 1 oracle Jan 11 S 154 0.00 00:00:11 217136 ora_mman_TEST
12751 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
8965 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c

11819 1 oracle Jan 12 S 154 0.00 00:00:07 217280 ora_j015_TEST
2711 1 roo
]ec 4 S 120 0.04 00:02:16 868 /usr/sbin/xntpd
3547 1 xymon Dec 4 S 168 0.00 00:00:43 268 /opt/xymon/client/bin/hobbitlaunch
--config=/opt/xymon/client/etc/clientlaunch.cfg
--log=/opt/xymon/client/logs/clientlaunch.log
--pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
3728 1 root Dec 4 R 152 0.00 00:00:37 4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor


Xymon version: 4.3.0-0.beta2
Xymon server: CentOS 5.4 32 bit

Client: HP-UX 11.31 Itanium

--
Chris Naude




