#151080 ippl stops logging packets, processes sleep

Package:
ippl
Source:
ippl
Description:
IP protocols logger
Submitter:
Matthew Astley
Date:
2005-07-18 03:56:13 UTC
Severity:
normal
#151080#5
Date:
2002-06-26 15:37:33 UTC
From:
To:
ippl has been running for a few weeks, and it has stopped logging
packets. It appears to have hanged or gone to sleep.

  Jun 18 23:50:37 captain-flack ippl: ICMP message type echo request from troop.granta.internal [10.0.0.15]
  Jun 18 23:50:52 captain-flack ippl: ICMP message type echo request from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:00 captain-flack ippl: netbios-ssn connection attempt from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:03 captain-flack ippl: imap2 connection attempt from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:05 captain-flack ippl: smtp connection attempt from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:07 captain-flack ippl: ICMP message type echo request from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:22 captain-flack ippl: ICMP message type echo request from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:37 captain-flack ippl: ICMP message type echo request from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:52 captain-flack ippl: ICMP message type echo request from troop.granta.internal [10.0.0.15]
  Jun 18 23:51:58 captain-flack ippl: ssh connection attempt from (--me at home--)

After this, silence for days until I noticed. The regular pings and
probes from the 'mon' on troop continue as normal but are not logged.

Last time this happened, a few weeks ago, the last ippl event was an
ftp probe from somewhere, but there is no ftp server here.

All the network interfaces attached and in use (eth0, eth1, ippp0, lo,
sl0) appear to be functioning normally.

Until recently, ippl had run reliably for months. I had changed the
config file (details below), and I believe I did this before the first
hang .. but the logs have expired so I don't know for sure. I believe
my new config is sane.


`pstree -p` reports

  init(1)-+-apache(25841)-+-apache(20278)
    [...]
	  |-ippl(18629)---ippl(18630)-+-ippl(18631)
	  |                           |-ippl(18632)
	  |                           `-ippl(18633)
    [...]

This looks normal. `ps lxa` reports

    F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
  [...]
  040     0 18629     1   9   0  7820  688 rt_sig S    ?          0:00 /usr/sbin/ippl
  040     0 18630 18629   9   0  7820  688 poll   S    ?          0:00 /usr/sbin/ippl
  140 65534 18631 18630   9   0  7820  688 rt_sig S    ?          0:19 /usr/sbin/ippl
  140 65534 18632 18630   9   0  7820  688 rt_sig S    ?          2:04 /usr/sbin/ippl
  140 65534 18633 18630   9   0  7820  688 rt_sig S    ?          0:30 /usr/sbin/ippl
  [...]

I don't know what 'rt_sig' is, and Google isn't being very helpful. I
can investigate further but I'm hoping the cause will be more obvious
to someone who knows the code.


Running strace on the lot (I think I can do this, I'm not a great
strace user),

  matthew@captain-flack:~$ sudo strace -p 18629 -p 18630 -p 18631 -p 18632 -p 18633
  attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
  [pid 18633] --- SIGSTOP (Stopped (signal)) ---
  [pid 18630] --- SIGSTOP (Stopped (signal)) ---

(I believe these stop states were caused by strace sessions from
moments before, which I then killed)

  [pid 18630] getppid()                   = 18629
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629

The other pids are quiet, even when I strace them directly by
themselves.

fd 5 is the first end of a pipe, according to /proc/18630/fd .


In the interest of making something happen, I started throwing signals
at all the processes.

`sudo kill -HUP 18630 18631 18632 18633 18629` at first caused an
earlier `strace -p 18629` to report "--- SIGHUP (Hangup) ---" and
exit. Oops, I thought they'd all gone away already.

Further -HUPs did nothing, nor did -0 or -CONT. -ALRM got a response:

[...]
  [pid 18630] poll([{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 18629
  [pid 18630] poll( <unfinished ...>
  [pid 18632] --- SIGALRM (Alarm clock) ---
  [pid 18632] +++ killed by SIGALRM +++
  [pid 18630] <... poll resumed> [{fd=5, events=POLLIN}], 1, 2000) = 0
  [pid 18630] getppid()                   = 1
  [pid 18630] kill(18633, SIGKILL <unfinished ...>
  [pid 18633] +++ killed by SIGKILL +++
  [pid 18630] <... kill resumed> )        = 0
  [pid 18630] kill(18632, SIGKILL)        = 0
  [pid 18630] kill(18631, SIGKILL <unfinished ...>
  [pid 18631] +++ killed by SIGKILL +++
  <... kill resumed> )                    = 0
  _exit(0)                                = ?

and the ippl processes have all gone away.

I've restarted it and it now works:

    F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
  [...]
  040     0 31369     1   9   0  7816  888 nanosl S    ?          0:00 /usr/sbin/ippl
  040     0 31370 31369   9   0  7816  888 poll   S    ?          0:00 /usr/sbin/ippl
  140 65534 31371 31370   9   0  7816  888 wait_f S    ?          0:00 /usr/sbin/ippl
  140 65534 31372 31370   9   0  7816  888 wait_f S    ?          0:00 /usr/sbin/ippl
  140 65534 31373 31370   9   0  7816  888 wait_f S    ?          0:00 /usr/sbin/ippl


Reportbug didn't include the config file, not sure why.

  matthew@captain-flack:~$ grep -vE '^#' /etc/ippl.conf
  resolve all
  run all
  ignore icmp type echo_reply
  log options ident,resolve tcp port telnet
  ignore udp port domain
  ignore udp srcport domain
  ignore udp to 10.0.0.255
  ignore udp port 9163
  ignore tcp port pop3 from 10.0.0.0/24 to 10.0.0.2
  ignore tcp port 3128 from 10.0.0.0/24 to 10.0.0.2

(blank lines removed)


Comparing against the backup config,

  matthew@captain-flack:~$ grep -vE '^#' /etc/ippl.conf | perl -ne 'print if /\S/' > /tmp/conf
  matthew@captain-flack:~$ grep -vE '^#' /etc/ippl.conf~ | perl -ne 'print if /\S/' > /tmp/conf~
  matthew@captain-flack:~$ touch -r /etc/ippl.conf~ /tmp/conf~
  matthew@captain-flack:~$ touch -r /etc/ippl.conf /tmp/conf
  matthew@captain-flack:~$ diff -u /tmp/conf{~,}
  --- /tmp/conf~  Tue Jan  1 10:51:08 2002
  +++ /tmp/conf   Wed May 15 14:50:30 2002
  @@ -1,3 +1,10 @@
  -run icmp tcp
  +resolve all
  +run all
   ignore icmp type echo_reply
   log options ident,resolve tcp port telnet
  +ignore udp port domain
  +ignore udp srcport domain
  +ignore udp to 10.0.0.255
  +ignore udp port 9163
  +ignore tcp port pop3 from 10.0.0.0/24 to 10.0.0.2
  +ignore tcp port 3128 from 10.0.0.0/24 to 10.0.0.2

#151080#10
Date:
2002-06-27 17:33:19 UTC
From:
To:
Hello!

Thanks for your very good report, but I can't do anything else, then
forwarding your bugreport to the upstream and ask him to talk with you,
since this seems an upstream problem, and the upstream version is
1.4.14, which is the current debian version too. If you or the upstream
can provide a patch, I will happily upload a new debian version, until
the next upstream release fixes this error.

This message will be cc'd to the upstream, and in a next mail I will
send him the bugreport.

Thanks for the bugreport,
RISKO Gergely

#151080#15
Date:
2002-07-01 11:15:18 UTC
From:
To:
(ippl stops logging, all processes in 'S'leep state)


Another snippet of information: this time it stopped, I did
 /etc/init.d/ippl restart

and it left a zombie process, apparently the second child, the parent
of the three threads.

Before the restart, with no output from ippl:

  16909 ?        S      0:00 /usr/sbin/ippl
  16910 ?        S      0:00 /usr/sbin/ippl
  16911 ?        S      0:01 /usr/sbin/ippl
  16912 ?        S      0:10 /usr/sbin/ippl
  16913 ?        S      0:02 /usr/sbin/ippl

(I forgot to check what WCHAN state they were in, sorry)

  # /etc/init.d/ippl restart
  Stopping IP Protocols Logger.
  Starting IP Protocols Logger: ippl.

  $ ps ax | grep ippl
  16909 ?        S      0:00 /usr/sbin/ippl
  16910 ?        Z      0:00 [ippl <defunct>]
  22041 ?        S      0:00 /usr/sbin/ippl
  22042 ?        S      0:00 /usr/sbin/ippl
  22043 ?        S      0:00 /usr/sbin/ippl
  22044 ?        S      0:00 /usr/sbin/ippl
  22045 ?        S      0:00 /usr/sbin/ippl
  22058 pts/1    S      0:00 grep ippl

  $ pstree -p | grep ippl
        |-ippl(16909)---ippl(16910)
        |-ippl(22041)---ippl(22042)-+-ippl(22043)
        |                           |-ippl(22044)
        |                           `-ippl(22045)

The first (dead) set goes away if I 'kill -ALRM 16909'. I picked ALRM
because it laid waste to the lot of them last time, and because it may
be more informative that -9.

So, my random guess is that something is blocking in the top level
parent? My disks aren't full, not that syslog cares AFAIK.

I'll get the code out and have a look if it bugs me again. Having said
this, I now have a cronjob to restart every once in a while. Let's
watch the zombies pile up...


Hmm, interesting. The current pidfile has '22041' in it, presumably
this is always the parent thread, so whatever signal the Debian

 start-stop-daemon --stop --quiet --oknodo --pidfile $PIDFILE

command sends, it isn't doing the job quite right. Odd that all the
kiddies die off though.


Matthew  #8-)

#151080#20
Date:
2002-07-01 14:50:25 UTC
From:
To:
Have you by chance enabled udp logging ?

If so try turning that of and see if it works for you.

Best,
Soeren.