#528818 fdm: stale lock file remains present

Package:
fdm
Source:
fdm
Description:
fetching, filtering and delivering emails
Submitter:
Ritesh Raj Sarraf
Date:
2011-07-26 14:03:11 UTC
Severity:
normal
#528818#5
Date:
2009-05-15 20:20:00 UTC
From:
To:
Hi Frank,

There seems to be a problem with the lock file in fdm.

I currently have fdm configured to fetch emails from a POP3 server every
10 minutes, using a cron job.

In .fdm.conf, I'm using the following to ensure that there's only one
instance of fdm running.

# Lock file path.
set lock-file "${base}/lock"


Since I run it on a laptop, which I frequently connect, disconnect and
hibernate, at some point one of these events happens while fdm is running.
I guess that is when fdm sometimes dies and leaves a stale lock file.
This then prevents any further fdm processes from running, because they
see the lock file and assume that an fdm process is already running.
Thus I end up receiving no email messages until I realise that fdm
has left a stale lock file. Once I manually remove it, everything runs
as normal.

IMO fdm should handle such error conditions and remove the lock file
when they occur.
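
One common way to detect a stale lock of the kind described above is to store the owner's PID in the lock file and probe it with kill(pid, 0). The sketch below is only an illustration of that idea: the helper name acquire_lock and the PID-in-file convention are assumptions, not fdm's behaviour, and the check can be fooled by PID reuse.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not fdm code: take a lock file that holds the
 * owner's PID. Returns 0 on success, -1 if a live process appears to hold
 * the lock or on any other error.
 */
int
acquire_lock(const char *path)
{
	char	 buf[32];
	FILE	*f;
	long	 pid;
	int	 fd, len, tries;

	assert(path != NULL);

	for (tries = 0; tries < 2; tries++) {
		/* O_EXCL makes creation fail if the file already exists. */
		fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
		if (fd != -1) {
			len = snprintf(buf, sizeof buf, "%ld\n", (long) getpid());
			if (write(fd, buf, (size_t) len) != len) {
				close(fd);
				unlink(path);
				return (-1);
			}
			close(fd);
			return (0);
		}
		if (errno != EEXIST)
			return (-1);

		/* The lock exists: read the recorded PID. */
		if ((f = fopen(path, "r")) == NULL)
			return (-1);
		pid = 0;
		if (fscanf(f, "%ld", &pid) != 1)
			pid = 0;
		fclose(f);

		/* No usable PID: we cannot tell, so treat the lock as held. */
		if (pid <= 0)
			return (-1);

		/* kill(pid, 0) delivers nothing, it only checks existence. */
		if (kill((pid_t) pid, 0) == 0 || errno != ESRCH)
			return (-1);

		/* The owner is gone: remove the stale lock and retry once. */
		if (unlink(path) != 0 && errno != ENOENT)
			return (-1);
	}
	return (-1);
}
```

Two processes that both decide the lock is stale can still race between the unlink and the retry, so this is a sketch of the idea rather than a robust locking scheme.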


Ritesh

#528818#10
Date:
2009-05-16 08:21:00 UTC
From:
To:
fdm only removes the lock file if it exits normally (success or a normal
error). It will not remove it on a fatal error (which means a bug) or on
SIGKILL, and it is never going to, since I want to know about these cases and
fix them. Please send me the -vvvv log of a session (don't forget to remove
passwords and anything else sensitive) when it is interrupted and leaves behind
a lock file.

#528818#15
Date:
2009-05-16 16:43:44 UTC
From:
To:
I suspected that you'd ask me to do this. :-)

No problem. I'll set it to run with -vvvv and wait for the next trigger.

Ritesh

#528818#20
Date:
2009-05-16 17:25:05 UTC
From:
To:
Okay, thanks. Let me know.

#528818#25
Date:
2009-05-22 18:56:23 UTC
From:
To:
Hi Nicholas,

The bug triggered today. I'll be sending you the log (run with -vvvv) in a
separate mail personally.

Ritesh

#528818#30
Date:
2009-05-24 15:01:18 UTC
From:
To:
Okay, this is the problem:

gmail-researchut: spamprobe receive: io: poll: Connection timed out

Your pipe process does not return within the timeout, so fdm aborts the
fetch. Unfortunately this then sets off a chain of events which sometimes ends
with fdm not exiting correctly.

fdm's management of child processes isn't always very good, particularly when
an error occurs - sometimes stuff doesn't die in the right order and it ends up
with fatal errors. The whole multiprocess idea is to allow fdm to run lots of
things in parallel and to drop privileges when running as root, but I'm not
sure it is worth it now. I should really simplify it a lot and probably forget
about using it as root.

You can see similar problems with a configuration file such as:

    set timeout 2
    account 'stdin' stdin
    match exec 'sleep 10' returns (0, ) action drop

And then run, e.g.:

    echo|fdm -vvvv f
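
The failure mode reproduced above (a piped command that outlives the timeout) can be sketched outside fdm as follows. This is an illustration under assumptions, not fdm's cmd_poll(): the helper name run_with_timeout is hypothetical, the timeout restarts whenever the command produces output, and once stdout is closed the wait for the exit status is not bounded.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap one child, retrying on EINTR. Returns its exit status or -1. */
static int
reap(pid_t pid)
{
	int	status;

	while (waitpid(pid, &status, 0) == -1) {
		if (errno != EINTR)
			return (-1);
	}
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}

/*
 * Hypothetical helper: run "sh -c cmd" and wait up to timeout_ms for it to
 * write to or close its stdout. Returns the command's exit status, or -1
 * on timeout or error. The child is always reaped before returning.
 */
int
run_with_timeout(const char *cmd, int timeout_ms)
{
	struct pollfd	pfd;
	char		buf[256];
	pid_t		pid;
	int		fds[2], n;

	assert(cmd != NULL);

	if (pipe(fds) != 0)
		return (-1);
	switch (pid = fork()) {
	case -1:
		close(fds[0]);
		close(fds[1]);
		return (-1);
	case 0:
		/* Child: send stdout into the pipe and run the command. */
		close(fds[0]);
		dup2(fds[1], STDOUT_FILENO);
		close(fds[1]);
		execl("/bin/sh", "sh", "-c", cmd, (char *) NULL);
		_exit(127);
	}
	close(fds[1]);

	for (;;) {
		pfd.fd = fds[0];
		pfd.events = POLLIN;
		pfd.revents = 0;
		n = poll(&pfd, 1, timeout_ms);
		if (n == -1 && errno == EINTR)
			continue;
		if (n <= 0) {
			/* Timeout or poll error: terminate and reap the child. */
			kill(pid, SIGTERM);
			close(fds[0]);
			(void) reap(pid);
			return (-1);
		}
		/* Discard output; stop on end-of-file or read error. */
		if (read(fds[0], buf, sizeof buf) <= 0)
			break;
	}
	close(fds[0]);
	return (reap(pid));
}
```

With a 2000 ms timeout, run_with_timeout("exec sleep 10", 2000) corresponds to the configuration above: the command is sent SIGTERM and reaped instead of being left behind.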

This diff makes several changes which hopefully should make it a bit more
robust and correctly handle the various child processes on exit.

Please test.

Index: child-deliver.c
===================================================================
RCS file: /cvsroot/fdm/fdm/child-deliver.c,v
retrieving revision 1.21
diff -u -p -r1.21 child-deliver.c
--- child-deliver.c	17 May 2009 19:20:08 -0000	1.21
+++ child-deliver.c	24 May 2009 14:56:51 -0000
@@ -61,10 +61,10 @@ child_deliver(struct child *child, struc

 	if (privsep_send(pio, &msg, &msgbuf) != 0)
 		fatalx("privsep_send error");
-	if (privsep_recv(pio, &msg, NULL) != 0)
-		fatalx("privsep_recv error");
-	if (msg.type != MSG_EXIT)
-		fatalx("unexpected message");
+	do {
+		if (privsep_recv(pio, &msg, NULL) != 0)
+			fatalx("privsep_recv error");
+	} while (msg.type != MSG_EXIT);

 #ifdef DEBUG
 	COUNTFDS(a->name);
Index: child-fetch.c
===================================================================
RCS file: /cvsroot/fdm/fdm/child-fetch.c,v
retrieving revision 1.72
diff -u -p -r1.72 child-fetch.c
--- child-fetch.c	17 May 2009 18:23:45 -0000	1.72
+++ child-fetch.c	24 May 2009 14:56:51 -0000
@@ -127,11 +127,11 @@ child_fetch(struct child *child, struct
 	log_debug3("%s: sending exit message to parent", a->name);
 	if (privsep_send(pio, &msg, NULL) != 0)
 		fatalx("privsep_send error");
-	log_debug3("%s: waiting for exit message from parent", a->name);
-	if (privsep_recv(pio, &msg, NULL) != 0)
-		fatalx("privsep_recv error");
-	if (msg.type != MSG_EXIT)
-		fatalx("unexpected message");
+	do {
+		log_debug3("%s: waiting for exit message from parent", a->name);
+		if (privsep_recv(pio, &msg, NULL) != 0)
+			fatalx("privsep_recv error");
+	} while (msg.type != MSG_EXIT);

 #ifdef DEBUG
 	COUNTFDS(a->name);
Index: child.c
===================================================================
RCS file: /cvsroot/fdm/fdm/child.c,v
retrieving revision 1.147
diff -u -p -r1.147 child.c
--- child.c	17 May 2009 19:20:08 -0000	1.147
+++ child.c	24 May 2009 14:56:51 -0000
@@ -18,6 +18,7 @@

 #include <sys/types.h>
 #include <sys/socket.h>
+#include <sys/wait.h>

 #include <unistd.h>

@@ -35,6 +36,9 @@ child_sighandler(int sig)
 	case SIGUSR1:
 		sigusr1 = 1;
 		break;
+	case SIGCHLD:
+		waitpid(WAIT_ANY, NULL, WNOHANG);
+		break;
 	case SIGTERM:
 		cleanup_purge();
 		_exit(1);
@@ -60,6 +64,7 @@ child_fork(void)
 		sigaddset(&act.sa_mask, SIGUSR1);
 		sigaddset(&act.sa_mask, SIGINT);
 		sigaddset(&act.sa_mask, SIGTERM);
+		sigaddset(&act.sa_mask, SIGCHLD);
 		act.sa_flags = SA_RESTART;

 		act.sa_handler = SIG_IGN;
@@ -75,6 +80,8 @@ child_fork(void)
 			fatal("sigaction failed");
 		if (sigaction(SIGTERM, &act, NULL) < 0)
 			fatal("sigaction failed");
+		if (sigaction(SIGCHLD, &act, NULL) < 0)
+			fatal("sigaction failed");

 		return (0);
 	default:
Index: command.c
===================================================================
RCS file: /cvsroot/fdm/fdm/command.c,v
retrieving revision 1.54
diff -u -p -r1.54 command.c
--- command.c	17 May 2009 18:23:45 -0000	1.54
+++ command.c	24 May 2009 14:56:51 -0000
@@ -132,6 +132,8 @@ cmd_start(const char *s, int flags, cons
 			fatal("signal failed");
                 if (signal(SIGUSR2, SIG_DFL) == SIG_ERR)
 			fatal("signal failed");
+                if (signal(SIGCHLD, SIG_DFL) == SIG_ERR)
+			fatal("signal failed");

 		execl(_PATH_BSHELL, "sh", "-c", s, (char *) NULL);
 		fatal("execl failed");
@@ -271,10 +273,14 @@ cmd_poll(struct cmd *cmd, char **out, ch
 		CMD_DEBUG(cmd, "polling, timeout=%d", timeout);
 		switch (io_polln(ios, 3, &io, timeout, cause)) {
 		case -1:
+			CMD_DEBUG(cmd, "poll error: %s", strerror(errno));
 			if (errno == EAGAIN)
 				break;
+			kill(cmd->pid, SIGTERM);
 			return (-1);
 		case 0:
+			CMD_DEBUG(cmd, "poll closed");
+			kill(cmd->pid, SIGTERM);
 			/*
 			 * Check for closed. It'd be nice for closed input to
 			 * be an error, but we can't tell the difference
@@ -302,6 +308,7 @@ cmd_poll(struct cmd *cmd, char **out, ch
 			break;
 		}
 	}
+	CMD_DEBUG(cmd, "poll out");

 	/*
 	 * Retrieve and return a line if possible. This must be after the
Index: fdm.c
===================================================================
RCS file: /cvsroot/fdm/fdm/fdm.c,v
retrieving revision 1.182
diff -u -p -r1.182 fdm.c
--- fdm.c	17 May 2009 19:20:08 -0000	1.182
+++ fdm.c	24 May 2009 14:56:52 -0000
@@ -44,6 +44,7 @@ const char		*malloc_options = "AFGJPRX";

 void			 sighandler(int);
 struct child		*check_children(struct children *, u_int *);
+int			 wait_children(struct children *, struct children *);

 struct conf		 conf;

@@ -181,6 +182,79 @@ check_children(struct children *children
 	return (NULL);
 }

+/* Wait for a child and deal with its exit. */
+int
+wait_children(struct children *children, struct children *dead_children)
+{
+	struct child	*child, *child2;
+	pid_t		 pid;
+	int		 status, retcode = 0;
+	u_int		 i, j;
+
+	for (;;) {
+		log_debug3("parent: waiting for children");
+		/* Wait for a child. */
+		switch (pid = waitpid(WAIT_ANY, &status, WNOHANG)) {
+		case 0:
+			return (0);
+		case -1:
+			if (errno == ECHILD)
+				return (0);
+			fatal("waitpid failed");
+		}
+
+		/* Handle the exit status. */
+		if (WIFSIGNALED(status)) {
+			retcode = 1;
+			log_debug2("parent: child %ld got signal %d",
+			    (long) pid, WTERMSIG(status));
+		} else if (!WIFEXITED(status)) {
+			retcode = 1;
+			log_debug2("parent: child %ld exited badly",
+			    (long) pid);
+		} else {
+			if (WEXITSTATUS(status) != 0)
+				retcode = 1;
+			log_debug2("parent: child %ld returned %d",
+			    (long) pid, WEXITSTATUS(status));
+		}
+
+		/* Find this child. */
+		child = NULL;
+		for (i = 0; i < ARRAY_LENGTH(children); i++) {
+			child = ARRAY_ITEM(children, i);
+			if (pid == child->pid)
+				break;
+		}
+		if (i == ARRAY_LENGTH(children)) {
+			log_debug2("parent: unidentified child %ld",
+			    (long) pid);
+			continue;
+		}
+
+		if (child->io != NULL) {
+			io_close(child->io);
+			io_free(child->io);
+			child->io = NULL;
+		}
+		ARRAY_REMOVE(children, i);
+		ARRAY_ADD(dead_children, child);
+
+		/* If this child was the parent of any others, kill them too. */
+		for (j = 0; j < ARRAY_LENGTH(children); j++) {
+			child2 = ARRAY_ITEM(children, j);
+			if (child2->parent != child)
+				continue;
+
+			log_debug2("parent: child %ld died: killing %ld",
+			    (long) child->pid, (long) child2->pid);
+			kill(child2->pid, SIGTERM);
+		}
+	}
+
+	return (retcode);
+}
+
 __dead void
 usage(void)
 {
@@ -198,7 +272,6 @@ main(int argc, char **argv)
 	enum fdmop       op = FDMOP_NONE;
 	const char	*proxy = NULL, *s;
 	char		 tmp[BUFSIZ], *ptr, *lock = NULL, *user, *home = NULL;
-	long		 n;
 	struct utsname	 un;
 	struct passwd	*pw;
 	struct stat	 sb;
@@ -207,7 +280,7 @@ main(int argc, char **argv)
 	TAILQ_HEAD(, account) actaq; /* active accounts */
 	pid_t		 pid;
 	struct children	 children, dead_children;
-	struct child	*child, *child2;
+	struct child	*child;
 	struct io       *rio;
 	struct iolist	 iol;
 	double		 tim;
@@ -576,13 +649,12 @@ main(int argc, char **argv)
 	sigaddset(&act.sa_mask, SIGUSR1);
 	sigaddset(&act.sa_mask, SIGINT);
 	sigaddset(&act.sa_mask, SIGTERM);
+	sigaddset(&act.sa_mask, SIGCHLD);
 	act.sa_flags = SA_RESTART;

 	act.sa_handler = SIG_IGN;
 	if (sigaction(SIGPIPE, &act, NULL) < 0)
 		fatal("sigaction failed");
-	if (sigaction(SIGUSR1, &act, NULL) < 0)
-		fatal("sigaction failed");
 	if (sigaction(SIGUSR2, &act, NULL) < 0)
 		fatal("sigaction failed");

@@ -597,6 +669,8 @@ main(int argc, char **argv)
 		fatal("sigaction failed");
 	if (sigaction(SIGTERM, &act, NULL) < 0)
 		fatal("sigaction failed");
+	if (sigaction(SIGCHLD, &act, NULL) < 0)
+		fatal("sigaction failed");

 	/* Check lock file. */
 	lock = conf.lock_file;
@@ -675,21 +749,35 @@ main(int argc, char **argv)
 			    (long) child->pid, a->name);
 		}

-		/* Fill the io list. */
+		/* Check children and fill the io list. */
 		ARRAY_CLEAR(&iol);
 		for (i = 0; i < ARRAY_LENGTH(&children); i++) {
 			child = ARRAY_ITEM(&children, i);
-			ARRAY_ADD(&iol, child->io);
+			if (child->io != NULL)
+				ARRAY_ADD(&iol, child->io);
 		}

 		/* Poll the io list. */
-		n = io_polln(
-		    ARRAY_DATA(&iol), ARRAY_LENGTH(&iol), &rio, INFTIM, NULL);
-		switch (n) {
-		case -1:
-			fatalx("child socket error");
-		case 0:
-			fatalx("child socket closed");
+		if (ARRAY_LENGTH(&iol) != 0) {
+			switch (io_polln(
+			    ARRAY_DATA(&iol), ARRAY_LENGTH(&iol),
+			    &rio, INFTIM, NULL)) {
+			case -1:
+			case 0:
+				for (i = 0; i < ARRAY_LENGTH(&children); i++) {
+					child = ARRAY_ITEM(&children, i);
+					if (rio != child->io)
+						continue;
+					log_debug2("parent: child %ld socket "
+					    "error", (long) child->pid);
+					kill(child->pid, SIGTERM);
+
+					io_close(child->io);
+					io_free(child->io);
+					child->io = NULL;
+				}
+				break;
+			}
 		}

 		/* Check all children for pending privsep messages. */
@@ -703,50 +791,17 @@ main(int argc, char **argv)
 				continue;

 			/* Child has said it is ready to exit, tell it to. */
+			log_debug2("parent: sending exit message to child %ld",
+			    (long) child->pid);
 			memset(&msg, 0, sizeof msg);
 			msg.type = MSG_EXIT;
 			if (privsep_send(child->io, &msg, NULL) != 0)
 				fatalx("privsep_send error");
-
-			/* Wait for the child. */
-			if (waitpid(child->pid, &status, 0) == -1)
-				fatal("waitpid failed");
-			if (WIFSIGNALED(status)) {
-				res = 1;
-				log_debug2("parent: child %ld got signal %d",
-				    (long) child->pid, WTERMSIG(status));
-			} else if (!WIFEXITED(status)) {
-				res = 1;
-				log_debug2("parent: child %ld exited badly",
-				    (long) child->pid);
-			} else {
-				if (WEXITSTATUS(status) != 0)
-					res = 1;
-				log_debug2("parent: child %ld returned %d",
-				    (long) child->pid, WEXITSTATUS(status));
-			}
-
-			io_close(child->io);
-			io_free(child->io);
-			child->io = NULL;
-
-			ARRAY_REMOVE(&children, i);
-			ARRAY_ADD(&dead_children, child);
-
-			/*
-			 * If this child was the parent of any others, kill
-			 * them too.
-			 */
-			for (i = 0; i < ARRAY_LENGTH(&children); i++) {
-				child2 = ARRAY_ITEM(&children, i);
-				if (child2->parent != child)
-					continue;
-
-				log_debug("parent: child %ld died: killing %ld",
-				    (long) child->pid, (long) child2->pid);
-				kill(child2->pid, SIGTERM);
-			}
 		}
+
+		/* Collect any dead children. */
+		if (wait_children(&children, &dead_children) != 0)
+			res = 1;
 	}
 	ARRAY_FREE(&iol);

Index: parent-deliver.c
===================================================================
RCS file: /cvsroot/fdm/fdm/parent-deliver.c,v
retrieving revision 1.10
diff -u -p -r1.10 parent-deliver.c
--- parent-deliver.c	25 Jul 2007 22:05:06 -0000	1.10
+++ parent-deliver.c	24 May 2009 14:56:53 -0000
@@ -57,10 +57,9 @@ parent_deliver(struct child *child, stru

 	/* Check if child is alive and send to it if so. */
 	child = data->child;
-	if (child->io != NULL && kill(child->pid, 0) == 0) {
-		if (privsep_send(child->io, msg, msgbuf) != 0)
-			fatalx("privsep_send error");
-	} else
+	if (child->io != NULL)
+		privsep_send(child->io, msg, msgbuf);
+	else
 		log_debug2("%s: child %ld missing", a->name, (long) child->pid);

 	mail_close(m);
Index: privsep.c
===================================================================
RCS file: /cvsroot/fdm/fdm/privsep.c,v
retrieving revision 1.11
diff -u -p -r1.11 privsep.c
--- privsep.c	25 Jul 2007 22:05:06 -0000	1.11
+++ privsep.c	24 May 2009 14:56:53 -0000
@@ -51,6 +51,8 @@ privsep_check(struct io *io)
 int
 privsep_recv(struct io *io, struct msg *msg, struct msgbuf *msgbuf)
 {
+	char	*tmpbuf;
+
 	if (msgbuf != NULL) {
 		msgbuf->buf = NULL;
 		msgbuf->len = 0;
@@ -64,13 +66,17 @@ privsep_recv(struct io *io, struct msg *
 	if (msg->size == 0)
 		return (0);

-	if (msgbuf == NULL)
-		return (-1);
-	msgbuf->len = msg->size;
-	if (io_wait(io, msgbuf->len, INFTIM, NULL) != 0)
-		return (-1);
-	if ((msgbuf->buf = io_read(io, msgbuf->len)) == NULL)
+	if (io_wait(io, msg->size, INFTIM, NULL) != 0)
 		return (-1);
+	if (msgbuf == NULL) {
+		if ((tmpbuf = io_read(io, msg->size)) == NULL)
+			return (-1);
+		xfree(tmpbuf);
+	} else {
+		if ((msgbuf->buf = io_read(io, msg->size)) == NULL)
+			return (-1);
+		msgbuf->len = msg->size;
+	}

 	return (0);
 }
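
The core of wait_children() in the diff above is a non-blocking reap loop around waitpid() with WNOHANG. A stripped-down sketch of that loop follows; the names reap_children and spawn_exit are hypothetical and the fdm child bookkeeping is left out. As posted, wait_children() returns 0 from both its `case 0` and its ECHILD paths, so the accumulated retcode never seems to reach the caller; the sketch returns the accumulated status on every path instead.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical test helper: fork a child that exits with `code`. */
pid_t
spawn_exit(int code)
{
	pid_t	pid;

	if ((pid = fork()) == 0)
		_exit(code);
	return (pid);
}

/*
 * Hypothetical helper: reap every child that has already exited, without
 * blocking. Adds the number reaped to *reaped. Returns 1 if any reaped
 * child was signalled or returned non-zero, 0 if none did, and -1 on an
 * unexpected waitpid() error.
 */
int
reap_children(int *reaped)
{
	pid_t	pid;
	int	status, failed = 0;

	assert(reaped != NULL);

	for (;;) {
		pid = waitpid(-1, &status, WNOHANG);
		if (pid == 0)
			break;		/* children exist, none has exited yet */
		if (pid == -1) {
			if (errno == EINTR)
				continue;
			if (errno == ECHILD)
				break;	/* no children left at all */
			return (-1);
		}

		(*reaped)++;
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			failed = 1;
	}

	/* Report the accumulated result, whichever way the loop ended. */
	return (failed);
}
```

Because the call never blocks, it can be run once per iteration of a poll loop, which is how the diff uses wait_children() at the end of the main loop.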

#528818#35
Date:
2009-05-25 05:46:26 UTC
From:
To:
Hi Nicholas,

Can you please send it in patch format?
That way I can easily apply it on the Debian package and test it.

Ritesh

#528818#40
Date:
2009-05-25 07:50:40 UTC
From:
To:
I copy/pasted your diff and it fails to apply. Looks like the CVS version is
too different from the latest one in Debian.
I will try to build the CVS version.

Ritesh

#528818#45
Date:
2009-05-25 08:37:47 UTC
From:
To:
Hi Nicholas,

I'm re-running fdm with the patch you provided. Will provide you the logs once
it is triggered again.

Thanks,
Ritesh

#528818#50
Date:
2009-05-25 11:14:48 UTC
From:
To:
Hi Nicholas,

It looks like the bug triggered again, but this time no stale lock file was
left behind. fdm is dying while fetching mails. Looks like a child process
creation problem.


gmail-researchut: cat: out:  =
gmail-researchut: cat: out:
gmail-researchut: cat: out:  if [ "$1" =3D "status" ] ; then
gmail-researchut: cat: out:     # Display a status report.
gmail-researchut: cat: out: -   log "MSG" "Mounts:"
gmail-researchut: cat: out: +   log "STATUS" "Mounts:"
gmail-researchut: cat: out:     mount | sed "s/^/   /"
gmail-researchut: cat: out: -    log "MSG" " "
gmail-researchut: cat: out: -   log "MSG" "Drive power status:"
gmail-researchut: cat: out: +    log "STATUS" " "
gmail-researchut: cat: out: +   log "STATUS" "Drive power status:"
gmail-researchut: cat: out:     for disk in $HD; do
gmail-researchut: cat: out:             if [ -r $disk ]; then
gmail-researchut: cat: out:                     hdparm -C $disk 2>/dev/null |
sed "s/^/   /"
gmail-researchut: cat: out:             else
gmail-researchut: cat: out: -                   log "MSG" "   Cannot read
$disk, permission denied - $0 needs to be run=
gmail-researchut: cat: out:  as root"
gmail-researchut: cat: out: +                   log "STATUS" "   Cannot read
$disk, permission denied - $0 needs to be =
gmail-researchut: cat: out: run as root"
gmail-researchut: cat: out:             fi
gmail-researchut: cat: out:     done
gmail-researchut: cat: out: -    log "MSG" " "
gmail-researchut: cat: out: -   log "MSG" "(NOTE: drive settings affected by
Laptop Mode cannot be retrie=
gmail-researchut: cat: out: ved.)"
gmail-researchut: cat: out: +    log "STATUS" " "
gmail-researchut: cat: out: +   log "STATUS" "(NOTE: drive settings affected by
Laptop Mode cannot be ret=
gmail-researchut: cat: out: rieved.)"
gmail-researchut: cat: out:  =
gmail-researchut: cat: out:
gmail-researchut: cat: out: -    log "MSG" " "
gmail-researchut: cat: out: -   log "MSG" "Readahead states:"
gmail-researchut: cat: out: +    log "STATUS" " "
gmail-researchut: cat: out: +   log "STATUS" "Readahead states:"
gmail-researchut: cat: out:     cat /etc/mtab | while read DEV MP FST OPTS
DUMP PASS ; do
gmail-researchut: cat: out:             # skip funny stuff
gmail-researchut: cat: out:             case "$FST" in =
gmail-researchut: cat: out:
gmail-researchut: cat: out: @@ -381,29 +381,29 @@
gmail-researchut: cat: out:             esac
gmail-researchut: cat: out:             if [ -b $DEV ] ; then
gmail-researchut: cat: out:                     if [ -r $DEV ] ; then
gmail-researchut: cat: out: -                           log "MSG" "   $DEV:
$((`blockdev --getra $DEV` / 2)) kB"
gmail-researchut: cat: out: +                           log "STATUS" "   $DEV:
$((`blockdev --getra $DEV` / 2)) kB"
gmail-researchut: cat: out:                     else
gmail-researchut: cat: out: -                           log "MSG" "   Cannot
read $DEV, permission denied - $0 needs to be run=
gmail-researchut: cat: out:  as root"
gmail-researchut: cat: out: +                           log "STATUS" "
Cannot read $DEV, permission denied - $0 needs to be =
gmail-researchut: cat: out: run as root"
gmail-researchut: cat: out:                     fi
gmail-researchut: cat: out:             fi
gmail-researchut: cat: out:     done
gmail-researchut: cat: out: -    log "MSG" " "
gmail-researchut: cat: out: +    log "STATUS" " "
gmail-researchut: cat: out:     if [ -e /var/run/laptop-mode-tools/enabled ] ;
then
gmail-researchut: cat: out: -           log "MSG" "Laptop Mode Tools is
allowed to run: /var/run/laptop-mode-too=
gmail-researchut: cat: out: ls/enabled exists."
gmail-researchut: cat: out: +           log "STATUS" "Laptop Mode Tools is
allowed to run: /var/run/laptop-mode-=
gmail-researchut: cat: out: tools/enabled exists."
gmail-researchut: cat: out:     else
gmail-researchut: cat: out: -           log "ERR" "Laptop Mode Tools is NOT
allowed to run: /var/run/laptop-mode=
gmail-researchut: cat: out: -tools/enabled does not exist."
gmail-researchut: cat: out: +           log "STATUS" "Laptop Mode Tools is NOT
allowed to run: /var/run/laptop-m=
gmail-researchut: cat: out: ode-tools/enabled does not exist."
gmail-researchut: cat: out:     fi
gmail-researchut: cat: out: -    log "MSG" " "
gmail-researchut: cat: out: +    log "STATUS" " "
gmail-researchut: cat: out:     STATFILES=3D"/proc/sys/vm/laptop_mode
/proc/apm /proc/pmu/info /proc/sys/=
gmail-researchut: cat: out: vm/bdflush /proc/sys/vm/dirty_ratio
/proc/sys/fs/xfs/age_buffer /proc/sys/f=
gmail-researchut: cat: out: s/xfs/sync_interval /proc/sys/fs/xfs/lm_age_buffer
/proc/sys/fs/xfs/lm_sync=
gmail-researchut: cat: out: _interval /proc/sys/vm/pagebuf/lm_flush_age
/proc/sys/fs/xfs/xfsbufd_centis=
gmail-researchut: cat: out: ecs /proc/sys/fs/xfs/xfssyncd_centisecs
/proc/sys/vm/dirty_background_ratio=
gmail-researchut: cat: out:  /proc/sys/vm/dirty_expire_centisecs
/proc/sys/fs/xfs/age_buffer/centisecs =
gmail-researchut: cat: out: /proc/sys/vm/dirty_writeback_centisecs
/sys/devices/system/cpu/*/cpufreq/cp=
gmail-researchut: cat: out: uinfo_*_freq
/sys/devices/system/cpu/*/cpufreq/scaling_governor /proc/acpi/=
gmail-researchut: cat: out: button/lid/*/state /proc/acpi/ac_adapter/*/state
/proc/acpi/battery/*/state=
gmail-researchut: cat: out:  /sys/class/power_supply/*/online
/sys/class/power_supply/*/state"
gmail-researchut: cat: out:     for THISFILE in $STATFILES ; do
gmail-researchut: cat: out:             if [ -e "$THISFILE" ] ; then
gmail-researchut: cat: out: -                   log "MSG" "$THISFILE:"
gmail-researchut: cat: out: +                   log "STATUS" "$THISFILE:"
gmail-researchut: cat: out:                     if [ -r "$THISFILE" ] ; then
gmail-researchut: cat: out:                             cat "$THISFILE" | sed
"s/^/   /"
gmail-researchut: cat: out:                     else
gmail-researchut: cat: out: -                           log "ERR" "   Not
accessible, permission denied - $0 needs to be run a=
gmail-researchut: cat: out: s root."
gmail-researchut: cat: out: +                           log "STATUS" "   Not
accessible, permission denied - $0 needs to be ru=
gmail-researchut: cat: out: n as root."
gmail-researchut: cat: out:                     fi
gmail-researchut: cat: out: -            log "MSG" " "
gmail-researchut: cat: out: +            log "STATUS" " "
gmail-researchut: cat: out:             fi
gmail-researchut: cat: out:     done
gmail-researchut: cat: out:  =
gmail-researchut: cat: out:
gmail-researchut: cat: out:
gmail-researchut: cat: out:
gmail-researchut: cat: out:
gmail-researchut: cat: out: --
gmail-researchut: cat: out: Debian packaging
gmail-researchut: cat: out: https://code.launchpad.net/~laptop-mode-tools-
dev/laptop-mode-tools/debian
gmail-researchut: cat: out:
gmail-researchut: cat: out: You are subscribed to branch lp:~laptop-mode-
tools-dev/laptop-mode-tools/de=
gmail-researchut: cat: out: bian.
gmail-researchut: cat: out: To unsubscribe from this branch go to
https://code.launchpad.net/~laptop-mo=
gmail-researchut: cat: out: de-tools-dev/laptop-mode-tools/debian/+edit-
subscription.
gmail-researchut: cat: waitpid: No child processes
parent: deliver child 14043 started (uid 1000)
parent: waiting for children
parent: 2 children, 8 dead children
parent: got message type 2, id 0 from child 14043
gmail-researchut: got message type 2, id 9
gmail-researchut: trying (deliver) message 9
gmail-researchut: fetching error. aborted
gmail-researchut: 8 messages processed (0 kept) in 16.424 seconds (average 2.053)
gmail-researchut: finished processing. exiting
gmail-researchut: sending exit message to parent
gmail-researchut: waiting for exit message from parent
parent: sending exit message to child 14043
parent: waiting for children
parent: 2 children, 8 dead children
parent: got message type 1, id 0 from child 14034
parent: sending exit message to child 14034
parent: waiting for children
parent: child 14034 returned 1
parent: child 14034 died: killing 14043
parent: waiting for children
parent: 1 children, 9 dead children
parent: waiting for children
parent: child 14043 returned 1
parent: waiting for children
parent: finished, total time 16.427 seconds


Thanks,
Ritesh

#528818#55
Date:
2009-05-26 06:10:05 UTC
From:
To:
Okay, I have committed some parts of the diff and made a couple of other tweaks
to fix things which were sometimes causing it to hang. Please test the diff
below instead.


Index: fdm.c
===================================================================
RCS file: /cvsroot/fdm/fdm/fdm.c,v
retrieving revision 1.183
diff -u -p -r1.183 fdm.c
--- fdm.c	25 May 2009 21:47:23 -0000	1.183
+++ fdm.c	26 May 2009 06:08:03 -0000
@@ -44,9 +44,11 @@ const char		*malloc_options = "AFGJPRX";

 void			 sighandler(int);
 struct child		*check_children(struct children *, u_int *);
+int			 wait_children(struct children *, struct children *);

 struct conf		 conf;

+volatile sig_atomic_t	 sigchld;
 volatile sig_atomic_t	 sigusr1;
 volatile sig_atomic_t	 sigint;
 volatile sig_atomic_t	 sigterm;
@@ -67,6 +69,9 @@ sighandler(int sig)
 	case SIGTERM:
 		sigterm = 1;
 		break;
+	case SIGCHLD:
+		sigchld = 1;
+		break;
 	}
 }

@@ -181,6 +186,79 @@ check_children(struct children *children
 	return (NULL);
 }

+/* Wait for a child and deal with its exit. */
+int
+wait_children(struct children *children, struct children *dead_children)
+{
+	struct child	*child, *child2;
+	pid_t		 pid;
+	int		 status, retcode = 0;
+	u_int		 i, j;
+
+	for (;;) {
+		log_debug3("parent: waiting for children");
+		/* Wait for a child. */
+		switch (pid = waitpid(WAIT_ANY, &status, WNOHANG)) {
+		case 0:
+			return (0);
+		case -1:
+			if (errno == ECHILD)
+				return (0);
+			fatal("waitpid failed");
+		}
+
+		/* Handle the exit status. */
+		if (WIFSIGNALED(status)) {
+			retcode = 1;
+			log_debug2("parent: child %ld got signal %d",
+			    (long) pid, WTERMSIG(status));
+		} else if (!WIFEXITED(status)) {
+			retcode = 1;
+			log_debug2("parent: child %ld exited badly",
+			    (long) pid);
+		} else {
+			if (WEXITSTATUS(status) != 0)
+				retcode = 1;
+			log_debug2("parent: child %ld returned %d",
+			    (long) pid, WEXITSTATUS(status));
+		}
+
+		/* Find this child. */
+		child = NULL;
+		for (i = 0; i < ARRAY_LENGTH(children); i++) {
+			child = ARRAY_ITEM(children, i);
+			if (pid == child->pid)
+				break;
+		}
+		if (i == ARRAY_LENGTH(children)) {
+			log_debug2("parent: unidentified child %ld",
+			    (long) pid);
+			continue;
+		}
+
+		if (child->io != NULL) {
+			io_close(child->io);
+			io_free(child->io);
+			child->io = NULL;
+		}
+		ARRAY_REMOVE(children, i);
+		ARRAY_ADD(dead_children, child);
+
+		/* If this child was the parent of any others, kill them too. */
+		for (j = 0; j < ARRAY_LENGTH(children); j++) {
+			child2 = ARRAY_ITEM(children, j);
+			if (child2->parent != child)
+				continue;
+
+			log_debug2("parent: child %ld died: killing %ld",
+			    (long) child->pid, (long) child2->pid);
+			kill(child2->pid, SIGTERM);
+		}
+	}
+
+	return (retcode);
+}
+
 __dead void
 usage(void)
 {
@@ -198,7 +276,6 @@ main(int argc, char **argv)
 	enum fdmop       op = FDMOP_NONE;
 	const char	*proxy = NULL, *s;
 	char		 tmp[BUFSIZ], *ptr, *lock = NULL, *user, *home = NULL;
-	long		 n;
 	struct utsname	 un;
 	struct passwd	*pw;
 	struct stat	 sb;
@@ -207,8 +284,8 @@ main(int argc, char **argv)
 	TAILQ_HEAD(, account) actaq; /* active accounts */
 	pid_t		 pid;
 	struct children	 children, dead_children;
-	struct child	*child, *child2;
-	struct io       *rio;
+	struct child	*child;
+	struct io       *dead_io;
 	struct iolist	 iol;
 	double		 tim;
 	struct sigaction act;
@@ -218,6 +295,7 @@ main(int argc, char **argv)
 	struct strings	 macros;
 	struct child_fetch_data *cfd;
 	struct userdata *ud;
+	sigset_t	 sigset;
 #ifdef DEBUG
 	struct rule	*r;
 	struct action	*t;
@@ -676,21 +754,29 @@ main(int argc, char **argv)
 			    (long) child->pid, a->name);
 		}

-		/* Fill the io list. */
+		/* Check children and fill the io list. */
 		ARRAY_CLEAR(&iol);
 		for (i = 0; i < ARRAY_LENGTH(&children); i++) {
 			child = ARRAY_ITEM(&children, i);
-			ARRAY_ADD(&iol, child->io);
+			if (child->io != NULL)
+				ARRAY_ADD(&iol, child->io);
 		}

 		/* Poll the io list. */
-		n = io_polln(
-		    ARRAY_DATA(&iol), ARRAY_LENGTH(&iol), &rio, INFTIM, NULL);
-		switch (n) {
-		case -1:
-			fatalx("child socket error");
-		case 0:
-			fatalx("child socket closed");
+		if (ARRAY_LENGTH(&iol) != 0) {
+			switch (io_polln(ARRAY_DATA(&iol), ARRAY_LENGTH(&iol),
+			    &dead_io, INFTIM, NULL)) {
+			case -1:
+			case 0:
+				break;
+			default:
+				dead_io = NULL;
+				break;
+			}
+		} else {
+			/* break? */
+			sigemptyset(&sigset);
+			sigsuspend(&sigset);
 		}

 		/* Check all children for pending privsep messages. */
@@ -704,48 +790,32 @@ main(int argc, char **argv)
 				continue;

 			/* Child has said it is ready to exit, tell it to. */
+			log_debug2("parent: sending exit message to child %ld",
+			    (long) child->pid);
 			memset(&msg, 0, sizeof msg);
 			msg.type = MSG_EXIT;
 			if (privsep_send(child->io, &msg, NULL) != 0)
 				fatalx("privsep_send error");
+		}

-			/* Wait for the child. */
-			if (waitpid(child->pid, &status, 0) == -1)
-				fatal("waitpid failed");
-			if (WIFSIGNALED(status)) {
-				res = 1;
-				log_debug2("parent: child %ld got signal %d",
-				    (long) child->pid, WTERMSIG(status));
-			} else if (!WIFEXITED(status)) {
-				res = 1;
-				log_debug2("parent: child %ld exited badly",
-				    (long) child->pid);
-			} else {
-				if (WEXITSTATUS(status) != 0)
-					res = 1;
-				log_debug2("parent: child %ld returned %d",
-				    (long) child->pid, WEXITSTATUS(status));
-			}
-
-			io_close(child->io);
-			io_free(child->io);
-			child->io = NULL;
-
-			ARRAY_REMOVE(&children, i);
-			ARRAY_ADD(&dead_children, child);
+		/* Collect any dead children. */
+		if (sigchld && wait_children(&children, &dead_children) != 0)
+			res = 1;
+		sigchld = 0;

-			/*
-			 * If this child was the parent of any others, kill
-			 * them too.
-			 */
+		/* Close dead buffers (no more data coming now). */
+		if (dead_io != NULL) {
 			for (i = 0; i < ARRAY_LENGTH(&children); i++) {
-				child2 = ARRAY_ITEM(&children, i);
-				if (child2->parent != child)
+				child = ARRAY_ITEM(&children, i);
+				if (dead_io != child->io)
 					continue;
-
-				log_debug("parent: child %ld died: killing %ld",
-				    (long) child->pid, (long) child2->pid);
-				kill(child2->pid, SIGTERM);
+				log_debug2("parent: child %ld socket error",
+				    (long) child->pid);
+				kill(child->pid, SIGTERM);
+
+				io_close(child->io);
+				io_free(child->io);
+				child->io = NULL;
 			}
 		}
 	}
Index: child.c
===================================================================
RCS file: /cvsroot/fdm/fdm/child.c,v
retrieving revision 1.149
diff -u -p -r1.149 child.c
--- child.c	26 May 2009 06:05:00 -0000	1.149
+++ child.c	26 May 2009 06:08:03 -0000
@@ -118,8 +118,10 @@ child_start(struct children *children, u
 	if ((child->pid = child_fork()) == 0) {
 		for (i = 0; i < ARRAY_LENGTH(children); i++) {
 			childp = ARRAY_ITEM(children, i);
-			io_close(childp->io);
-			io_free(childp->io);
+			if (childp->io != NULL) {
+				io_close(childp->io);
+				io_free(childp->io);
+			}
 		}
 		io_close(child->io);
 		io_free(child->io);

#528818#60
Date:
2009-05-26 06:24:26 UTC
From:
To:
Index: fdm.c
===================================================================
RCS file: /cvsroot/fdm/fdm/fdm.c,v
retrieving revision 1.183
diff -u -p -r1.183 fdm.c
--- fdm.c	25 May 2009 21:47:23 -0000	1.183
+++ fdm.c	26 May 2009 06:23:30 -0000
@@ -44,9 +44,12 @@ const char		*malloc_options = "AFGJPRX";

 void			 sighandler(int);
 struct child		*check_children(struct children *, u_int *);
+int			 wait_children(
+			     struct children *, struct children *, int);

 struct conf		 conf;

+volatile sig_atomic_t	 sigchld;
 volatile sig_atomic_t	 sigusr1;
 volatile sig_atomic_t	 sigint;
 volatile sig_atomic_t	 sigterm;
@@ -67,6 +70,9 @@ sighandler(int sig)
 	case SIGTERM:
 		sigterm = 1;
 		break;
+	case SIGCHLD:
+		sigchld = 1;
+		break;
 	}
 }

@@ -181,6 +187,81 @@ check_children(struct children *children
 	return (NULL);
 }

+/* Wait for a child and deal with its exit. */
+int
+wait_children(
+    struct children *children, struct children *dead_children, int no_hang)
+{
+	struct child	*child, *child2;
+	pid_t		 pid;
+	int		 status, flags, retcode = 0;
+	u_int		 i, j;
+
+	flags = no_hang ? WNOHANG : 0;
+	for (;;) {
+		log_debug3("parent: waiting for children");
+		/* Wait for a child. */
+		switch (pid = waitpid(WAIT_ANY, &status, flags)) {
+		case 0:
+			return (0);
+		case -1:
+			if (errno == ECHILD)
+				return (0);
+			fatal("waitpid failed");
+		}
+
+		/* Handle the exit status. */
+		if (WIFSIGNALED(status)) {
+			retcode = 1;
+			log_debug2("parent: child %ld got signal %d",
+			    (long) pid, WTERMSIG(status));
+		} else if (!WIFEXITED(status)) {
+			retcode = 1;
+			log_debug2("parent: child %ld exited badly",
+			    (long) pid);
+		} else {
+			if (WEXITSTATUS(status) != 0)
+				retcode = 1;
+			log_debug2("parent: child %ld returned %d",
+			    (long) pid, WEXITSTATUS(status));
+		}
+
+		/* Find this child. */
+		child = NULL;
+		for (i = 0; i < ARRAY_LENGTH(children); i++) {
+			child = ARRAY_ITEM(children, i);
+			if (pid == child->pid)
+				break;
+		}
+		if (i == ARRAY_LENGTH(children)) {
+			log_debug2("parent: unidentified child %ld",
+			    (long) pid);
+			continue;
+		}
+
+		if (child->io != NULL) {
+			io_close(child->io);
+			io_free(child->io);
+			child->io = NULL;
+		}
+		ARRAY_REMOVE(children, i);
+		ARRAY_ADD(dead_children, child);
+
+		/* If this child was the parent of any others, kill them too. */
+		for (j = 0; j < ARRAY_LENGTH(children); j++) {
+			child2 = ARRAY_ITEM(children, j);
+			if (child2->parent != child)
+				continue;
+
+			log_debug2("parent: child %ld died: killing %ld",
+			    (long) child->pid, (long) child2->pid);
+			kill(child2->pid, SIGTERM);
+		}
+	}
+
+	return (retcode);
+}
+
 __dead void
 usage(void)
 {
@@ -198,7 +279,6 @@ main(int argc, char **argv)
 	enum fdmop       op = FDMOP_NONE;
 	const char	*proxy = NULL, *s;
 	char		 tmp[BUFSIZ], *ptr, *lock = NULL, *user, *home = NULL;
-	long		 n;
 	struct utsname	 un;
 	struct passwd	*pw;
 	struct stat	 sb;
@@ -207,8 +287,8 @@ main(int argc, char **argv)
 	TAILQ_HEAD(, account) actaq; /* active accounts */
 	pid_t		 pid;
 	struct children	 children, dead_children;
-	struct child	*child, *child2;
-	struct io       *rio;
+	struct child	*child;
+	struct io       *dead_io;
 	struct iolist	 iol;
 	double		 tim;
 	struct sigaction act;
@@ -676,21 +756,29 @@ main(int argc, char **argv)
 			    (long) child->pid, a->name);
 		}

-		/* Fill the io list. */
+		/* Check children and fill the io list. */
 		ARRAY_CLEAR(&iol);
 		for (i = 0; i < ARRAY_LENGTH(&children); i++) {
 			child = ARRAY_ITEM(&children, i);
-			ARRAY_ADD(&iol, child->io);
+			if (child->io != NULL)
+				ARRAY_ADD(&iol, child->io);
 		}

 		/* Poll the io list. */
-		n = io_polln(
-		    ARRAY_DATA(&iol), ARRAY_LENGTH(&iol), &rio, INFTIM, NULL);
-		switch (n) {
-		case -1:
-			fatalx("child socket error");
-		case 0:
-			fatalx("child socket closed");
+		if (ARRAY_LENGTH(&iol) != 0) {
+			switch (io_polln(ARRAY_DATA(&iol), ARRAY_LENGTH(&iol),
+			    &dead_io, INFTIM, NULL)) {
+			case -1:
+			case 0:
+				break;
+			default:
+				dead_io = NULL;
+				break;
+			}
+		} else {
+			/* No more children. Sleep until all are waited. */
+			if (wait_children(&children, &dead_children, 0) != 0)
+				res = 1;
 		}

 		/* Check all children for pending privsep messages. */
@@ -704,48 +792,32 @@ main(int argc, char **argv)
 				continue;

 			/* Child has said it is ready to exit, tell it to. */
+			log_debug2("parent: sending exit message to child %ld",
+			    (long) child->pid);
 			memset(&msg, 0, sizeof msg);
 			msg.type = MSG_EXIT;
 			if (privsep_send(child->io, &msg, NULL) != 0)
 				fatalx("privsep_send error");
+		}

-			/* Wait for the child. */
-			if (waitpid(child->pid, &status, 0) == -1)
-				fatal("waitpid failed");
-			if (WIFSIGNALED(status)) {
-				res = 1;
-				log_debug2("parent: child %ld got signal %d",
-				    (long) child->pid, WTERMSIG(status));
-			} else if (!WIFEXITED(status)) {
-				res = 1;
-				log_debug2("parent: child %ld exited badly",
-				    (long) child->pid);
-			} else {
-				if (WEXITSTATUS(status) != 0)
-					res = 1;
-				log_debug2("parent: child %ld returned %d",
-				    (long) child->pid, WEXITSTATUS(status));
-			}
-
-			io_close(child->io);
-			io_free(child->io);
-			child->io = NULL;
-
-			ARRAY_REMOVE(&children, i);
-			ARRAY_ADD(&dead_children, child);
+		/* Collect any dead children. */
+		if (sigchld && wait_children(&children, &dead_children, 1) != 0)
+			res = 1;
+		sigchld = 0;

-			/*
-			 * If this child was the parent of any others, kill
-			 * them too.
-			 */
+		/* Close dead buffers (no more data coming now). */
+		if (dead_io != NULL) {
 			for (i = 0; i < ARRAY_LENGTH(&children); i++) {
-				child2 = ARRAY_ITEM(&children, i);
-				if (child2->parent != child)
+				child = ARRAY_ITEM(&children, i);
+				if (dead_io != child->io)
 					continue;
-
-				log_debug("parent: child %ld died: killing %ld",
-				    (long) child->pid, (long) child2->pid);
-				kill(child2->pid, SIGTERM);
+				log_debug2("parent: child %ld socket error",
+				    (long) child->pid);
+				kill(child->pid, SIGTERM);
+
+				io_close(child->io);
+				io_free(child->io);
+				child->io = NULL;
 			}
 		}
 	}
Index: child.c
===================================================================
RCS file: /cvsroot/fdm/fdm/child.c,v
retrieving revision 1.149
diff -u -p -r1.149 child.c
--- child.c	26 May 2009 06:05:00 -0000	1.149
+++ child.c	26 May 2009 06:23:30 -0000
@@ -118,8 +118,10 @@ child_start(struct children *children, u
 	if ((child->pid = child_fork()) == 0) {
 		for (i = 0; i < ARRAY_LENGTH(children); i++) {
 			childp = ARRAY_ITEM(children, i);
-			io_close(childp->io);
-			io_free(childp->io);
+			if (childp->io != NULL) {
+				io_close(childp->io);
+				io_free(childp->io);
+			}
 		}
 		io_close(child->io);
 		io_free(child->io);

#528818#65
Date:
2009-05-26 12:18:20 UTC
From:
To:
I've patched and built the test package. It is currently running. I will update
you with the results once it triggers.


PS: I removed the "cat" related rules from my fdm.conf file.


Ritesh

#528818#70
Date:
2011-07-26 13:59:35 UTC
From:
To:
I didn't want to leave this unattended. I still use fdm, but I have
completely lost track of this bug. I did see a couple of locking issues
once in a while, but I just removed the lock file manually and re-ran
fdm. I will follow up on this bug report if I see the problem
again.

IIRC, part of the problem is also that this bug is not easily
reproducible.