wait-for-child-process broken for long-running child processes

Discussion:

Roderic Morris

2011-04-18 22:58:37 UTC

I've come across a bug in wait-for-child-process in the
posix-processes package. If the process with the given pid hasn't died
and is long-running, wait-for-child-process starts to allocate a
ridiculous amount of memory. I've had it make pretty powerful machines
unusable.

I looked into it and traced the problem to the C function
posix_waitpid() in c/posix/proc.c. It fails to handle the case where
waitpid() returns 0 (which means that there are children running, but
no statuses are available for them). In the best case, this causes it
to loop until the child process dies, pegging the CPU. Unfortunately,
there's a space leak somewhere inside the loop, so the problem is even
worse and manifests itself in the way I described.
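
For illustration, here is a minimal C sketch of handling the three possible outcomes of a non-blocking waitpid() call. It is not the code from c/posix/proc.c and not the attached patch; the names try_reap and spawn_sleeper are hypothetical, and the sketch assumes a specific child pid greater than 0.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from c/posix/proc.c).
   Polls child `pid` once without blocking.
   Returns  1 if the child was reaped (*status is filled in),
            0 if the child is still running (no status available yet),
           -1 on error, with errno set (e.g. ECHILD). */
int try_reap(pid_t pid, int *status)
{
    for (;;) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 1;
        if (r == 0)
            return 0;              /* still running: report it, do not spin */
        if (r == -1 && errno == EINTR)
            continue;              /* interrupted by a signal: retry the call */
        return -1;
    }
}

/* Hypothetical demo helper: fork a child that sleeps for `seconds`
   and then exits with `exit_code`.  Returns the child's pid in the
   parent, or -1 if fork() failed. */
pid_t spawn_sleeper(unsigned seconds, int exit_code)
{
    pid_t pid = fork();
    if (pid == 0) {
        sleep(seconds);
        _exit(exit_code);
    }
    return pid;
}
```

The point of the sketch is only that a return value of 0 is treated as its own outcome ("not finished yet") and handed back to the caller, rather than being fed back into a tight retry loop.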

One of the patches I've attached fixes that problem (although it
doesn't address the space leak), but it uncovers two others. First,
process-terminated-children is broken in the case where it's
not given an argument but finds a process which is being waited
on. Second, wait-for-child-process will never return in the
long-running child case unless some other code has called
make-signal-queue with sigchld as an argument; os-signal-handler isn't
called for sigchld unless that happens. I've attached another patch
for the first problem, but I'm not sure how to approach the second.

P.S. Is there a way to disable deadlock detection other than the

    (spawn (lambda ()
             ; Sleep for a year
             (sleep (* 1000 60 60 24 365))))

hack from the manual? I've never had it be helpful, and it's
especially annoying when doing any work with subprocesses.

-Roderic

Taylor R Campbell

2011-04-19 00:00:26 UTC

Date: Mon, 18 Apr 2011 18:58:37 -0400
From: Roderic Morris <***@gmail.com>

> I've come across a bug in wait-for-child-process in the
> posix-processes package.

This is not surprising... Handling Unix subprocesses and signals is
amazingly complicated -- even without trying to handle job control.

> Unfortunately, there's a space leak somewhere inside the loop, so
> the problem is even worse and manifests itself in the way i
> described.

Presumably this is because it makes lots of `local references' in the
foreign call, which don't get discarded until the call is done.
Perhaps an easy way around this would be to move the loop into Scheme.

> Second, wait-for-child-process will never return in the long
> running child case, unless some other code has called
> make-signal-queue with sigchld as an argument. os-signal-handler
> isn't called for sigchld unless that happens.

Looks like INITIALIZE-SIGNALS in scheme/posix/signal.scm is missing
(maybe-request-os-signal! (signal SIGCHLD)).

> P.S. Is there a way to disable deadlock detection other than the
>     (spawn (lambda ()
>              ; Sleep for a year
>              (sleep (* 1000 60 60 24 365))))
> hack from the manual? I've never had it be helpful, and it's
> especially annoying when doing any work with subprocesses.

If the deadlock detection thinks waiting for a signal is a deadlock,
that's a bug. Unfortunately, PLACEHOLDER-VALUE and PIPE-READ! ill
express that you're waiting for a signal. Perhaps the scheduler needs
a hook by which threads can say `I'm waiting for a signal!'.

Marcus Crestani

2011-04-19 06:16:12 UTC

TRC> Presumably this is because it makes lots of `local references' in the
TRC> foreign call, which don't get discarded until the call is done.
TRC> Perhaps an easy way around this would be to move the loop into Scheme.

Another solution would be to release local references manually with
`s48_free_local_ref' when they are no longer needed (and the external
call does not return for a long time or at all).

--
Marcus

Roderic Morris

2011-05-03 21:13:36 UTC

Any hope of these fixes being pushed?

-Roderic

Michael Sperber

2011-05-04 06:14:00 UTC

Post by Roderic Morris
Any hope of these fixes being pushed?

Pushed now. Sorry about the delay: The e-mail has been sitting in my
inbox, but I've been so swamped I hadn't even had a chance to look
inside.

--
Cheers =8-} Mike
Peace, international understanding, and all that blah blah