Post by Michael Sperber
Thanks for looking into this!
Post by Robert Ransom
See attached for a bundle which fixes wait-for-child-process. I'm no
longer convinced that changeset 0306c5a64775 was wrong, but I think my
bundle uses a cleaner approach, and it works with the current
external-event and interrupt systems.
I'm somewhat suspicious of the changes in the bundle, as it's not clear
After all, there is a special provision to *not* signal deadlock in the
root scheduler - it calls `waiting-for-external-events?', and if that
returns #t, no deadlock is assumed. (And I still think the right fix is
to handle wait the same as getaddrinfo.)
You've looked at the code in depth - did you consider this?
Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event. My branch provides a way to inform
waiting-for-external-events? that threads blocked on a given
synchronization object will be alerted when an external event of some
sort occurs; this should be used for signal queues, too.
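
To make the idea concrete, here is a minimal sketch in Python rather than Scheme. It is not the code in the bundle: `Placeholder`, `Scheduler`, `register_external` and the other names are hypothetical stand-ins. It only illustrates a deadlock check that consults a registry of synchronization objects known to be completed by an external event.

```python
class Placeholder:
    """Hypothetical stand-in for a synchronization object threads block on."""
    def __init__(self, name):
        self.name = name


class Scheduler:
    """Toy root scheduler; all names here are assumptions for illustration."""
    def __init__(self):
        self.runnable = []             # threads that are ready to run
        self.blocked_on = {}           # thread name -> object it is blocked on
        self.external_objects = set()  # objects an external event will complete

    def register_external(self, obj):
        # Declare that threads blocked on obj will be woken by an external event.
        self.external_objects.add(obj)

    def unregister_external(self, obj):
        self.external_objects.discard(obj)

    def block(self, thread, obj):
        self.blocked_on[thread] = obj

    def waiting_for_external_events(self):
        # True if at least one blocked thread waits on a registered object.
        return any(obj in self.external_objects
                   for obj in self.blocked_on.values())

    def deadlocked(self):
        # Deadlock: nothing runnable, something blocked, and no external
        # event can be expected to wake any of the blocked threads.
        return (not self.runnable
                and bool(self.blocked_on)
                and not self.waiting_for_external_events())
```

In this sketch a thread blocked on an unregistered placeholder is reported as deadlocked, and registering the placeholder suppresses the report. The same registration call could in principle cover signal queues.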
Regarding getaddrinfo, I think the Right Thing is to make the Scheme
interface to getaddrinfo fill in a placeholder, rather than using the
external event system directly as it does now. Ideally, the external
asynchronous result code would be written in such a way that both
getaddrinfo and wait-for-child-process could use it.
Post by Michael Sperber
Post by Robert Ransom
When I wrote that, I suspected that part of the problem with
wait-for-child-process was that it dynamically allocated a new
external-event UID for each event it wants to wait for, rather than
using one external-event UID for a whole class of events to be waited
for. The PreScheme code to handle external events seems to me to be
designed for the latter case. [...]
getaddrinfo also uses a new UID for each event, rather than one UID
for the whole class of getaddrinfo-completion events; I still suspect
that that is why multiple simultaneous calls to wait-for-child-process
broke a later call to getaddrinfo.
It was designed for both cases, but in fact the getaddrinfo code was the
original motivation for implementing it. It's needed there because
there may be multiple simultaneous active calls to getaddrinfo from
different threads, and they all need to be notified separately of
completion.
So, if that doesn't work correctly, there's a bug.
There probably is a bug. It looks like the same bug in how interrupts
are handled (when multiple interrupts of the same type arrive at about
the same time, some seem to get 'lost' or delayed) that causes my test
case for the POSIX signals package to fail. (That test case has been
disabled since I wrote it because I couldn't figure out why it wasn't
working or how to fix it.)
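
As an illustration only, and not a diagnosis of the VM, here is one way events that arrive close together can get lost: if pending events are recorded in a single slot, a second event overwrites the first before it is delivered, whereas a queue keeps every one. The Python sketch below uses hypothetical names (`SingleSlot`, `PendingQueue`) and does not model the actual PreScheme code.

```python
from collections import deque


class SingleSlot:
    """Records only the most recent undelivered event UID (hypothetical)."""
    def __init__(self):
        self.slot = None

    def post(self, uid):
        # Overwrites any UID that has not been delivered yet.
        self.slot = uid

    def drain(self):
        if self.slot is None:
            return []
        uid, self.slot = self.slot, None
        return [uid]


class PendingQueue:
    """Records every undelivered event UID in arrival order (hypothetical)."""
    def __init__(self):
        self.queue = deque()

    def post(self, uid):
        self.queue.append(uid)

    def drain(self):
        delivered = list(self.queue)
        self.queue.clear()
        return delivered


def deliver_burst(pending, uids):
    # Post several events before the handler gets a chance to run,
    # then report which UIDs the handler actually sees.
    for uid in uids:
        pending.post(uid)
    return pending.drain()
```

With a burst of three UIDs, the single-slot version delivers only the last one, while the queue delivers all three. A test case that counts deliveries would fail in the first case much as described above.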
Someone may have to put 'bread crumbs' into the VM and RTS in order to
figure out exactly what is going wrong.
Robert Ransom