Discussion:
Changeset 0306c5a64775
Robert Ransom
2011-12-31 03:28:10 UTC
Permalink
What is changeset 0306c5a64775 intended to do?


Robert Ransom
Roderic Morris
2011-12-31 04:13:16 UTC
Permalink
If I'm looking at the right revision, it's to allow the use of wait()
without triggering deadlock detection

-Roderic
Post by Robert Ransom
What is changeset 0306c5a64775 intended to do?
Robert Ransom
Robert Ransom
2011-12-31 11:01:46 UTC
Permalink
Post by Roderic Morris
If I'm looking at the right revision, it's to allow the use of wait()
without triggering deadlock detection
It might do that, but it breaks get-address-info from the
net-addresses package (which (ab)uses external-event UIDs in the same
way). See attached for a script that hangs S48. (I was expecting the
call to get-address-info to dislodge another of the debug-message
calls, but it just wedged instead.)


Robert Ransom
Robert Ransom
2011-12-31 11:06:31 UTC
Permalink
Post by Robert Ransom
See attached for a script that hangs S48.
Oops. This time it's really attached.


Robert Ransom
Roderic Morris
2011-12-31 15:32:42 UTC
Permalink
Right, I noted that and had a similar example when I submitted the patch.

The external-events package doesn't work as advertised, but I couldn't
figure out how to fix it. It's really the only mechanism that can be
used to do both library calls correctly though.

-Roderic
Post by Robert Ransom
Post by Roderic Morris
If I'm looking at the right revision, it's to allow the use of wait()
without triggering deadlock detection
It might do that, but it breaks get-address-info from the
net-addresses package (which (ab)uses external-event UIDs in the same
way).  See attached for a script that hangs S48.  (I was expecting the
call to get-address-info to dislodge another of the debug-message
calls, but it just wedged instead.)
Robert Ransom
Robert Ransom
2012-01-01 21:41:41 UTC
Permalink
Post by Roderic Morris
Right, I noted that and had a similar example when I submitted the patch.
The external-events package doesn't work as advertised, but I couldn't
figure out how to fix it. It's really the only mechanism that can be
used to do both library calls correctly though.
See attached for a patch that fixes one clear bug in the
external-events package. That doesn't fix the problem with
wait-for-child-process.

In order to fix the general problem of incorrect deadlock detection, I
think we need:

1. a way to mark thread queues as ‘blocking on an external event’ for
deadlock-detection purposes
2. a way to mark condvars and placeholders as ‘blocking on an external
event’ (getaddrinfo and wait-for-child-process should be using
placeholders; the posix-signals package should be using condvars)
3. support for registering long-term handlers to handle every
occurrence of an external event UID
4. an as-general-as-possible ‘external asynchronous result’ system to
handle placeholders which should be filled in from C

For piece 1 (and 2), we could try iterating through a list of weak
references to thread queues to determine whether any threads are
blocked on one of the thread queues, but we might need to optimize
that data structure if many of those thread queues turn out to not
have any threads on them.


Robert Ransom
Michael Sperber
2012-06-12 18:32:57 UTC
Permalink
Post by Robert Ransom
See attached for a patch that fixes one clear bug in the
external-events package. That doesn't fix the problem with
wait-for-child-process.
In order to fix the general problem of incorrect deadlock detection, I
1. a way to mark thread queues as ‘blocking on an external event’ for
deadlock-detection purposes
2. a way to mark condvars and placeholders as ‘blocking on an external
event’ (getaddrinfo and wait-for-child-process should be using
placeholders; the posix-signals package should be using condvars)
3. support for registering long-term handlers to handle every
occurrence of an external event UID
4. an as-general-as-possible ‘external asynchronous result’ system to
handle placeholders which should be filled in from C
So I finally fixed (I hope) these issues - it took a whole bunch of
changes, along the lines of what Robert suggested:

- Threads can, when blocking on an external event or something similar,
explicitly say that the blocking does not contribute to deadlock.

- `wait-for-child-process' can now use a very simple implementation
using placeholders.

- The race with external-event uids is fixed by allocating the uids in
Scheme rather than in C. It should now be possible to build a
higher-level abstraction for this with placeholders, but I haven't
done that yet.

While the tests all succeed, and I've mulled over the changes quite a
while, are quite sensitive: Review and further testing much appreciated!
(Especially as this issue was the only one blocking the release.)

Thanks to Robert and Roderic for analyzing the problems and for the
patches - extremely helpful!
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Roderic Morris
2012-06-14 12:53:19 UTC
Permalink
Great! Related tests pass in scsh as well. Code looks good as well from what I've been able to check. I'm glad to see the ball moving on this, and I'm trying to do the same with my scheme48 based projects.

One issue that you may or may not consider a release blocker though: scheme48 is almost unusably slow and resource intensive on OSX. It seems to be to happen when loading code, for instance ,opening posix takes at least a minute of 100% usage of a cpu core. Opening packages that don't use the FFI also seem to be unusually slow, though not nearly so. Even starting scheme48 has a noticeable pause. I believe it's related to the change to an llvm based compiler in the latest OSX releases. scheme48 1.8 doesn't seem to be affected though.

-Roderic
Post by Michael Sperber
Post by Robert Ransom
See attached for a patch that fixes one clear bug in the
external-events package. That doesn't fix the problem with
wait-for-child-process.
In order to fix the general problem of incorrect deadlock detection, I
1. a way to mark thread queues as ‘blocking on an external event’ for
deadlock-detection purposes
2. a way to mark condvars and placeholders as ‘blocking on an external
event’ (getaddrinfo and wait-for-child-process should be using
placeholders; the posix-signals package should be using condvars)
3. support for registering long-term handlers to handle every
occurrence of an external event UID
4. an as-general-as-possible ‘external asynchronous result’ system to
handle placeholders which should be filled in from C
So I finally fixed (I hope) these issues - it took a whole bunch of
- Threads can, when blocking on an external event or something similar,
explicitly say that the blocking does not contribute to deadlock.
- `wait-for-child-process' can now use a very simple implementation
using placeholders.
- The race with external-event uids is fixed by allocating the uids in
Scheme rather than in C. It should now be possible to build a
higher-level abstraction for this with placeholders, but I haven't
done that yet.
While the tests all succeed, and I've mulled over the changes quite a
while, are quite sensitive: Review and further testing much appreciated!
(Especially as this issue was the only one blocking the release.)
Thanks to Robert and Roderic for analyzing the problems and for the
patches - extremely helpful!
--
Cheers =8-} Mike
Friede, VölkerverstÀndigung und Ìberhaupt blabla
Michael Sperber
2012-06-14 13:47:15 UTC
Permalink
Post by Roderic Morris
Great! Related tests pass in scsh as well. Code looks good as well
from what I've been able to check. I'm glad to see the ball moving on
this, and I'm trying to do the same with my scheme48 based projects.
Thanks for trying this out!
Post by Roderic Morris
scheme48 is almost unusably slow and resource intensive on OSX. It
seems to be to happen when loading code, for instance ,opening posix
takes at least a minute of 100% usage of a cpu core. Opening packages
that don't use the FFI also seem to be unusually slow, though not
nearly so. Even starting scheme48 has a noticeable pause. I believe
it's related to the change to an llvm based compiler in the latest OSX
releases. scheme48 1.8 doesn't seem to be affected though.
Yes, there's a bug in the LLVM shipped with Mac OS X / XCode these days:
It contains a buggy prerelease of LLVM. I haven't found a way to work
around it in Scheme 48, and it really does not seem worth it: I
downloaded the clang+llvm binaries from the LLVM site, and those work
just fine.
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Marcus Crestani
2012-06-14 16:49:20 UTC
Permalink
MS> Yes, there's a bug in the LLVM shipped with Mac OS X / XCode these
MS> days: It contains a buggy prerelease of LLVM. I haven't found a way
MS> to work around it in Scheme 48, and it really does not seem worth
MS> it: I downloaded the clang+llvm binaries from the LLVM site, and
MS> those work just fine.

You can work around by turning off the compiler's optimization ("-O0").
Only "-O" with >0 causes the problem for me.
--
Marcus
Roderic Morris
2012-06-14 19:58:29 UTC
Permalink
Perfect, both of those methods worked. Thanks.

-Roderic
Post by Marcus Crestani
MS> Yes, there's a bug in the LLVM shipped with Mac OS X / XCode these
MS> days: It contains a buggy prerelease of LLVM. I haven't found a way
MS> to work around it in Scheme 48, and it really does not seem worth
MS> it: I downloaded the clang+llvm binaries from the LLVM site, and
MS> those work just fine.
You can work around by turning off the compiler's optimization ("-O0").
Only "-O" with >0 causes the problem for me.
--
Marcus
Michael Sperber
2012-01-05 17:15:27 UTC
Permalink
Post by Robert Ransom
Post by Roderic Morris
If I'm looking at the right revision, it's to allow the use of wait()
without triggering deadlock detection
It might do that, but it breaks get-address-info from the
net-addresses package (which (ab)uses external-event UIDs in the same
way).
Not quite: In `net-addresses', the external event is indeed triggered
from the external code, which is not the case with the waitpid stuff,
which just trampolines into the external events from Scheme. So I
believe the solution is to do with the wait code as with the getaddrinfo
case, namely to wait in a separate thread (or revert to the old
implementation if threads are not available.) I'll look into doing this
over the weekend.

Meanwhile, thanks for the analysis and the patches (which look good) -
I've pushed those.
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Robert Ransom
2012-01-06 06:20:29 UTC
Permalink
Post by Michael Sperber
Post by Robert Ransom
Post by Roderic Morris
If I'm looking at the right revision, it's to allow the use of wait()
without triggering deadlock detection
It might do that, but it breaks get-address-info from the
net-addresses package (which (ab)uses external-event UIDs in the same
way).
Not quite: In `net-addresses', the external event is indeed triggered
from the external code, which is not the case with the waitpid stuff,
which just trampolines into the external events from Scheme. So I
believe the solution is to do with the wait code as with the getaddrinfo
case, namely to wait in a separate thread (or revert to the old
implementation if threads are not available.) I'll look into doing this
over the weekend.
When I wrote that, I suspected that part of the problem with
wait-for-child-process was that it dynamically allocated a new
external-event UID for each event it wants to wait for, rather than
using one external-event UID for a whole class of events to be waited
for. The PreScheme code to handle external events seems to me to be
designed for the latter case. getaddrinfo also uses a new UID for
each event, rather than one UID for the whole class of
getaddrinfo-completion events; I still suspect that that is why
multiple simultaneous calls to wait-for-child-process broke a later
call to getaddrinfo.

See attached for a bundle which fixes wait-for-child-process. I'm no
longer convinced that changeset 0306c5a64775 was wrong, but I think my
bundle uses a cleaner approach, and it works with the current
external-event and interrupt systems.


Robert Ransom
Robert Ransom
2012-01-06 06:36:36 UTC
Permalink
Post by Robert Ransom
See attached for a bundle which fixes wait-for-child-process.
Oops. You want changeset fdd24fd71a02 (and its ancestors). See
attached for a bundle containing only the good commits.


Robert Ransom
Michael Sperber
2012-01-06 14:53:28 UTC
Permalink
Thanks for looking into this!
Post by Robert Ransom
See attached for a bundle which fixes wait-for-child-process. I'm no
longer convinced that changeset 0306c5a64775 was wrong, but I think my
bundle uses a cleaner approach, and it works with the current
external-event and interrupt systems.
I'm somewhat suspicious of the changes in the bundle, as it's not clear
to me why deadlock should get erroneously signalled before the changes:
After all, there's is special provision to *not* signal deadlock in the
root scheduler - it calls `waiting-for-external-events?', and if that
returns #t, no deadlock is assumed. (And I still think the right fix is
to handle wait the same as getaddrinfo.)

You've looked at the code in depth - did you consider this?
Post by Robert Ransom
When I wrote that, I suspected that part of the problem with
wait-for-child-process was that it dynamically allocated a new
external-event UID for each event it wants to wait for, rather than
using one external-event UID for a whole class of events to be waited
for. The PreScheme code to handle external events seems to me to be
designed for the latter case. [...]
getaddrinfo also uses a new UID for each event, rather than one UID
for the whole class of getaddrinfo-completion events; I still suspect
that that is why multiple simultaneous calls to wait-for-child-process
broke a later call to getaddrinfo.
It was designed for both cases, but in fact the getaddrinfo code was the
original motivation for implementing it. It's needed there because
there may be multiple simultaneous active calls to getaddrinfo from
different threads, and they all need to be notified separately of
completion.

So, if that doesn't work correctly, there's a bug.
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Robert Ransom
2012-01-07 19:05:39 UTC
Permalink
Post by Michael Sperber
Thanks for looking into this!
Post by Robert Ransom
See attached for a bundle which fixes wait-for-child-process. I'm no
longer convinced that changeset 0306c5a64775 was wrong, but I think my
bundle uses a cleaner approach, and it works with the current
external-event and interrupt systems.
I'm somewhat suspicious of the changes in the bundle, as it's not clear
After all, there's is special provision to *not* signal deadlock in the
root scheduler - it calls `waiting-for-external-events?', and if that
returns #t, no deadlock is assumed. (And I still think the right fix is
to handle wait the same as getaddrinfo.)
You've looked at the code in depth - did you consider this?
Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event. My branch provides a way to inform
waiting-for-external-events? that threads blocked on a given
synchronization object will be alerted when an external event of some
sort occurs; this should be used for signal queues, too.

Regarding getaddrinfo, I think the Right Thing is to make the Scheme
interface to getaddrinfo fill in a placeholder, rather than using the
external event system directly as it does now. Ideally, the external
asynchronous result code would be written in such a way that both
getaddrinfo and wait-for-child-process could use it.
Post by Michael Sperber
Post by Robert Ransom
When I wrote that, I suspected that part of the problem with
wait-for-child-process was that it dynamically allocated a new
external-event UID for each event it wants to wait for, rather than
using one external-event UID for a whole class of events to be waited
for. The PreScheme code to handle external events seems to me to be
designed for the latter case. [...]
getaddrinfo also uses a new UID for each event, rather than one UID
for the whole class of getaddrinfo-completion events; I still suspect
that that is why multiple simultaneous calls to wait-for-child-process
broke a later call to getaddrinfo.
It was designed for both cases, but in fact the getaddrinfo code was the
original motivation for implementing it. It's needed there because
there may be multiple simultaneous active calls to getaddrinfo from
different threads, and they all need to be notified separately of
completion.
So, if that doesn't work correctly, there's a bug.
There probably is a bug. It looks like the same bug in how interrupts
are handled (when multiple interrupts of the same type arrive at about
the same time, some seem to get 'lost' or delayed) that causes my test
case for the POSIX signals package to fail. (That test case has been
disabled since I wrote it because I couldn't figure out why it wasn't
working or how to fix it.)

Someone may have to put 'bread crumbs' into the VM and RTS in order to
figure out exactly what is going wrong.


Robert Ransom
Michael Sperber
2012-01-07 19:15:07 UTC
Permalink
Post by Robert Ransom
Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event.
Why would it have to know that? And, even it were necessary to know
that, it would be easier to look at the condvar's waiters instead of
marking the queues, no?

I still feel I'm missing something important in your argument ...
Post by Robert Ransom
Regarding getaddrinfo, I think the Right Thing is to make the Scheme
interface to getaddrinfo fill in a placeholder, rather than using the
external event system directly as it does now.
OK - it still seems to me the hard part is communicating the information
from C to Scheme somehow.

But at this point I really just want to fix the deadlock issue and the
potential bug before the release (which I really, really want to happen
soon). After that, we can make bigger changes.
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Robert Ransom
2012-01-07 19:41:27 UTC
Permalink
Post by Michael Sperber
Post by Robert Ransom
Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event.
Why would it have to know that?
In the process-id case, threads waiting on a process ID's placeholder
are awakened by process-terminated-children, which is called by the
os-signal interrupt handler. Nothing waits on an external-event UID.

In the signal-queue case, threads waiting on a signal queue's
ready-for-read condvar are awakened by the os-signal interrupt handler
pushing a signal object into the signal queue. Nothing waits on an
external-event UID.

If the deadlock-detection code isn't told that those synchronization
objects will be alerted by (an interrupt handler triggered by) an
external event of some sort, then it will falsely report that a
deadlock has occurred.
Post by Michael Sperber
And, even it were necessary to know
that, it would be easier to look at the condvar's waiters instead of
marking the queues, no?
No, because we also need to look at placeholders' waiters, and the
simplest way to handle both condvars and placeholders (as well as
other synchronization ‘primitives’ which may be added in the future)
is to look at their underlying thread queues.
Post by Michael Sperber
I still feel I'm missing something important in your argument ...
Post by Robert Ransom
Regarding getaddrinfo, I think the Right Thing is to make the Scheme
interface to getaddrinfo fill in a placeholder, rather than using the
external event system directly as it does now.
OK - it still seems to me the hard part is communicating the information
from C to Scheme somehow.
But at this point I really just want to fix the deadlock issue and the
potential bug before the release (which I really, really want to happen
soon). After that, we can make bigger changes.
The interrupt system has been slightly broken for as long as I nave
used Scheme 48, and I haven't figured out why. I think not using
interrupts in a way that will definitely trigger that brokenness (i.e.
not flooding it with multiple interrupts of the same type at once) is
sufficient for the release.


Robert Ransom
Will Noble
2012-01-08 00:09:15 UTC
Permalink
Post by Robert Ransom
Post by Michael Sperber
Post by Robert Ransom
Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event.
Why would it have to know that?
In the process-id case, threads waiting on a process ID's placeholder
are awakened by process-terminated-children, which is called by the
os-signal interrupt handler. Nothing waits on an external-event UID.
I have a slightly different take on this. Deadlock detection occurs in
ROOT-WAIT, which, besides consulting WAITING-FOR-EXTERNAL-EVENTS? also looks
at threads blocking on I/O and (obviously) threads waiting for an alarm.
Each of these three cases corresponds to a type of VM interrupt. So my idea
of what ROOT-WAIT does here is to say that, if we have reason to believe a
VM interrupt might wake a thread, then we don't need to declare deadlock.

What's missing from this picture is OS signals. The VM recognizes OS signals
as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting
on one. I assumed the reason for this is because, while the run-time system
maintains queues for the other cases, POSIX maintains sole responsibility
for signal-waiters, and RTS doesn't call POSIX.

This suggests to me that the RTS should at least keep track of the number of
threads waiting for a signal, and POSIX could increment/decrement that
number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT
could check.

Basically, I agree what Robert says, except that I don't think
WAITING-FOR-EXTERNAL-EVENTS? should be extended to deal with things other
than external events (in the narrow sense of the mechanism called "external
events", the one with UIDs). Instead we should just ask "are there any
threads that could wake up if we get an OS signal?" before calling deadlock.

Best,
Will
Roderic Morris
2012-01-08 06:29:54 UTC
Permalink
I actually quite like the approach Robert's patches take. It has the
advantage of expedience in this case, as it's already implemented and
works well. Moreover, it provides a much more pleasant and useful
interface (and one available in scheme, instead of through the FFI)
than the UID approach, which seems to be broken in the general case.

Will might be right in his thinking, but there's still the matter of
library writers, outside of the libraries that are included in s48,
that need to avoid deadlock when waiting on say, work being done in a
separate pthread. If the UID approach doesn't work in general, why not
give them something like what Robert's written?

-Roderic
Post by Will Noble
Post by Robert Ransom
Post by Michael Sperber
Post by Robert Ransom
Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event.
Why would it have to know that?
In the process-id case, threads waiting on a process ID's placeholder
are awakened by process-terminated-children, which is called by the
os-signal interrupt handler.  Nothing waits on an external-event UID.
I have a slightly different take on this. Deadlock detection occurs in
ROOT-WAIT, which, besides consulting WAITING-FOR-EXTERNAL-EVENTS? also looks
at threads blocking on I/O and (obviously) threads waiting for an alarm.
Each of these three cases corresponds to a type of VM interrupt. So my idea
of what ROOT-WAIT does here is to say that, if we have reason to believe a
VM interrupt might wake a thread, then we don't need to declare deadlock.
What's missing from this picture is OS signals. The VM recognizes OS signals
as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting
on one. I assumed the reason for this is because, while the run-time system
maintains queues for the other cases, POSIX maintains sole responsibility
for signal-waiters, and RTS doesn't call POSIX.
This suggests to me that the RTS should at least keep track of the number of
threads waiting for a signal, and POSIX could increment/decrement that
number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT
could check.
Basically, I agree what Robert says, except that I don't think
WAITING-FOR-EXTERNAL-EVENTS? should be extended to deal with things other
than external events (in the narrow sense of the mechanism called "external
events", the one with UIDs). Instead we should just ask "are there any
threads that could wake up if we get an OS signal?" before calling deadlock.
Best,
Will
Michael Sperber
2012-01-08 12:05:45 UTC
Permalink
Post by Will Noble
What's missing from this picture is OS signals. The VM recognizes OS signals
as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting
on one. I assumed the reason for this is because, while the run-time system
maintains queues for the other cases, POSIX maintains sole responsibility
for signal-waiters, and RTS doesn't call POSIX.
This suggests to me that the RTS should at least keep track of the number of
threads waiting for a signal, and POSIX could increment/decrement that
number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT
could check.
Yup. I just remembered this from the deep past when I looked at the
signal docs in the manual - where this problem is documented. I
definitely want to fix this before the release.
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Robert Ransom
2012-01-10 10:11:26 UTC
Permalink
Post by Will Noble
I have a slightly different take on this. Deadlock detection occurs in
ROOT-WAIT, which, besides consulting WAITING-FOR-EXTERNAL-EVENTS? also looks
at threads blocking on I/O and (obviously) threads waiting for an alarm.
Each of these three cases corresponds to a type of VM interrupt. So my idea
of what ROOT-WAIT does here is to say that, if we have reason to believe a
VM interrupt might wake a thread, then we don't need to declare deadlock.
What's missing from this picture is OS signals. The VM recognizes OS signals
as a type of interrupt, but ROOT-WAIT doesn't check if anything is waiting
on one. I assumed the reason for this is because, while the run-time system
maintains queues for the other cases, POSIX maintains sole responsibility
for signal-waiters, and RTS doesn't call POSIX.
This suggests to me that the RTS should at least keep track of the number of
threads waiting for a signal, and POSIX could increment/decrement that
number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT
could check.
Would a counter be sufficient for this? I see some use of weak
pointers in scheme/posix/signal.scm, so I really have no idea whether
we could safely use that optimization (i.e. look at a counter instead
of the thread queues themselves), even if we didn't need a more
general solution.


Robert Ransom
Will Noble
2012-01-11 02:54:30 UTC
Permalink
Post by Robert Ransom
Post by Will Noble
This suggests to me that the RTS should at least keep track of the number of
threads waiting for a signal, and POSIX could increment/decrement that
number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT
could check.
Would a counter be sufficient for this? I see some use of weak
pointers in scheme/posix/signal.scm, so I really have no idea whether
we could safely use that optimization (i.e. look at a counter instead
of the thread queues themselves), even if we didn't need a more
general solution.
Yes. A sleeping thread can't wait for a signal so a counter works just as
well as looking at the thread queue. It's less optimization than
convenience. With the approach I outlined in the previous message you might
equally well register the actual thread queues with the RTS and have
ROOT-WAIT go through them to see if a thread somewhere is waiting for a
signal, but a counter does the same thing.

Regarding the weak pointer usage in signal.scm: those map signal numbers to
signal representations. If, say, you have a signal 5 in the system, the
mapping returns the existing object when asked for signal 5. However, if
everyone else loses their reference to that object, you don't want the
abstract concept of "5" to keep it around forever without being
garbage-collected. It's not something that would break the counter shortcut.

Best,
Will
Robert Ransom
2012-01-14 12:43:53 UTC
Permalink
Post by Will Noble
Post by Robert Ransom
Post by Will Noble
This suggests to me that the RTS should at least keep track of the number of
threads waiting for a signal, and POSIX could increment/decrement that
number. Then there could be a function WAITING-ON-OS-SIGNAL? that ROOT-WAIT
could check.
Would a counter be sufficient for this? I see some use of weak
pointers in scheme/posix/signal.scm, so I really have no idea whether
we could safely use that optimization (i.e. look at a counter instead
of the thread queues themselves), even if we didn't need a more
general solution.
Yes. A sleeping thread can't wait for a signal so a counter works just as
well as looking at the thread queue. It's less optimization than
convenience. With the approach I outlined in the previous message you might
equally well register the actual thread queues with the RTS and have
ROOT-WAIT go through them to see if a thread somewhere is waiting for a
signal, but a counter does the same thing.
??? A thread which is waiting for a signal is sleeping. In any case,
I don't see how this is relevant to whether we can use a counter
(which will break deadlock detection permanently if one of the signal
system's thread queues is ever GCed while a thread is blocked on it)
to determine whether any threads are waiting for an OS signal.
Post by Will Noble
Regarding the weak pointer usage in signal.scm: those map signal numbers to
signal representations. If, say, you have a signal 5 in the system, the
mapping returns the existing object when asked for signal 5. However, if
everyone else loses their reference to that object, you don't want the
abstract concept of "5" to keep it around forever without being
garbage-collected. It's not something that would break the counter shortcut.
Signal queues are weakly held too. Fortunately, the
(signal-queue-signals queue) call in find-next-signal keeps a
reference to the signal queue around when a thread is blocked on the
queue, so we would get away with the counter optimization (but only by
a lucky accident).


Robert Ransom
Will Noble
2012-01-14 22:55:56 UTC
Permalink
Post by Robert Ransom
??? A thread which is waiting for a signal is sleeping. In any case,
I don't see how this is relevant to whether we can use a counter
(which will break deadlock detection permanently if one of the signal
system's thread queues is ever GCed while a thread is blocked on it)
to determine whether any threads are waiting for an OS signal.
I misunderstood your issue with the counter and ended up explaining why we
couldn't double-count, which is incredibly obvious. I hope you were not
offended ;). On the other point, it doesn't look to me like a thread queue
with a thread in it waiting for a signal could be garbage-collected...
Post by Robert Ransom
Signal queues are weakly held too. Fortunately, the
(signal-queue-signals queue) call in find-next-signal keeps a
reference to the signal queue around when a thread is blocked on the
queue, so we would get away with the counter optimization (but only by
a lucky accident).
...although I don't understand how this prevents it. The mechanism I see is
the signal mapper (from OS signal number to list of thread queues, for all
known signals). It strongly references every non-empty thread queue. So a
way to know if any thread is waiting for a signal without keeping a counter
is to iterate over the mapper looking for a non-empty queue.

(Also evidently I was talking about a different usage of weak pointers in
signal.scm than the one you referred to, so basically my entire post
explained the wrong things. Sorry!)

I don't think we really have a disagreement here. To summarize my
recommendation, I like the narrow solution where the deadlock detection code
checks if any thread is waiting on an OS signal; currently it doesn't take
this into account at all. POSIX has the relevant information, but we need it
to call the RTS instead of the other way around to avoid making the RTS a
client of POSIX. We could accomplish this a number of ways, including using
a counter.

Best,
Will
Michael Sperber
2012-01-08 12:03:51 UTC
Permalink
Post by Robert Ransom
In the process-id case, threads waiting on a process ID's placeholder
are awakened by process-terminated-children, which is called by the
os-signal interrupt handler. Nothing waits on an external-event UID.
Ah, OK, but that's a bug (even documented in the manual) with the signal
code, which also needs fixing. As soon as that's fixed, the problem
goes away, right?
Post by Robert Ransom
If the deadlock-detection code isn't told that those synchronization
objects will be alerted by (an interrupt handler triggered by) an
external event of some sort, then it will falsely report that a
deadlock has occurred.
That also makes sense, but if we're talking about "external event of
some sort", rather than *the* external events (i.e. the specific
things in scheme/rts/external-event.scm) then the fix should also
happen outside of external-event.scm, deeper down in the thread system,
is my current thinking. Does that make sense?

(I'm trying to think this through until I find enough time to actually
work on the fix. I really appreciate your patience.)
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla
Robert Ransom
2012-01-10 09:53:34 UTC
Permalink
Post by Michael Sperber
Post by Robert Ransom
In the process-id case, threads waiting on a process ID's placeholder
are awakened by process-terminated-children, which is called by the
os-signal interrupt handler. Nothing waits on an external-event UID.
Ah, OK, but that's a bug (even documented in the manual) with the signal
code, which also needs fixing. As soon as that's fixed, the problem
goes away, right?
process-id placeholders are separate from signal queues, and their
deadlock-detection issues need to be fixed separately. (We can't just
use the fix for deadlock detection when threads are waiting on signal
queues to handle the case of process-id placeholders, because
something needs to listen for SIGCHLD all the time, even when no
threads are currently waiting on process-id placeholders.)
Post by Michael Sperber
Post by Robert Ransom
If the deadlock-detection code isn't told that those synchronization
objects will be alerted by (an interrupt handler triggered by) an
external event of some sort, then it will falsely report that a
deadlock has occurred.
That also makes sense, but if we're talking about "external event of
some sort", rather than *the* external events (i.e. the specific
things in scheme/rts/external-event.scm) then the fix should also
happen outside of external-event.scm, deeper down in the thread system,
is my current thinking. Does that make sense?
It makes sense to keep that mechanism separate from the
external-events package. See my (revised)
wait-for-child-process-deadlock-fix branch (attached, and pushed using
hg-git to https://git.torproject.org/user/rransom/scheme48.git) for
that.


Robert Ransom
Robert Ransom
2012-01-11 06:37:13 UTC
Permalink
Post by Robert Ransom
Post by Michael Sperber
That also makes sense, but if we're talking about "external event of
some sort", rather than *the* external events (i.e. the specific
things in scheme/rts/external-event.scm) then the fix should also
happen outside of external-event.scm, deeper down in the thread system,
is my current thinking. Does that make sense?
It makes sense to keep that mechanism separate from the
external-events package. See my (revised)
wait-for-child-process-deadlock-fix branch (attached, and pushed using
hg-git to https://git.torproject.org/user/rransom/scheme48.git) for
that.
I've found some bugs in this branch; I'll send a revised branch soon.


Robert Ransom
Robert Ransom
2012-01-11 08:46:54 UTC
Permalink
Post by Robert Ransom
Post by Robert Ransom
Post by Michael Sperber
That also makes sense, but if we're talking about "external event of
some sort", rather than *the* external events (i.e. the specific
things in scheme/rts/external-event.scm) then the fix should also
happen outside of external-event.scm, deeper down in the thread system,
is my current thinking. Does that make sense?
It makes sense to keep that mechanism separate from the
external-events package. See my (revised)
wait-for-child-process-deadlock-fix branch (attached, and pushed using
hg-git to https://git.torproject.org/user/rransom/scheme48.git) for
that.
I've found some bugs in this branch; I'll send a revised branch soon.
See attached.


Robert Ransom
Robert Ransom
2012-01-10 10:08:56 UTC
Permalink
Post by Michael Sperber
Post by Robert Ransom
Regarding getaddrinfo, I think the Right Thing is to make the Scheme
interface to getaddrinfo fill in a placeholder, rather than using the
external event system directly as it does now.
OK - it still seems to me the hard part is communicating the information
from C to Scheme somehow.
The external-events system is broken, both as used by getaddrinfo and
when used with long-lived UIDs.

When a UID is allocated from C for an event, then passed to a Scheme
thread which waits for that UID once, there is a race condition -- if
C signals an event with that UID before Scheme starts waiting for it,
the event will be dropped on the floor, and the Scheme thread will not
be awakened unless and until some other event is allocated the same
UID and is signaled.

When a long-lived UID is allocated for a class of events, and a Scheme
thread waits for that UID repeatedly, there is a different race
condition -- if C signals that UID between the time that Scheme checks
for events previously queued from C and the time that it waits for a
new event, the event signal will be dropped on the floor.

See my event-waiters branch (attached and pushed to hg-git) for a
synchronization primitive which can be used to fix the latter race
condition, provided that the event-waiter is created before C starts
signaling events of a particular class.

To fix the race condition in getaddrinfo-style usage of external
events, we need a way to allocate some sort of event identifier from
Scheme, pass the identifier to C, and pass the result back to Scheme
to be put in a placeholder.


Robert Ransom
Loading...