The case against until_no_eintr

Thu Feb 25 10:12:11 GMT 2010

On 21/02/2010, Andrew Bennetts <andrew.bennetts at canonical.com> wrote:
>
> Would it be accurate to summarise your view as “this needs to be fixed
> in Python, because it's impossible for Bazaar to fix it everywhere”?  If
> so, I absolutely agree.

Yep, that's about right. I know dealing with upstream is always a pain
and feels like twice as much work as just hacking around their
problems in your own code, but it generally pays off in the long term.

> One approach would be to call signal.siginterrupt(signum, False) for
> every signal, or maybe those with signal handlers other than SIG_IGN.
> I'm not totally certain this won't have undesirable side-effects
> (anything involving signals always seems to involve unpleasant
> surprises), but on my reading so far it might be ok.  This function is
> only available since Python 2.6, but maybe we could attempt to use
> ctypes or a small C module to do it on 2.4 and 2.5.  What do you think?

That might be acceptable, but will upset some current expectations.

Python has long been careful to do the equivalent of calling
siginterrupt(sig, 1) for every signal handler it registers, as
otherwise it's possible to hang indefinitely in an uninterruptible
state.
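
For reference, a rough sketch of the quoted suggestion, assuming Python
2.6 or later on a Unix platform. The function name is made up for
illustration and this is not code from bzrlib. It only touches signals
that currently have a Python-level handler, and it gives up the
guarantee described above, so a restarted call can delay those handlers:

```python
import signal


def make_handled_signals_restart():
    """Ask the OS to restart system calls interrupted by handled signals.

    Hypothetical helper: calls signal.siginterrupt(signum, False) for
    every signal that has a Python-level handler installed, and returns
    the list of signal numbers that were changed.
    """
    changed = []
    for signum in range(1, signal.NSIG):
        try:
            handler = signal.getsignal(signum)
        except ValueError:
            # Not a signal number this platform accepts.
            continue
        # Skip SIG_DFL, SIG_IGN and handlers not installed from Python.
        if not callable(handler):
            continue
        try:
            signal.siginterrupt(signum, False)
        except (OSError, ValueError, RuntimeError):
            # The platform refused; leave this signal as it was.
            continue
        changed.append(signum)
    return changed
```

signal.signal() makes a signal interrupting again whenever a handler is
installed, so a call like this would have to be repeated after every
registration.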

The reasoning (for my own future reference as much as anything) as I
understand it is this: Python's C-level signal handler only sets some
state saying a signal has arrived, and the Python-level function is
called when the interpreter resumes. This is because no real work can
be done in a signal handler, as the program is in an undetermined state
when it runs. So if an interrupted system call is automatically
restarted, the Python code intended to respond to that signal is still
delayed until the call finally returns.

So, while the BSD style is a much nicer interface for C code, it
doesn't work with the way Python wants to deal with signals unless
nothing is going to block for an unreasonable length of time.

> Otherwise, I don't see what's wrong with working around the bug as much
> as practical in bzrlib.  Yes, we can't fix bugs in e.g. ftplib, but we
> can at least make sure that talking to an SSH process is robust (and for
> that matter, we can make sure that large uploads via FTP are robust,
> even if we can't fix the small writes ftplib will make on the control
> channel).  It's not really much different to bzrlib.cleanup, which in
> essence provides a workaround for Python 2.4's lack of “try/finally
> blocks where exceptions during finally don't override exceptions from
> the try”, or using bzrlib.tuned_gzip because the builtin gzip was too
> slow.

Don't want to pick a fight, but think this is rather depressing:

C:\bzr\bzr\dev>bzr blame --long bzrlib\tuned_gzip.py | grep -C1 upstream
1641.1.1  robertc at robertcollins.net 20060407 |
                                             | """Bzrlib specific gzip tunings. We plan to feed these to the upstream gzip."""
                                             |

Those tunings had two years to get into Python 2.6, which Ubuntu has
been shipping for what, two releases now. Is there even as much as an
issue on their tracker?

> I agree some places are more important to fix than others, and that the
> fixes really need to be completely safe (unlike many of the current
> uses of until_no_eintr, which are very broken).
>
> I don't think it follows that because Bazaar has to cope with connection
> drops, so therefore failures due to EINTR are acceptable.  As a user,
> I'm ok with a failure because my internet connection dropped out.  I'm
> not ok with a failure because the Bazaar GUI I'm using had a subprocess
> finish during a push.

Clearly failures are bad, but code trying to address an issue with a
microsecond window of opportunity might not be worthwhile when the
effect of that rare possibility is no worse than external problems
that need to be handled anyway. Covering the recv calls that are
likely to block for seconds at a time seems a priority, whereas even
sendall does not, as it is much harder to break.
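
As an illustration of covering recv, here is a minimal sketch of a
retry loop. The name recv_retrying_eintr is hypothetical and this is
not the bzrlib implementation. It is written in Python 3 spelling;
Python 2 code of this era would catch socket.error instead. A recv that
fails with EINTR has not consumed any data, so calling it again is safe:

```python
import errno


def recv_retrying_eintr(sock, bufsize):
    """Call sock.recv(bufsize), retrying while it fails with EINTR.

    Any other error is re-raised unchanged.
    """
    while True:
        try:
            return sock.recv(bufsize)
        except (OSError, IOError) as e:
            if getattr(e, "errno", None) != errno.EINTR:
                raise
            # Interrupted before any data arrived: just try again.
```

The same loop is not automatically safe around calls that may have done
part of their work before being interrupted, which is why each wrapped
call needs looking at individually.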

> (And as a developer, it's very inconvenient for the SIGQUIT-for-pdb
> feature to break the active connection.)

Think I'd actually made this branch before reading your email, see the related:
<https://code.launchpad.net/~gz/bzr/ignore_sigquit_in_ssh_child_162502/+merge/19711>

Martin