[RFC] Add run_cleanup, do_with_cleanup
Martin Pool
mbp at canonical.com
Thu Sep 24 07:18:13 BST 2009
2009/9/24 Andrew Bennetts <andrew.bennetts at canonical.com>:
> Martin Pool wrote:
>> 2009/9/24 Andrew Bennetts <andrew.bennetts at canonical.com>:
> [...]
>>
>> (I haven't read the diff yet, but we did talk about it ...)
>>
>> I was originally suggesting that we should handle this by putting all
>> cleanup-type code through a single bottleneck. I think you're on to
>> something in saying (if you are) that we should instead put it through
>> common policies for different types of cleanup.
>
> Yes, I think that's what I'm getting at (the picture has been emerging
> pretty gradually!).
>
>> My sense is that this is more about the type of cleanup than the place
>> it's being called from, so I'd expect that normally we will be
>> changing the functions called at cleanup, not the code with the
>> try/finally block. Of course if there is code with multiply nested
>> try/finally blocks and we can make it cleaner by changing to a
>> higher-order-function that runs them in order, that's great.
>
> Right, that's a good example where a h-o-f can make the code cleaner, not
> just more robust.
>
>> I think we can distinguish different types of severity of failure.
>> For instance, if we fail to unlock a branch, that's worth telling the
>> user about, but it's bad to give them such a large or severe message
>> that it obscures their view of what actually went wrong, as often
>> seems to happen in bugs.
>>
>> One blue-sky option here would be to indicate success or failure by
>> return code: unlock (say) returns True if it actually unlocked, False
>> if it failed. Then code that really cares can see, but the default
>> will be not to interrupt processing. This gives you a way to have
>> failures that are 'slightly interesting'.
>
> I don't particularly like checking return values.
OK, but why?  Obviously the normal convention in Python is to use
exceptions rather than return codes, but are there any other
drawbacks?  The other main objection is that return codes are easy to
ignore, but that is (perhaps) just what we want here.
One real problem with return codes is that a bare boolean gives no
explanation of what went wrong.  We could potentially return an
Exception object instead, but that might be getting too weird.
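To make that concrete, here is a rough sketch of what a call site might
look like if unlock returned a boolean (purely hypothetical: unlock does
not return anything today, and do_the_work is a placeholder):

    from bzrlib import trace

    def do_locked_work(branch):
        branch.lock_write()
        try:
            do_the_work(branch)     # placeholder for the real work
        finally:
            if not branch.unlock():
                # 'slightly interesting': note it, but don't interrupt
                # whatever the user actually asked for.
                trace.mutter('failed to unlock %s', branch)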
> I think it might be nice
> to be able to say "foo.unlock(on_error='warn')", or
> on_error='log'/'ignore'/'raise'. I'm not sure which we'd like to have as
> the default, and it's also questionable whether we really want to
> burden all unlock implementations with this, but it's an interesting
> strawman API.
That seems OK, at least assuming that different callers really will
want different treatment (errors, warnings, or nothing).  But aside
from tests specifically for locking, I'm not sure they will.
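Just so we're looking at the same thing, a strawman-only sketch of the
on_error idea (none of this exists in bzrlib; the class and _do_unlock
names are purely illustrative):

    from bzrlib import trace

    class StrawmanLockable(object):

        def unlock(self, on_error='warn'):
            try:
                self._do_unlock()   # hypothetical: the real unlock work
            except KeyboardInterrupt:
                raise
            except Exception, exc:
                if on_error == 'raise':
                    raise
                elif on_error == 'warn':
                    trace.warning('failed to unlock %s: %s', self, exc)
                elif on_error == 'log':
                    trace.log_exception_quietly()
                # on_error == 'ignore': swallow silently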
> My thinking here is that *usually* an unlock failure is probably not
> something to report to the user, but there are times when we definitely want
> errors to propagate directly (e.g. an explicit invocation of break-lock, and
> probably during the test suite).
I don't think break-lock calls unlock; it calls break_lock, which
presumably would always raise its errors.
> So there's a mix of inputs here. I'm having trouble articulating this
> clearly, but roughly I feel that what's desired in any individual case may
> depend on a combination of:
>
> - the cleanup operation itself;
> - what the callsite intends (some places may explicitly expect and suppress
> particular errors, so emitting something to the UI instead of raising
> something the callsite can catch may be a bad thing);
> - any global policy, like debug flags;
> - and maybe also the specific error (e.g. KeyboardInterrupt vs. connection
> lost vs. an assertion about unfinished progress bars)?
>
> The challenge seems to be balancing these desires while keeping things
> reasonably clean and simple.
My idea was that we should try just changing the policy, at least in a
branch, and see what happens along several dimensions: how it changes
the look and feel of the code as you convert it, which tests fail (if
any) and how easily and cleanly they can be fixed, and what positive or
negative effects it has on users.  Then we'll have more data on how to
strike the balance.
>> It seems to me the most common requests are:
>>
>> * TooManyConcurrentRequests as a knock-on effect from a previous
>> operation being interrupted either because of a client-side bug or
>> error, or a connection dropping
>> * Unlock failing because the connection has dropped or in conjunction
>> with the above
>>
>> The particular cleanups are probably unlock and then
>> abort_write_group, and while they do have consequences, failing to
>> complete them is not a very severe problem.
>
> Right.
>
> In the case of abort_write_group, the consequence of failing to cleanup is
> 99% uninteresting to the user (there's some disk space that could be
> reclaimed manually in .bzr/repository/upload, but meh).
>
> In the case of unlock, if it's a read lock it's totally uninteresting. If
> it's a write lock then the user probably should be told about it because
> they will probably need to run break-lock manually.
>
> So my desire is:
>
> a) for the short term, find something reasonably cheap to do that addresses
> the bugs we have (even if incomplete or a bit ugly), and
> b) figure out, at least approximately, what we'd like to be doing about the
> problem long term, so that we can make sure whatever short-term things we
> do don't lead us away from that.
>
> So the code in this patch, and the proposed policy, isn't meant to be the
> final answer, but I hope a useful stepping stone towards a final answer that
> can address some bugs today.
I think that policy sounds OK, though maybe we need different
functions for important and unimportant cleanups.  I suggest taking
the implementation all the way through with, say, unlock, so that we
can see how it actually behaves.
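For instance (just a sketch, with do_the_work as a placeholder), the
important/unimportant split could simply be which helper the call site
reaches for:

    from bzrlib.cleanup import run_cleanup, run_cleanup_reporting_errors

    def do_something_locked(branch, pb):
        branch.lock_write()
        try:
            do_the_work(branch)     # placeholder for the real work
        finally:
            # Unimportant housekeeping: an unfinished progress bar isn't
            # worth bothering the user about, so just log quietly.
            run_cleanup(pb.finished)
            # Important: a stuck write lock may need a manual break-lock,
            # so report failures as well as logging them.
            run_cleanup_reporting_errors(branch.unlock)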
=== added file 'bzrlib/cleanup.py'
--- bzrlib/cleanup.py 1970-01-01 00:00:00 +0000
+++ bzrlib/cleanup.py 2009-09-24 01:05:22 +0000
+"""Helpers for managing cleanup functions and the errors they might raise.
+
+Generally, code that wants to perform some cleanup at the end of an action will
+look like this::
+
+    from bzrlib.cleanup import run_cleanup
+    try:
+        do_something()
+    finally:
+        run_cleanup(cleanup_something)
+
+Any errors from `cleanup_something` will be logged, but not raised.
+Importantly, any errors from do_something will be propagated.
I wonder if generally this should be going inside the function run
from cleanups - but we can change that later.
+
+There is also a convenience function for running multiple, independent cleanups
+in sequence: run_cleanups, e.g.::
+
+    try:
+        do_something()
+    finally:
+        run_cleanups([cleanup_func_a, cleanup_func_b], ...)
+
+Developers can use the `-Dcleanup` debug flag to cause cleanup errors to be
+reported in the UI as well as logged.
+
+Note the tradeoff that run_cleanup/run_cleanups makes: errors from
+`do_something` will not be obscured by errors from `cleanup_something`, but
+errors from `cleanup_something` will never reach the user, even if there is no
+error from `do_something`. So run_cleanup is good to use when a failure of
+internal housekeeping (e.g. failure to finish a progress bar) is unimportant to
+a user.
+
+If you want to be certain that the first, and only the first, error is raised,
+then use::
+
+    do_with_cleanups(do_something, cleanups)
+
+This is more inconvenient (because you need to make every try block a
+function), but will ensure that the first error encountered is the one raised,
+while also ensuring all cleanups are run.
+"""
Nice docstring.
These could perhaps become with blocks when we require Python 2.5 (or 2.6?).
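For example (a sketch only, assuming the names from this patch), a
2.5-style contextmanager wrapping run_cleanup might look like:

    from __future__ import with_statement  # needed on Python 2.5
    from contextlib import contextmanager

    from bzrlib.cleanup import run_cleanup

    @contextmanager
    def cleanup_on_exit(func, *args, **kwargs):
        # Run the body of the with block, then run_cleanup(func)
        # regardless of whether the body raised.
        try:
            yield
        finally:
            run_cleanup(func, *args, **kwargs)

    # so callers could write:
    #     with cleanup_on_exit(branch.unlock):
    #         do_something()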
+
+
+import sys
+from bzrlib import (
+    debug,
+    trace,
+    )
+
+def _log_cleanup_error(exc):
+    trace.mutter('Cleanup failed:')
+    trace.log_exception_quietly()
+    if 'cleanup' in debug.debug_flags:
+        trace.warning('bzr: warning: Cleanup failed: %s', exc)
+
+
+def run_cleanup(func, *args, **kwargs):
+ """Run func(*args, **kwargs), logging but not propagating any error it
+ raises.
+
+ :returns: True if func raised no errors, else False.
+ """
+ try:
+ func(*args, **kwargs)
+ except KeyboardInterrupt:
+ raise
+ except Exception, exc:
+ _log_cleanup_error(exc)
+ return False
+ return True
Do you need to handle KeyboardInterrupt specially? Isn't it outside
of Exception, or is that only from 2.5 on?
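A throwaway snippet to check, in case it varies between interpreters
(I believe 2.4 still has KeyboardInterrupt under Exception, and 2.5
moved it under the new BaseException, which would mean the explicit
re-raise only matters on 2.4):

    import sys
    print sys.version_info[:2], issubclass(KeyboardInterrupt, Exception)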
+
+
+def run_cleanup_reporting_errors(func, *args, **kwargs):
+    try:
+        func(*args, **kwargs)
+    except KeyboardInterrupt:
+        raise
+    except Exception, exc:
+        trace.mutter('Cleanup failed:')
+        trace.log_exception_quietly()
+        trace.warning('Cleanup failed: %s', exc)
+        return False
+    return True
Ok, I can see this is a bit different but it probably wants a
docstring to say so.
+
+
+def run_cleanups(funcs, on_error='log'):
+ """Run a series of cleanup functions.
+
+ :param errors: One of 'log', 'warn first', 'warn all'
+ """
+ seen_error = False
+ for func in funcs:
+ if on_error == 'log' or (on_error == 'warn first' and seen_error):
+ seen_error |= run_cleanup(func)
+ else:
+ seen_error |= run_cleanup_reporting_errors(func)
+
+
+def do_with_cleanups(func, cleanup_funcs):
+ """Run `func`, then call all the cleanup_funcs.
+
+ All the cleanup_funcs are guaranteed to be run. The first exception raised
+ by func or any of the cleanup_funcs is the one that will be propagted by
+ this function (subsequent errors are caught and logged).
typo: 'propagted' -> 'propagated'
+
+    Conceptually similar to::
+
+        try:
+            return func()
+        finally:
+            for cleanup in cleanup_funcs:
+                cleanup()
+
+    It avoids several problems with using try/finally directly:
+     * an exception from func will not be obscured by a subsequent exception
+       from a cleanup.
+     * an exception from a cleanup will not prevent other cleanups from
+       running (but the first exception encountered is still the one
+       propagated).
+
+    Unlike `run_cleanup`, `do_with_cleanups` can propagate an exception from a
+    cleanup, but only if there is no exception from func.
+    """
+    # As correct as Python 2.4 allows.
+    try:
+        result = func()
+    except:
+        # We have an exception from func already, so suppress cleanup errors.
+        run_cleanups(cleanup_funcs)
+        raise
+    else:
+        # No exception from func, so allow the first exception from
+        # cleanup_funcs to propagate if one occurs (but only after running all
+        # of them).
+        exc_info = None
+        for cleanup in cleanup_funcs:
+            # XXX: Hmm, if KeyboardInterrupt arrives at exactly this line, we
+            # won't run all cleanups... perhaps we should temporarily install a
+            # SIGINT handler?
+            if exc_info is None:
+                try:
+                    cleanup()
+                except:
+                    # This is the first cleanup to fail, so remember its
+                    # details.
+                    exc_info = sys.exc_info()
+            else:
+                # We already have an exception to propagate, so log any errors
+                # but don't propagate them.
+                run_cleanup(cleanup)
+        if exc_info is not None:
+            raise exc_info[0], exc_info[1], exc_info[2]
+        # No error, so we can return the result
+        return result
+
+
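To see the do_with_cleanups shape at a call site too, a rough sketch
only, with do_the_work as a placeholder:

    from bzrlib.cleanup import do_with_cleanups

    def update_branch(branch):
        branch.lock_write()
        def body():
            return do_the_work(branch)   # placeholder for the real body
        # The first error from body() or branch.unlock() propagates;
        # any later cleanup errors are only logged.
        return do_with_cleanups(body, [branch.unlock])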
=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py 2009-09-22 04:25:05 +0000
+++ bzrlib/tests/__init__.py 2009-09-24 01:05:22 +0000
@@ -3854,6 +3854,188 @@
This function can be replaced if you need to change the default test
suite on a global basis, but it is not encouraged.
"""
+<<<<<<< TREE
+=======
+ testmod_names = [
Some Launchpad mp diff failure here, it seems...
loader = TestUtil.TestLoader()
OK, so +1 from me, but the proof will be in how it would handle the
bugs reported as TooMany*. It might be worth grepping for them and
looking at their call stacks.
--
Martin <http://launchpad.net/~mbp/>