[RFC] Add run_cleanup, do_with_cleanup

Thu Sep 24 05:00:25 BST 2009

2009/9/24 Andrew Bennetts <andrew.bennetts at canonical.com>:
> [Trying an RFC for a branch by cross-posting to both the merge proposal on
> Launchpad, and the bazaar@ list.]
>
> Hi all,
>
> Here's the problem.  In this example, Python gives no way for final_func to know
> if try_func raised an error or not:
>
>    try:
>        try_func()
>    finally:
>        final_func()
>
> Not even by sys._getframe hacks, AFAICT.  I can go into some gory details about
> why not, but more important is what to do about it.
>
> It's problematic because, in general, we don't want an error from final_func to
> override an error from try_func.  This means that sometimes users get errors
> about, say, abort_write_group failing, when what originally went wrong was
> “connection lost”.  The various TooManyConcurrentRequests bugs are the most
> common symptom of this.
>
> This brings me to <https://bugs.launchpad.net/bzr/+bug/429747>, and
> <https://code.launchpad.net/~spiv/bzr/cleanup-hof/+merge/12318>.
>
> Bug 429747 proposes a higher-order-function that all cleanups should use to at
> least give us a common place to control policy about log vs. raise, etc.  That
> branch implements that, as a function called run_cleanup (and some variations).
> It lives in a new module, bzrlib.cleanups.  I have tried to give the new module
> and its contents fairly reasonable docstrings, and clear tests.
>
> The downside of run_cleanup is that it will suppress some exceptions that should
> be propagated, but that seems to be unavoidable given the limitations of Python.
> (And note that developers can always use -Dcleanup to get UI warnings about
> failures in run_cleanup).
>
> The branch also implements something called do_with_cleanups, which can actually
> be pretty robust and correct, at the cost of making simple callsites a bit
> uglier.  I have a follow-on patch that adjusts bzrlib.commit to use this, which
> would fix some outstanding bugs.
>

> So, here's the RFC: I think the policy should be:
>
>  * use run_cleanup if a cleanup failure is likely to be unimportant to a user
>   (or at least, the chance of it being important is < the chance of the main
>   func raising an error that should not be obscured).
>  * consider changing the cleanup function itself to never raise errors, e.g.
>   abort_write_group(suppress_error=True).  Then idiomatic try/finally is safe.
>   Again, only applicable if failures during cleanup are "uninteresting".
>  * use run_cleanups if you are running multiple cleanups; it takes care of
>   running them all even if an exception happens.
>  * if you want very correct behaviour, use do_with_cleanups.  This tends to make
>   simple callsites a bit uglier, but in already complex code (like commit) it
>   may be no worse.
>
> If agreed, I can add this to HACKING.txt.

(I haven't read the diff yet, but we did talk about it ...)

I was originally suggesting that we should handle this by putting all
cleanup-type code through a single bottleneck.  I think you're on to
something in saying (if you are) that we should instead put it through
common policies for different types of cleanup.

My sense is that this is more about the type of cleanup than the place
it's being called from, so I'd expect that normally we will be
changing the functions called at cleanup, not the code with the
try/finally block.  Of course if there is code with multiply nested
try/finally blocks and we can make it cleaner by changing to a
higher-order-function that runs them in order, that's great.

I think we can distinguish different types of severity of failure.
For instance, if we fail to unlock a branch, that's worth telling the
user about, but it's bad to give them such a large or severe message
that it obscures their view of what actually went wrong, as often
seems to happen in bugs.

One blue-sky option here would be to indicate success or failure by
return code: unlock (say) returns True if it actually unlocked, False
if it failed.  Then code that really cares can see, but the default
will be not to interrupt processing.  This gives you a way to have
failures that are 'slightly interesting'.

It seems to me the most common requests are:

 * TooManyConcurrentRequests as a knock-on effect from a previous
operation being interrupted either because of a client-side bug or
error, or a connection dropping
 * Unlock failing because the connection has dropped or in conjunction
with the above

The particular cleanups are probably unlock and then
abort_write_group, and while they do have consequences failing to
complete them is not a very severe problem.

-- 
Martin <http://launchpad.net/~mbp/>