[rfc] dealing with knock-on errors

Thu Jul 9 02:07:49 BST 2009

Bazaar has some important cleanup code that is typically run from
try/finally blocks: most importantly releasing locks, but also
finishing write groups and progress bars.  We have recurrent problems
with what you could call knock-on errors arising while running this
code.  For example, some error occurs on the server causing a client
RPC to fail, so the client tries to unlock, and then that fails too
because the network connection is in a bad state.

The problem is that the error raised inside the cleanup code typically
masks the original error, and it usually tells us less about what
actually went wrong.  For example, if the network connection has been
dropped, it's usually more helpful for debugging to hear about why it
was originally dropped than to hear that the lock couldn't be released
because the connection was dropped.  This is true for bzr developers
and also for users, for whom a good error message may help them
resolve a problem themselves.  Masking also creates overhead in the
bug tracker, because the initial bug report is not useful.
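
To make the masking concrete, here is a minimal standalone sketch.  The
exception classes and functions are invented for illustration and are
not bzrlib code; the point is only that the caller ends up catching the
error from the finally block rather than the one that started it.

```python
class ConnectionDropped(Exception):
    """Hypothetical original failure."""


class UnlockFailed(Exception):
    """Hypothetical knock-on failure raised by cleanup."""


def do_rpc():
    # The interesting error: why the connection went away.
    raise ConnectionDropped("server closed the connection")


def unlock():
    # Cleanup fails too, because the connection is already gone.
    raise UnlockFailed("cannot release lock: connection is closed")


def operation():
    try:
        do_rpc()
    finally:
        unlock()  # raising here replaces the exception in flight


def run():
    try:
        operation()
    except Exception as e:
        return e


caught = run()
print(type(caught).__name__)  # the knock-on error, not ConnectionDropped
```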

Although most failures in these methods are knock-on failures, it is
of course possible that we'll have bugs or other types of failure in
these methods, so it's not good either in testing or production to
just squelch all errors.

A few ideas have been floated to tackle this, including wrapping all
cleanup into try/except/finally so there's an explicit error and
non-error cleanup path, but many have bad tradeoffs.
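
One way the explicit two-path approach might look is sketched below.
The helper name with_cleanup is made up, and this is only one possible
reading of that idea.  Note that the error path here has to swallow the
cleanup failure to avoid masking, and that the pattern would need
repeating (or wrapping) at every call site.

```python
def with_cleanup(work, cleanup):
    """Hypothetical helper: run work(), then cleanup() on both paths."""
    try:
        result = work()
    except Exception:
        # Error path: cleanup is best-effort so that it cannot mask
        # the original exception.
        try:
            cleanup()
        except Exception:
            pass  # the knock-on error is silently dropped here
        raise  # re-raise the original exception (Python 3 semantics)
    else:
        # Non-error path: a cleanup failure is a real error.
        cleanup()
        return result
```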

I have a different idea, which is to have one or two functions
encapsulating the policy for "an error occurred in a cleanup
function".  Cleanup code can call them either when it's about to raise
an error, or when it sees that an error has occurred at a lower level.
We can then change the policy in one place: we can either just give a
warning, or we can raise an error.  We could potentially change this
at runtime through a debug flag, or we could use Python warnings and
control it using -Werror.  Since make check runs with -Werror turned
on, you won't get away with having this happen during testing, but
during production use or normal debugging you'll normally see the
original exception.  It could perhaps even always log the traceback of
the lock failure, even if it's not going to raise an exception.
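
As a rough sketch of how the Python warnings variant could work: the
names CleanupWarning, cleanup_error and FakeLock below are made up for
illustration and are not existing bzrlib API.  The cleanup code reports
its failure through the policy function instead of raising directly,
and the warnings filter decides whether that becomes an exception.

```python
import warnings


class CleanupWarning(UserWarning):
    """Hypothetical warning category for failures in cleanup code."""


def cleanup_error(what, exc):
    """Hypothetical policy hook, called instead of raising in cleanup.

    With the default filters this only emits a warning; with an "error"
    filter (for example python -Werror) warnings.warn raises it instead.
    """
    warnings.warn("%s failed: %s" % (what, exc), CleanupWarning,
                  stacklevel=2)


class FakeLock(object):
    def unlock(self):
        try:
            # Stand-in for the real unlock call failing.
            raise IOError("connection is closed")
        except IOError as e:
            cleanup_error("unlock", e)


def operation(lock):
    try:
        raise RuntimeError("server error")  # the original failure
    finally:
        lock.unlock()


def outcome(action):
    """Run operation() under a given warnings filter action."""
    with warnings.catch_warnings():
        warnings.simplefilter(action, CleanupWarning)
        try:
            operation(FakeLock())
        except Exception as e:
            return type(e).__name__
```

Here outcome("always") corresponds to production use, where the
original RuntimeError reaches the caller and the unlock failure is only
warned about, while outcome("error") corresponds to a -Werror test run,
where the cleanup failure is raised.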

-- 
Martin <http://launchpad.net/~mbp/>