[Merge] Two Phase Locking

Martin Pool mbp at canonical.com
Fri Jul 7 11:08:30 BST 2006


On 30 Jun 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:

> While going through the code, I realized that Branch and WorkingTree
> don't agree on the order of locking, and Branch wasn't cleaning up if it
> had problems while locking.
> 
> So I wrote the attached patch, which tests all branch implementations
> and working tree implementations to make sure that they lock in the
> correct order, and unlock in the correct order. It instruments the
> lock/unlock calls so that they can be recorded/forced to fail.
> 
> After discussing with Robert on IRC, we figured that objects should
> always lock Other before Self. In a single threaded app, it doesn't
> really matter, and there is a nice property of making sure you can lock
> yourself before you try to lock other (shallow is checked before deep).
> However, if you had a multi-threaded app, it might believe that it was
> successfully locked, even though Other wasn't locked yet.
> So we went for Other before Self, because it seems more correct.

I'm not totally sure I buy that - in a multi-threaded app we will need to
consider locking between threads, which can be done in various ways.  I think on
the whole you would not want to override the existing external locks
to do that, but rather have between-thread locks.  Those would naturally
be held during the process of taking or releasing the external locks, so
this would not be visible.
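
To illustrate what I mean (a rough sketch only; ThreadedBranchWrapper and
_thread_lock are made-up names, not anything in bzrlib):

    import threading

    class ThreadedBranchWrapper(object):
        """Hypothetical: serialise external locking between threads."""

        def __init__(self, branch):
            self._branch = branch
            # between-thread lock, private to this process
            self._thread_lock = threading.Lock()

        def lock_write(self):
            # The thread lock is held only while the external (on-disk)
            # lock is being taken, so other threads never observe a
            # half-locked state.
            self._thread_lock.acquire()
            try:
                self._branch.lock_write()
            finally:
                self._thread_lock.release()

        def unlock(self):
            self._thread_lock.acquire()
            try:
                self._branch.unlock()
            finally:
                self._thread_lock.release()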

I think I can give a better explanation for that.  We want locks to be
taken in a consistent order to avoid deadlocks.  We have a layered
structure, so we need to specify the ordering between layers, and also
between lockable objects on the same layer.  In general it's better to
lock from the bottom up.  (OK, that's not much better...)
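
Concretely, the situation a consistent order rules out looks something like
this (schematically, whether the callers are two threads or two processes):

    caller A: branch.lock_write()      ...then wants repository.lock_write()
    caller B: repository.lock_write()  ...then wants branch.lock_write()

Each one holds the lock the other needs, so neither can make progress.  If
everyone takes the repository (the lower layer) before the branch, that
cycle can't form.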

> The only question left is whether calling Branch.unlock() should always
> unlock the Repository, even if there is some exception while unlocking
> Branch.control_files().
> 
> It is cleaner to always unlock Other, so you don't leave the Repository
> locked if you (for whatever reason) failed to unlock the WorkingTree.
> 
> But going back to the multi-threaded POV, if you expect things are safe
> if Self is locked, this could put you in the position of Self being
> locked but not Other.
> 
> Our code base is fairly robust against either behavior, mostly because
> every function that needs a lock is decorated with '@needs_read_lock' or
> '@needs_write_lock'.
> So even if you didn't have the Repository locked, the next function
> which actually needed it would lock it for you.
> 
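
For reference, a decorator along those lines can be quite small.  This is
only a sketch of the idea, not the actual bzrlib decorators:

    def needs_write_lock(unbound):
        """Run the decorated method with a write lock held on 'self'."""
        def decorated(self, *args, **kwargs):
            self.lock_write()
            try:
                return unbound(self, *args, **kwargs)
            finally:
                self.unlock()
        decorated.__name__ = unbound.__name__
        decorated.__doc__ = unbound.__doc__
        return decorated

Used as '@needs_write_lock' on a method, it pairs every lock_write() with an
unlock(), which is why a missed repository lock elsewhere tends to get
papered over.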
> The code was always unlocking, so I kept that behavior, and the tests
> are asserting that it is occurring. But I want to bring up the
> discussion if it would be better to change this.
> 
> One other point is that always unlocking means you are trying to do
> something while an exception is occurring. So if the sftp link goes
> down, you actually try twice, and get 2 exceptions, rather than just one.

There's not really any good way to avoid it.  I suppose we could make it
a bit nicer by having the sftp transport know "I'm dead", and just
immediately fail operations rather than getting some kind of IO error.
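
Something like this, perhaps (a hypothetical wrapper, not the real
Transport interface; the _dead flag is made up):

    class FailFastTransport(object):
        """Stop touching the connection once it is known to be dead."""

        def __init__(self, transport):
            self._transport = transport
            self._dead = False

        def get(self, relpath):
            if self._dead:
                # Fail immediately instead of producing a second IO error
                # during cleanup.
                raise IOError('connection is already known to be dead')
            try:
                return self._transport.get(relpath)
            except (IOError, OSError):
                self._dead = True
                raise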

> Anyway, the attached patch fixes Branch.lock/unlock, and has some tests
> for asserting correct behavior, whatever we decide to do. It is present
> on 'jam-integration'.

Just one thing I'd like to fix or discuss, otherwise +1 to come in.  

>      def lock_write(self):
> -        # TODO: test for failed two phase locks. This is known broken.
> -        self.control_files.lock_write()
>          self.repository.lock_write()
> +        try:
> +            self.control_files.lock_write()
> +        except:
> +            self.repository.unlock()
> +            raise

I just discovered something interesting in Python - I had always
imagined that a bare 'raise' was equivalent to raising the same object,
and so you lost the original traceback.  But it turns out that it does
preserve it, which is rather nice (and different to Java and C#, I
think).  That relieves something that had always worried me about this
try/except/raise pattern.
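
A quick way to see it:

    import traceback

    def fail():
        raise IOError('simulated lock failure')

    def lock_both():
        # mirrors the shape of the Branch.lock_write() hunk above
        try:
            fail()
        except:
            # cleanup (e.g. repository.unlock()) would go here
            raise            # bare raise: the original traceback is kept

    try:
        lock_both()
    except IOError:
        # The report still points into fail(), where the error was
        # originally raised.
        traceback.print_exc()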

> === modified file bzrlib/lock.py // last-changed:john at arbash-meinel.com-20060630195802-c6e78240f04c9d52
> --- bzrlib/lock.py
> +++ bzrlib/lock.py
> @@ -39,7 +39,66 @@
>  import sys
>  
>  from bzrlib.trace import mutter
> -from bzrlib.errors import LockError
> +from bzrlib.errors import LockError, TestPreventLocking
> +
> +
> +class LockWrapper(object):
> +    """A wrapper which lets us set locking ability.
> +
> +    This also lets us record what objects were locked in what order,
> +    to ensure that locking happens correctly.
> +    """

I'd rather strongly prefer that code like this, which is only used during
testing, be put in the test subdirectory rather than in the module which
it supports.  (Or maybe the intention is that this be used elsewhere?
But it seems unlikely.)

A couple of reasons:
 
 - having code here that's never used in normal operation can't help
   with load time and memory usage

 - things like this shouldn't be relied on by non-test code

 - people reading this file in the first instance just want to see what
   it actually does, not how it's tested

 - it makes code review easier - for example I think the getattr would
   be harder to justify in regular code, but makes sense for what it's
   doing here

Does this make sense?

Some things are a matter of judgement - having just a couple of test
helper methods on a class is probably fine, as is having transport
servers that are currently only used by the tests.

Then you could also move TestPreventLocking there too.
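
For anyone skimming the thread, the shape of the wrapper the docstring
describes is roughly this -- my sketch of the idea, not the code from the
patch:

    class TestPreventLocking(Exception):
        """Stand-in for the exception the patch adds to bzrlib.errors."""


    class LockWrapper(object):
        """Record lock/unlock calls on a wrapped object, in order."""

        def __init__(self, sequence, other, other_id):
            # 'sequence' is a list shared by several wrappers, so a test
            # can assert on the overall locking order afterwards.
            self._sequence = sequence
            self._other = other
            self._other_id = other_id
            self._allow_write = True

        def __getattr__(self, attr):
            # Anything not instrumented is passed straight through --
            # this is the getattr I mentioned above.
            return getattr(self._other, attr)

        def lock_write(self):
            self._sequence.append((self._other_id, 'lock_write'))
            if not self._allow_write:
                raise TestPreventLocking('lock_write disabled for this test')
            return self._other.lock_write()

        def unlock(self):
            self._sequence.append((self._other_id, 'unlock'))
            return self._other.unlock()

        def disable_lock_write(self):
            self._allow_write = False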

-- 
Martin



