What should happen if bzr is unable to complete a task?

Wichmann, Mats D mats.d.wichmann at intel.com
Thu Jun 18 17:23:15 BST 2009



________________________________
From: bazaar-bounces at lists.canonical.com [mailto:bazaar-bounces at lists.canonical.com] On Behalf Of Maritza Mendez
Sent: Thursday, June 18, 2009 10:07 AM
To: Bazaar
Subject: What should happen if bzr is unable to complete a task?


I've discussed this briefly with Alexander in the special context of 'bzr qbranch' in Bazaar Explorer, but there is a more general question I am trying to express.  Please help me find the best way to ask this question.

I have a badly behaving set of computers at work.  Actually, I think the problem is with the subnet and our IT guys are working on it.  But meanwhile it has made me think about a more general question.  What we see is that a 'bzr branch' operation can hang if the parent location is outside our firewall.  Inside, no problem.  At some point, the progress bar freezes and nothing happens.  I've waited at least ten minutes in some cases.  Nothing.  In each case, we see that the working directory has been created with the expected structure but no working tree (obviously) and there is a .fetch file.  We're able to determine that the bzr process is still holding a handle to the fetch file, but the file has stopped growing.

No problem: kill bzr and start over.  The fact that this happens more often than not for us is our problem.  We have some network problem that mysteriously affects bzr and (as far as we know) nothing else.   We're working on that.

But what about unattended scripts?  And what about the bzr-explorer and future tools which invoke bzr (rather than calling bzrlib)?  How should they recover if bzr stalls because it can't get what it needs?
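
One way a calling tool might detect the kind of stall described above (the fetch file has stopped growing but the process is still alive) is to watch the target directory and give up when its size stops changing. The following is only a sketch under that assumption: `tree_size` and `run_until_stalled` are hypothetical helper names, not part of bzr or bzrlib, and the stall threshold is something each site would have to choose for itself.

```python
import os
import subprocess
import time


def tree_size(path):
    """Total size in bytes of all files under path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished between listing and stat
    return total


def run_until_stalled(cmd, watch_path, stall_s, poll_s=1.0):
    """Run cmd, killing it if watch_path stops changing size for stall_s seconds.

    Returns the exit code, or None if the process was killed as stalled.
    """
    proc = subprocess.Popen(cmd)
    last_size = -1
    last_change = time.monotonic()
    while proc.poll() is None:
        size = tree_size(watch_path)
        now = time.monotonic()
        if size != last_size:
            # Progress was made: remember the new size and reset the clock.
            last_size, last_change = size, now
        elif now - last_change >= stall_s:
            proc.kill()
            proc.wait()
            return None
        time.sleep(poll_s)
    return proc.returncode
```

A script could call this as `run_until_stalled(["bzr", "branch", source, target], target, stall_s=600)` and treat a `None` result as "stalled, clean up and retry". Size on disk is only a rough proxy for progress, so a legitimately slow phase that writes nothing could be killed by mistake.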


I've had this exact thing happen, most recently yesterday :)  The LSB project has an autobuild scheme where seven different systems (some of them virtual machines) build a whole load of stuff every night, and the builds of course make sure they're using the latest code from our master bzr branches.  Five of the "machines" are proximate: boxes sitting in the same pair of racks in a data center as the server hosting the master branches.  The other two are contributed resources that are far removed from our main machines and are subject to periodic slowness.  Both are big IBM machines where we're granted a virtual machine to work in, and other things going on in other virtual machines on the same hardware may cause slowness due to contention for finite resources, mainly network.  Of course any such factors are "external" to the VMs, so we can't see what might be going on.

Just over the last couple of days I was getting failure emails on one of these that indicated a lock was being held, so an operation couldn't be performed.  Going in, I found a 57-hour-old "bzr branch" that was just "stuck".  I killed it, broke the locks, and everything's fine now.  I didn't really investigate, as I'm so used to one transient or another messing up something in the autobuilds, but why would a bzr branch just hang and neither finish nor time out?  We certainly didn't have any way to recover from this automatically, although now that I'm aware of the issue I might put some sort of watchdog in place.
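
A watchdog of that sort could be as simple as a wrapper that runs the command with an upper time limit and kills it when the limit passes. This is a minimal sketch, not anything bzr provides: `run_with_timeout` is a hypothetical helper, and the limit is an assumption that would need tuning to how long a normal branch takes on the slow machines.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed after timeout_s seconds."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The command overran its limit: kill it and reap the process.
        proc.kill()
        proc.wait()
        return None
```

An autobuild script might call `run_with_timeout(["bzr", "branch", source, target], 3600)` and, on a `None` result, send the failure email right away instead of leaving a 57-hour-old process behind. Any locks left by the killed process would still have to be broken separately.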
