[MERGE][#230567] Faster (local) branch

Fri May 23 23:36:07 BST 2008

Sorry for sending this twice, I hit send to early before.

Ian Clatworthy wrote
| Ian Clatworthy wrote:
|
|> This patch improves Bazaar's local branch in shared-repo time
|> from 22 secs to 15.8 secs on a Mozilla tree with 12456 revisions.
|> It also improves the local branch time on an OpenOffice.org
|> tree with 305k revisions from 73.4 secs to 41.5 secs.
|
| This version fixes a bug in the previous version and goes back
| to using the preferred way of updating the inventory. It's a
| fraction of a second (0.2 or so) slower but that's the right
| tradeoff vs using an 'evil' API.
|
| Ian C.
|

I think the reason we had the slow route, is that people wanted to be able to
'bzr co URL .' into a directory that already had files. I don't know who asked
for that, etc, but I remember a conversation about it.

I can say that if it is costing us a lot, then we might want to reconsider how
we do it. (For example, if the overhead is statting the possible target path, we
could avoid that if we know its parent directory does not exist, etc.)

Other comments, though...

1) You set the executability after renaming the files. It seems backward, but
maybe it is only changing ones that already existed?

2) If it is the wt.abspath call, I know of ways to make 'pathjoin' faster. At
the moment it tries hard to handle cases we don't care about. join('foo////',
'bar') etc.

Arguably it is the callers responsibility to know that it isn't passing an
already absolute path to wt.abspath(), and then the function can be a trivial:

return self.basedir + '/' + filename

Or if we want a bit of safety

if (filename.startswith('/')
~    or sys.platform == 'win32' and filename[1:2] == ':'):
~    raise AssertionError...

return self.basedir + '/' + filename

3) If this really is the best route, how about we turn it on if we know we just
created the target directory. (Hence we *know* it is empty.) Or even doing a
"os.listdir('target_dir')" and seeing if it is empty, or only contains a '.bzr'
directory. That way 'bzr co .' can still benefit from these code paths as long
as you don't actually have files around that it should worry about.

What is the benefit of:
+        new_paths = [(fp.get_path(t), t) for t in self._needs_rename]
+        for path, trans_id in new_paths:

versus
for trans_id in self._needs_rename:
   path = fp.get_path(trans_id)

You have the same pattern later on for "exec_paths".

If you really wanted it to be fast, you need to get rid of the attribute lookup, 
and do something like:

get_path = fp.get_path

new_paths = [(get_path(t), t) for t in self._needs_rename]

Though I'm guessing that _needs_rename is the short list. For example, only the 
top-level entries in a fresh 'build_tree'. It might also depend on the order 
that we see these files, but you seem to be using iter_entries_by_dir which 
should be ok. (You just need to see all directories before you see their contents.)

I would rather not add another code path if we can help it, but I didn't do the 
profiling you did.

On windows, doing 'bzr co --lightweight' I see 36s total, and 23.8 of that is in 
'writelines' and the other 7.8s is in 'open()'.

Only leaving 6s (under lsprof) for the rest, which includes extracting texts 
from packs, etc. Though it is only 8s without lsprof, so I'm a bit surprised to 
see writelines() being that slow.

John
=:->