[Merge] performance increases for 'bzr status'
Robert Collins
robertc at robertcollins.net
Tue May 30 00:49:34 BST 2006
On Sat, 2006-05-27 at 04:29 -0500, John Arbash Meinel wrote:
> I spent some time profiling the 'bzr status' calls, and came up with a
> few changes. I'm using the new benchmark stuff, and it's pretty nice.
> (Though the nicest part comes from lsprof.)
>
> changes are here:
> http://bzr.arbash-meinel.com/branches/bzr/minor-performance
>
>
> first 'bzrlib.delta.compare_trees' now uses 'Tree.list_files()' rather
> than iterating randomly. This eliminates all of the 'id2path' calls,
> since list_files builds up the paths as it goes along.
>
> Basically it uses the iterators from the pre and post trees, and then
> syncs up based on path, keeping track of file ids that don't match, and
> matching them up if they show up later.
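(As an aside for readers following along: the merge-by-path approach described here can be sketched roughly as below. This is an illustrative toy, not the actual compare_trees code; the name sync_by_path and the plain (path, file_id) tuples are made up for the example.)

```python
# Toy sketch of syncing two path-sorted iterators, parking entries whose
# file_id has only been seen on one side until a match (rename) appears.
def sync_by_path(old_entries, new_entries):
    removed = {}   # file_id -> old path, seen only in the old tree so far
    added = {}     # file_id -> new path, seen only in the new tree so far
    renamed = []

    def handle_old(path, file_id):
        """An old entry with no new entry at the same path."""
        if file_id in added:
            # The file_id showed up earlier at a new location: a rename.
            renamed.append((path, added.pop(file_id)))
        else:
            removed[file_id] = path

    def handle_new(path, file_id):
        """A new entry with no old entry at the same path."""
        if file_id in removed:
            # We saw this file_id at an old location: a rename, not an add.
            renamed.append((removed.pop(file_id), path))
        else:
            added[file_id] = path

    old, new = list(old_entries), list(new_entries)
    i = j = 0
    matched = []
    while i < len(old) and j < len(new):
        old_path, old_id = old[i]
        new_path, new_id = new[j]
        if old_path == new_path and old_id == new_id:
            matched.append(old_path)
            i += 1
            j += 1
        elif old_path <= new_path:
            handle_old(old_path, old_id)
            i += 1
        else:
            handle_new(new_path, new_id)
            j += 1
    for path, fid in old[i:]:
        handle_old(path, fid)
    for path, fid in new[j:]:
        handle_new(path, fid)
    return matched, renamed, list(removed.values()), list(added.values())
```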
>
> second WorkingTree.list_files uses a stack rather than a recursive
> function. This shaves off a few hundred ms. I tuned list_files quite a bit.
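(The stack-for-recursion rewrite looks roughly like this toy version; the real WorkingTree.list_files does much more, such as classifying entries as versioned/ignored/unknown.)

```python
import os

# Illustrative only: walk a directory tree with an explicit stack instead
# of a recursive helper, yielding paths relative to the root.
def list_files_iterative(root):
    stack = [('', root)]   # (relative path prefix, absolute directory)
    while stack:
        prefix, abspath = stack.pop()
        for name in sorted(os.listdir(abspath)):
            full = os.path.join(abspath, name)
            if os.path.isdir(full):
                # Defer subdirectories instead of recursing into them.
                stack.append((prefix + name + '/', full))
            else:
                yield prefix + name
```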
>
> third bzrlib.osutils.file_kind can be done with a single lstat and
> dictionary lookup. This function gets called a lot, and optimizing it
> has some pretty good performance benefits.
>
> fourth Inventory._read_tree_state already knows what path the file is
> at (it has to be passed in), but it wasn't passing that path on to
> WorkingTree. So make path an optional parameter of get_file_sha1 and
> is_executable
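(The shape of that change is easy to show with a toy class. SketchTree, its hard-coded path table, and the fake sha are all invented for illustration; only the pattern of the optional path parameter mirrors the patch.)

```python
# Illustrative sketch: callers that already know the path pass it in,
# so the tree only pays for an id2path lookup when it has to.
class SketchTree:
    def __init__(self):
        self._paths = {'id-1': 'dir/file.txt'}
        self.id2path_calls = 0

    def id2path(self, file_id):
        self.id2path_calls += 1
        return self._paths[file_id]

    def get_file_sha1(self, file_id, path=None):
        if path is None:
            # Fall back to the (expensive) lookup only when needed.
            path = self.id2path(file_id)
        return 'sha1-of-' + path
```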
>
> fifth Inventory.iter_entries also doesn't need to be a recursive
> function. Using properly designed stacks gets the profiled time down
> to as little as 1/6th for a kernel sized tree.
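(The same stack technique applied to a nested structure: keep a stack of live child iterators, descend by pushing, and resume the parent iterator after popping. Again a made-up toy using nested dicts, not the actual Inventory code.)

```python
# Illustrative only: depth-first pre-order traversal of a nested dict
# (name -> leaf value or nested dict) without recursion.
def iter_entries(root_children):
    stack = [('', iter(sorted(root_children.items())))]
    while stack:
        prefix, children = stack[-1]
        descended = False
        for name, value in children:
            path = prefix + name
            yield path
            if isinstance(value, dict):
                # Push the child's iterator; the parent iterator keeps
                # its position and is resumed after the child is popped.
                stack.append((path + '/', iter(sorted(value.items()))))
                descended = True
                break
        if not descended:
            stack.pop()   # this level is exhausted
```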
>
> Do we still want to pre-scan the hashcache when we create a WorkingTree?
> I know the point was to load up the inode cache by scanning in-order
> rather than doing a random walk. It costs us by requiring a stat of every
> file in the tree, which we then effectively throw away.
> I realize it isn't something that benchmarks well, because benchmarks
> pretty much always run hot. But in my testing, it shaves 500ms off the
> 'bzr status' time for a kernel sized tree.
> And now that my 'compare_trees' is at least going in sorted filesystem
> order, it may be closer to inode order than going in a random file_id
> hash sorted order.
>
>
> Before my changes the times looked like this:
> test_no_changes_known_kernel_like_tree OK 5174ms/91565ms
> test_no_ignored_unknown_kernel_like_tree OK 1665ms/15044ms
>
> And after:
> test_no_changes_known_kernel_like_tree OK 3591ms/83970ms
> test_no_ignored_unknown_kernel_like_tree OK 1432ms/14309ms
>
>
> the 'no_ignored' tree is a kernel_like tree where everything is added,
> but nothing has been committed yet, which is only one side of status
> (since one tree is empty).
> After the first commit 'bzr status' now has to compare both trees. So it
> is quite a bit slower. (It has to check the difference for each file, etc).
Cool.
/...
> +
> +class WorkingTreeBenchmark(Benchmark):
> +
> + def test_list_files_kernel_like_tree(self):
> + self.make_kernel_like_tree()
> + self.run_bzr('add')
> + tree = WorkingTree.open('.')
> + self.time(list, tree.list_files())
> +
> + def test_list_files_unknown_kernel_like_tree(self):
> + self.make_kernel_like_tree()
> + tree = WorkingTree.open('.')
> + for root, dirs, files in os.walk('.'):
> + if '.bzr' in dirs:
> + dirs.remove('.bzr')
> + if root == '.':
> + continue
> + tree.add(root)
> + self.time(list, tree.list_files())
> +
These could do with some explanation - the second one specifically I had
to read twice to determine what code path you are hoping to expose
issues in.
> === modified file 'BRANCH.TODO'
> --- BRANCH.TODO
> +++ BRANCH.TODO
> @@ -1,3 +1,5 @@
> # This file is for listing TODOs for branches that are being worked on.
> # It should ALWAYS be empty in the mainline or in integration branches.
> #
> +
> +Test osutils.file_kind for more than just regular files and directories
This should either be done, or put in TODO before merging.
> ...
> + # If the name changes, or the parent_id changes, we have a rename
> + # (if we move a parent, that doesn't count as a rename for the file)
> + if (old_entry.name != new_entry.name
> + or old_entry.parent_id != new_entry.parent_id):
> + delta.renamed.append((old_path,
> + new_path,
> + old_entry.file_id, old_entry.kind,
> + text_modified, meta_modified))
> + elif text_modified or meta_modified:
> + delta.modified.append((new_path, new_entry.file_id, new_entry.kind,
> + text_modified, meta_modified))
> + elif want_unchanged:
> + delta.unchanged.append((new_path, new_entry.file_id, new_entry.kind))
We have a method on InventoryEntry that is very close to the logic here.
Maybe we should have a TODO to combine/reuse this logic.
> +
> + def handle_old(path, entry):
> + """old entry without a new entry match"""
> + if entry.file_id in added:
> + # Actually this is a rename, we found a new file_id earlier
> + # at a different location, so it is no-longer added
> + x_new_path, x_new_entry = added.pop(entry.file_id)
> + check_matching(path, entry, x_new_path, x_new_entry)
> + else:
> + # We have an old_file_id which doesn't line up with a new_file_id
> + # So this file looks to be removed
> + assert entry.file_id not in removed
> + removed[entry.file_id] = path, entry
> +
> + def handle_new(path, entry):
> + """new entry without an old entry match"""
> + if entry.file_id in removed:
> + # We saw this file_id earlier at an old different location
> + # it is no longer removed, just renamed
> + x_old_path, x_old_entry = removed.pop(entry.file_id)
> + check_matching(x_old_path, x_old_entry, path, entry)
> + else:
> + # We have a new file which does not match an old file
> + # mark it as added
> + assert entry.file_id not in added
> + added[entry.file_id] = path, entry
These two methods could use a little more exposition in the docstring I
think. (I.e. 'This method is called when a file appears to have been
deleted. It can convert what we thought was an added file to a rename,
as we gain information about what files are where.')
...
> + delta.added.append((new_path, new_entry.file_id, new_entry.kind))
> +
> delta.removed.sort()
> delta.added.sort()
> delta.renamed.sort()
Do we still need to sort? (If we are adding in alphabetical order?)
> === modified file 'bzrlib/osutils.py'
> --- bzrlib/osutils.py
> +++ bzrlib/osutils.py
> @@ -25,6 +25,7 @@
> import re
> import sha
> import shutil
> +import stat
> import string
> import sys
> import time
> @@ -38,6 +39,7 @@
> PathNotChild,
> IllegalPath,
> )
> +from bzrlib.symbol_versioning import *
> from bzrlib.trace import mutter
> import bzrlib.win32console
>
> @@ -74,23 +76,19 @@
> return f
>
>
> +_formats = {
> + stat.S_IFDIR:'directory',
> + stat.S_IFCHR:'chardev',
> + stat.S_IFBLK:'block',
> + stat.S_IFREG:'file',
> + stat.S_IFIFO:'fifo',
> + stat.S_IFLNK:'symlink',
> + stat.S_IFSOCK:'socket',
> +}
> def file_kind(f):
> - mode = os.lstat(f)[ST_MODE]
> - if S_ISREG(mode):
> - return 'file'
> - elif S_ISDIR(mode):
> - return 'directory'
> - elif S_ISLNK(mode):
> - return 'symlink'
> - elif S_ISCHR(mode):
> - return 'chardev'
> - elif S_ISBLK(mode):
> - return 'block'
> - elif S_ISFIFO(mode):
> - return 'fifo'
> - elif S_ISSOCK(mode):
> - return 'socket'
> - else:
> + try:
> + return _formats[os.lstat(f).st_mode & 0170000]
> + except KeyError:
> return 'unknown'
perhaps:
    return _formats.get(os.lstat(f).st_mode & 0170000, 'unknown')
for more tuning though:
def file_kind(f, _formats=_formats, _unknown='unknown'):
return _formats.get(os.lstat(f).st_mode & 0170000, _unknown)
AIUI this should be about as fast as you can get: the use of keyword
parameters to pass _formats and _unknown makes them into locals rather
than globals, so they get compiled into the bytecode, and not allocated
each time.
It's possible the try: except: is actually faster than the get() call -
but even there the keyword-parameter approach should be a benefit.
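(A runnable version of that keyword-default sketch, for the curious. The _formats table is trimmed for brevity, and 0o170000 is the modern spelling of the octal S_IFMT mask 0170000.)

```python
import os
import stat

# Trimmed file-type table; the real patch maps all seven S_IF* kinds.
_formats = {
    stat.S_IFDIR: 'directory',
    stat.S_IFREG: 'file',
    stat.S_IFLNK: 'symlink',
}

def file_kind(f, _formats=_formats, _unknown='unknown'):
    # _formats and _unknown are bound once at def time, so inside the
    # function body they are fast local lookups, not module globals.
    return _formats.get(os.lstat(f).st_mode & 0o170000, _unknown)
```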
+1 - the only Must is the change to BRANCH.TODO, everything else is
advisory.
Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.