[Merge] performance increases for 'bzr status'

John A Meinel john at arbash-meinel.com
Tue May 30 06:33:37 BST 2006


Robert Collins wrote:
> On Sat, 2006-05-27 at 04:29 -0500, John Arbash Meinel wrote:

...

> /...
>> +
>> +class WorkingTreeBenchmark(Benchmark):
>> +
>> +    def test_list_files_kernel_like_tree(self):
>> +        self.make_kernel_like_tree()
>> +        self.run_bzr('add')
>> +        tree = WorkingTree.open('.')
>> +        self.time(list, tree.list_files())
>> +
>> +    def test_list_files_unknown_kernel_like_tree(self):
>> +        self.make_kernel_like_tree()
>> +        tree = WorkingTree.open('.')
>> +        for root, dirs, files in os.walk('.'):
>> +            if '.bzr' in dirs:
>> +                dirs.remove('.bzr')
>> +            if root == '.':
>> +                continue
>> +            tree.add(root)
>> +        self.time(list, tree.list_files())
>> +
> 
> These could do with some explanation - the second one specifically I had
> to read twice to determine what code path you are hoping to expose
> issues in.

Sure. I added this comment:
  # Bzr only traverses directories if they are versioned
  # So add all the directories, but not the files, yielding
  # lots of unknown files.
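
Standalone, the add-the-directories-but-not-the-files pattern from the second benchmark looks roughly like this (a sketch outside of bzrlib, using a throwaway temp tree; `versioned_dirs_only` is just an illustrative name, not real bzr code):

```python
import os
import tempfile

def versioned_dirs_only(root):
    """Collect every subdirectory under root (but none of the files),
    skipping the .bzr control directory -- the benchmark adds exactly
    these, so every file in the tree ends up as an unknown."""
    added = []
    for dirpath, dirs, files in os.walk(root):
        if '.bzr' in dirs:
            dirs.remove('.bzr')
        if dirpath == root:
            continue
        added.append(os.path.relpath(dirpath, root))
    return added

# Hypothetical layout: a/f, a/b/g, plus an empty .bzr/ control dir
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, 'a', 'b'))
os.makedirs(os.path.join(top, '.bzr'))
open(os.path.join(top, 'a', 'f'), 'w').close()
open(os.path.join(top, 'a', 'b', 'g'), 'w').close()

print(sorted(versioned_dirs_only(top)))  # only the directories, no files
```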


> 
>> === modified file 'BRANCH.TODO'
>> --- BRANCH.TODO	
>> +++ BRANCH.TODO	
>> @@ -1,3 +1,5 @@
>>  # This file is for listing TODOs for branches that are being worked on.
>>  # It should ALWAYS be empty in the mainline or in integration branches.
>>  # 
>> +
>> +Test osutils.file_kind for more than just regular files and directories
> 
> 
> This should either be done, or put in TODO before merging.
> 

I'll go ahead and do it. It was a reminder for myself anyway.

> 
> 
> 
>> ...        
>> +        # If the name changes, or the parent_id changes, we have a rename
>> +        # (if we move a parent, that doesn't count as a rename for the file)
>> +        if (old_entry.name != new_entry.name 
>> +            or old_entry.parent_id != new_entry.parent_id):
>> +            delta.renamed.append((old_path,
>> +                                  new_path,
>> +                                  old_entry.file_id, old_entry.kind,
>> +                                  text_modified, meta_modified))
>> +        elif text_modified or meta_modified:
>> +            delta.modified.append((new_path, new_entry.file_id, new_entry.kind,
>> +                                   text_modified, meta_modified))
>> +        elif want_unchanged:
>> +            delta.unchanged.append((new_path, new_entry.file_id, new_entry.kind))
> 
> 
> We have a method on InventoryEntry that is very close to the logic here.
> Maybe we should have a TODO to combine/reuse this logic.

Are you talking about '_unchanged'? It seems to solve a different problem.

> 
>> +
>> +    def handle_old(path, entry):
>> +        """old entry without a new entry match"""
>> +        if entry.file_id in added:
>> +            # Actually this is a rename, we found a new file_id earlier
>> +            # at a different location, so it is no-longer added
>> +            x_new_path, x_new_entry = added.pop(entry.file_id)
>> +            check_matching(path, entry, x_new_path, x_new_entry)
>> +        else:
>> +            # We have an old_file_id which doesn't line up with a new_file_id
>> +            # So this file looks to be removed
>> +            assert entry.file_id not in removed
>> +            removed[entry.file_id] = path, entry
>> +
>> +    def handle_new(path, entry):
>> +        """new entry without an old entry match"""
>> +        if entry.file_id in removed:
>> +            # We saw this file_id earlier at an old different location
>> +            # it is no longer removed, just renamed
>> +            x_old_path, x_old_entry = removed.pop(entry.file_id)
>> +            check_matching(x_old_path, x_old_entry, path, entry)
>> +        else:
>> +            # We have a new file which does not match an old file
>> +            # mark it as added
>> +            assert entry.file_id not in added
>> +            added[entry.file_id] = path, entry
> 
> These two methods could use a little more exposition in the docstring I
> think. (I.e. 'This method is called when a file appears to have been
> deleted. It can converted what we thought was an added file to a rename,
> as we gain information about what files are where.')
> 
> ...

I added more to the docstring.

> 
>> +        delta.added.append((new_path, new_entry.file_id, new_entry.kind))
>> +
>>      delta.removed.sort()
>>      delta.added.sort()
>>      delta.renamed.sort()
> 
> Do we need to sort still ? (If we are adding in alphabetical order ?)
> 

Renames come out in a semi-random order (the old code sorted on (old_path,
new_path)), and 'removed' and 'added' are produced in dictionary-key
order, so those need to be sorted.

But 'modified' and 'unchanged' shouldn't need to be sorted.
Is there any way to just assert that a list is sorted, for testing purposes?

I added them as a TODO for now.
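
For the testing question, one cheap option is an explicit sortedness assertion (a sketch; `assert_sorted` is a hypothetical helper, not something in bzrlib):

```python
def assert_sorted(seq):
    """Fail if seq is not already in non-decreasing order."""
    assert all(seq[i] <= seq[i + 1] for i in range(len(seq) - 1)), \
        'sequence is not sorted: %r' % (seq,)

assert_sorted([])
assert_sorted([('a', 1), ('b', 2)])   # tuples compare lexicographically
try:
    assert_sorted([3, 1, 2])
except AssertionError:
    print('caught an unsorted list')
```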

...

> for more tuning though:
> def file_kind(f, _formats=_formats, _unknown='unknown'):
>     return _formats.get(os.lstat(f).st_mode & 0170000, _unknown)
> 
> AIUI this should be about as fast as you can get: the use of keyword
> parameters to pass _formats and _unknown makes them into locals rather
> than globals, so they get compiled into the bytecode, and not allocated
> each time.
> 
> Its possible the try: except: is actually faster than the get() call -
> but even there the keyword parameter approach should be a benefit.

The keyword parameters were indeed a benefit, and one that I was looking
for. (I had forgotten about the trick, and was trying to figure out how
to make them locals.) The defaults are evaluated only once (versus
re-binding them as real locals on every call), and more importantly they
avoid the double lookup: a bare global name is first looked up in
locals(), misses, and is then looked up in globals().

try/except is actually slightly faster (about 5%). I'm guessing the
major benefit is that we almost never hit an unknown kind, so the
except branch is rarely taken.

I did a slightly different one:
def file_kind(f, _formats=_formats, _unknown='unknown', _lstat=os.lstat):
    try:
        return _formats[_lstat(f).st_mode & 0170000]
    except KeyError:
        return _unknown


These are the timings that I measured (run twice, taking the lowest):

original: 302ms
globals version: 274ms
try/except: 263ms
get(..., 'unknown'): 271ms


Honestly, they are probably all within the noise margin at this point; I
would have to run them far more than twice to detect real differences.
But all of the optimized versions beat the original, and every variation
that used local bindings beat the global versions.
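
For anyone who wants to reproduce the comparison, here is a standalone version of the variant I settled on (the `_formats` table below is an illustrative reconstruction, not necessarily bzrlib's exact table, and the octal literal is spelled `0o170000` for newer Pythons):

```python
import os
import stat
import timeit

# Illustrative mode->kind table; bzrlib's real _formats may differ.
_formats = {
    stat.S_IFDIR: 'directory',
    stat.S_IFCHR: 'chardev',
    stat.S_IFBLK: 'block',
    stat.S_IFREG: 'file',
    stat.S_IFIFO: 'fifo',
    stat.S_IFLNK: 'symlink',
    stat.S_IFSOCK: 'socket',
}

def file_kind(f, _formats=_formats, _unknown='unknown', _lstat=os.lstat):
    # Keyword defaults bind everything as fast locals; the common case
    # never raises, so try/except edges out dict.get().
    try:
        return _formats[_lstat(f).st_mode & 0o170000]
    except KeyError:
        return _unknown

print(file_kind('.'))  # 'directory'
print('%.1fms' % (1000 * timeit.timeit("file_kind('.')",
                                       globals=globals(), number=10000)))
```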

> 
> +1 - the only Must is the change to Branch.TODO, everything else is
> advisory.
> 
> Rob

Honestly, I don't think there is much more optimization we can do to
file_kind itself at this point. What would help more is a stat
cache. I don't know how expensive it would be to maintain, but our
'is_executable' check does a stat, as does 'file_kind', as does
'get_sha1' (through the hashcache fingerprint).

I'm not sure what the best thing would be. It might be to make
everything go through WorkingTree, which can then maintain a cache as
long as a lock is held. It could create in-memory inventory entries on
demand, and they wouldn't check the on-disk files more than once.
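
A minimal sketch of what I mean (hypothetical; `StatCache` is not bzrlib API, and the real thing would live behind the WorkingTree lock):

```python
import os

class StatCache(object):
    """Cache lstat results while a tree lock is held, so the kind,
    executability and sha1-fingerprint checks share one stat per path."""

    def __init__(self):
        self._cache = {}

    def lstat(self, path):
        try:
            return self._cache[path]
        except KeyError:
            st = self._cache[path] = os.lstat(path)
            return st

    def is_executable(self, path):
        return bool(self.lstat(path).st_mode & 0o111)

    def invalidate(self):
        # Called on unlock: the disk can change underneath us afterwards.
        self._cache.clear()

cache = StatCache()
first = cache.lstat('.')
second = cache.lstat('.')
print(first is second)  # the second call is answered from the cache
```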

Then we could also benefit from doing more of an in-inode-order scan.
Maybe something where we determine which files we are interested in,
and then ask WorkingTree to do an in-order scan.
It really should only touch the requested files, so that we can make
'bzr commit just/one/file' really fast in a kernel-sized tree.

It would require quite a few changes, since list_files has to touch
everything. I'm thinking we could have a separate code path in
compare_trees when specific_files is nonempty, since it also involves
special-casing renames: we haven't looked at all of the inventory on
both sides, so a file that has been moved looks like it has been
deleted, and so on.
We could create a 'list_specific_files' function, which could be smarter
about not iterating over the entire tree.
There are some edge cases (like bzr commit .) that would probably be
served better by the standard list_files. And we have to be aware that
"bzr status foo/ foo/bar foo/bar/.." has a strong chance of trying to
list the same files more than once.
But the performance implications on large trees seem like it would be
worth investigating. (And whenever we get a per-directory inventory, it
has the advantage of not needing to parse all of the old inventory).
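
As a rough sketch of the shape 'list_specific_files' might take (entirely hypothetical; real bzrlib paths are tree-relative and go through the inventory rather than os.walk), including the de-duplication needed for overlapping arguments like 'foo/ foo/bar foo/bar/..':

```python
import os
import tempfile

def list_specific_files(paths):
    """Yield each file under the requested paths exactly once,
    normalizing paths so overlapping requests don't repeat work."""
    seen = set()
    for p in paths:
        p = os.path.normpath(p)
        if os.path.isfile(p):
            if p not in seen:
                seen.add(p)
                yield p
            continue
        for dirpath, dirs, files in os.walk(p):
            for name in files:
                f = os.path.normpath(os.path.join(dirpath, name))
                if f not in seen:
                    seen.add(f)
                    yield f

# Hypothetical tree: foo/a and foo/bar/b
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, 'foo', 'bar'))
open(os.path.join(top, 'foo', 'a'), 'w').close()
open(os.path.join(top, 'foo', 'bar', 'b'), 'w').close()

# 'foo', 'foo/bar' and 'foo/bar/..' all overlap; each file shows up once.
args = [os.path.join(top, 'foo'),
        os.path.join(top, 'foo', 'bar'),
        os.path.join(top, 'foo', 'bar', '..')]
hits = list(list_specific_files(args))
print(len(hits))  # 2
```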

John
=:->
