[brisbane-core MERGE] CHKInventory.iter_non_root_entries()
John Arbash Meinel
john at arbash-meinel.com
Fri Mar 6 21:10:38 GMT 2009
Ian Clatworthy wrote:
> So fast-import now kind of works for CHK repo formats, give or
> take 'deleteall' directives (as used by e.g. darcs-fast-export).
> Unfortunately though, it's rather slow at the moment - about
> 5-7 times slower (for gc-chk255) than importing into a pack repo.
> And pack importing is *not* fast - it's about 10 times slower
> than git-fast-import or hg-fast-import from what I'm told. :-(
> Time for some profiling ...
>
> It seems that operations that walk directories in CHKInventories
> are slow, e.g. directories() and iter_changes(). As it turns out
> though, I don't *need* the path returned by iter_changes() in
> fast-import - just the inventory entries while loading texts.
>
> The attached patch adds a new method to CHKInventory called
> iter_non_root_entries(). Using it instead of iter_entries() cuts
> the fast-import time for gc-chk255 by half. Hooray.
>
> Ian C.
>
I don't think that CHKInventory has specialized 'iter_entries' or
'iter_entries_by_dir' implementations, which may be part of the problem
you are seeing.
I think we could do a custom implementation, whose first loop goes over
the parent_id,basename => file_id map in order to start building up the
path, and which then probes into id_to_entry to get the actual contents.
As one possible example:

    cur_dir_id = self.root_id
    cur_path = ''
    pending = [(cur_path, cur_dir_id)]
    pid_basename = self.parent_id_basename_to_file_id
    while pending:
        cur_path, cur_dir_id = pending.pop()
        children = [file_id for key, file_id
                    in pid_basename.iteritems([(cur_dir_id,)])]
        # check _entry_cache first
        ...
        # Now find the remaining entries
        entries = [entry for key, entry
                   in self.id_to_entry.iteritems(children)]
        for entry in entries:
            if cur_path:
                path = cur_path + '/' + entry.basename
            else:
                path = entry.basename
            if entry.kind == 'directory':
                pending.append((path, entry.file_id))
            yield path, entry
With a couple of tweaks this would yield exactly 'iter_entries_by_dir' order.
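As a stand-alone illustration of that tweak (plain dicts here stand in
for the real CHK maps, and the function and parameter names are invented
for the sketch): processing directories FIFO and sorting each
directory's children by basename gives the "all entries of a directory
before its subdirectories' contents" ordering.

```python
from collections import deque

def iter_entries_by_dir_order(children_of, kind_of):
    """Walk a tree in 'directory-first' order: all entries of a
    directory are yielded (sorted by basename) before any
    subdirectory's contents.

    `children_of` maps a directory path to its child basenames;
    `kind_of` maps a path to 'directory' or 'file'.  Both are dict
    stand-ins for the real inventory maps.
    """
    pending = deque([''])          # FIFO: parents before children
    while pending:
        cur_path = pending.popleft()
        for basename in sorted(children_of.get(cur_path, ())):
            path = cur_path + '/' + basename if cur_path else basename
            if kind_of[path] == 'directory':
                pending.append(path)
            yield path, kind_of[path]
```

Swapping the deque for a LIFO stack would give depth-first order
instead; the FIFO is what makes it by-dir.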
Anyway, if you used the iteritems() tweak that you posted previously,
iteritems(key_filter=XXX) would have been scanning all pages rather than
just a subset, which may account for the processing overhead you saw.
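To make the page-scanning point concrete, here is a toy model (the
paged layout and names are invented, not the real CHKMap internals):
when the key filter is pushed down into the page lookup, pages that
cannot contain a wanted key are never loaded at all.

```python
def filtered_items(pages, key_filter):
    """Toy paged key->value map.  `pages` maps a key prefix to a dict
    of items.  With a key_filter we only load pages whose prefix
    matches some wanted key; without one, every page would be scanned.
    Returns (matching items, number of pages actually loaded).
    """
    wanted_prefixes = {key[0] for key in key_filter}
    found, pages_loaded = [], 0
    for prefix, items in pages.items():
        if prefix not in wanted_prefixes:
            continue               # page skipped entirely
        pages_loaded += 1
        found.extend((k, v) for k, v in items.items()
                     if k in key_filter)
    return found, pages_loaded
```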
That said, if computing 'path' is a significant overhead for
information you then ignore, I think it is reasonable to add an API
that doesn't generate information you don't care about.
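To sketch that shape (this is an illustration of the idea only, not
Ian's actual patch; the dict stand-in and names are invented): the
iterator below walks the directory structure but never builds a path
string, which is exactly the work being skipped.

```python
def iter_non_root_entries(entry_children, root_id):
    """Yield (file_id, kind) for every entry except the root, without
    ever constructing a path.  `entry_children` maps a directory
    file_id to a list of (file_id, kind) child entries -- a dict
    stand-in for the real inventory maps.
    """
    pending = [root_id]
    while pending:
        dir_id = pending.pop()
        for file_id, kind in entry_children.get(dir_id, ()):
            if kind == 'directory':
                pending.append(file_id)
            yield file_id, kind    # no path string is ever built
```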
John
=:->