[brisbane-core MERGE] CHKInventory.iter_non_root_entries()
John Arbash Meinel
john at arbash-meinel.com
Fri Mar 6 21:10:38 GMT 2009
Ian Clatworthy wrote:
> So fast-import now kind of works for CHK repo formats, give or
> take 'deleteall' directives (as used by e.g. darcs-fast-export).
> Unfortunately though, it's rather slow at the moment - about
> 5-7 times slower (for gc-chk255) than importing into a pack repo.
> And pack importing is *not* fast - it's about 10 times slower
> than git-fast-import or hg-fast-import from what I'm told. :-(
> Time for some profiling ...
>
> It seems that operations that walk directories in CHKInventories
> are slow, e.g. directories() and iter_changes(). As it turns out
> though, I don't *need* the path returned by iter_changes() in
> fast-import - just the inventory entries while loading texts.
>
> The attached patch adds a new method to CHKInventory called
> iter_non_root_entries(). Using it instead of iter_entries() cuts
> the fast-import time for gc-chk255 by half. Hooray.
>
> Ian C.
>
I don't think that CHKInventory has specialized 'iter_entries' or
'iter_entries_by_dir' implementations, which may be part of the problem
you are seeing.
I think we could do a custom implementation, whose first loop goes over
the parent_id,basename => file_id map in order to start building up the
path, and which then probes into id_to_entry to get the actual contents.
As one possible example:

    cur_dir_id = self.root_id
    cur_path = ''
    pending = [(cur_path, cur_dir_id)]
    pid_basename = self.parent_id_basename_to_file_id
    while pending:
        cur_path, cur_dir_id = pending.pop()
        children = [file_id for key, file_id
                    in pid_basename.iteritems([(cur_dir_id,)])]
        # check _entry_cache first
        ...
        # Now find the remaining entries
        entries = [entry for key, entry
                   in self.id_to_entry.iteritems(children)]
        for entry in entries:
            if cur_path:
                path = cur_path + '/' + entry.basename
            else:
                path = entry.basename
            if entry.kind == 'directory':
                pending.append((path, entry.file_id))
            yield path, entry
With a couple of tweaks this would yield exactly 'iter_entries_by_dir' order.
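As a stand-alone illustration of that tweak (plain dicts here stand in
for the real CHK maps, and the function and parameter names are invented
for the sketch): processing directories FIFO and sorting each
directory's children by basename gives the "all entries of a directory
before its subdirectories' contents" ordering.

```python
from collections import deque

def iter_entries_by_dir_order(children_of, kind_of):
    """Walk a tree in 'directory-first' order: all entries of a
    directory are yielded (sorted by basename) before any
    subdirectory's contents.

    `children_of` maps a directory path to its child basenames;
    `kind_of` maps a path to 'directory' or 'file'.  Both are dict
    stand-ins for the real inventory maps.
    """
    pending = deque([''])          # FIFO: parents before children
    while pending:
        cur_path = pending.popleft()
        for basename in sorted(children_of.get(cur_path, ())):
            path = cur_path + '/' + basename if cur_path else basename
            if kind_of[path] == 'directory':
                pending.append(path)
            yield path, kind_of[path]
```

Swapping the deque for a LIFO stack would give depth-first order
instead; the FIFO is what makes it by-dir.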
Anyway, if you used the iteritems() tweak that you posted previously,
iteritems(key_filter=XXX) would have been scanning all pages rather than
just a subset, which may account for the processing overhead you saw.
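To make the page-scanning point concrete, here is a toy model (the
paged layout and names are invented, not the real CHKMap internals):
when the key filter is pushed down into the page lookup, pages that
cannot contain a wanted key are never loaded at all.

```python
def filtered_items(pages, key_filter):
    """Toy paged key->value map.  `pages` maps a key prefix to a dict
    of items.  With a key_filter we only load pages whose prefix
    matches some wanted key; without one, every page would be scanned.
    Returns (matching items, number of pages actually loaded).
    """
    wanted_prefixes = {key[0] for key in key_filter}
    found, pages_loaded = [], 0
    for prefix, items in pages.items():
        if prefix not in wanted_prefixes:
            continue               # page skipped entirely
        pages_loaded += 1
        found.extend((k, v) for k, v in items.items()
                     if k in key_filter)
    return found, pages_loaded
```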
That said, if computing 'path' is a significant overhead for
information you then ignore, I think it is reasonable to add an API
that doesn't generate information you don't care about.
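To sketch that shape (this is an illustration of the idea only, not
Ian's actual patch; the dict stand-in and names are invented): the
iterator below walks the directory structure but never builds a path
string, which is exactly the work being skipped.

```python
def iter_non_root_entries(entry_children, root_id):
    """Yield (file_id, kind) for every entry except the root, without
    ever constructing a path.  `entry_children` maps a directory
    file_id to a list of (file_id, kind) child entries -- a dict
    stand-in for the real inventory maps.
    """
    pending = [root_id]
    while pending:
        dir_id = pending.pop()
        for file_id, kind in entry_children.get(dir_id, ()):
            if kind == 'directory':
                pending.append(file_id)
            yield file_id, kind    # no path string is ever built
```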
John
=:->