How to handle extracting lots of Inventories

Wed Oct 8 16:24:40 BST 2008

I've been looking at "bzr reconcile" and trying to figure out why it is
so slow. I haven't profiled it everywhere, but at least for bzr.dev, the
time in "_generate_text_index" is primarily spent in
xml8.unpack_inventory.

Breaking that down, we spend maybe 5s going from knit data to XML lines.
We then spend more than 30s going from XML lines to Inventory objects.

I believe the primary reason is that when we unpack 200 inventories, we
deserialize the same entries many, many times. For a tree with 20k
entries and 10 changed items, we deserialize 19,990 entries that are
identical to those in the previous inventory.
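
To make the redundancy concrete, here is a rough sketch. It is not bzrlib
code and not a worked-out fix: the EntryCache class, the make_inventory_xml
helper, and the toy XML layout are all hypothetical, and it assumes an
entry is fully identified by its (file_id, revision) pair. It only shows
that, with entries cached on that key, a second inventory with 10 changes
would need 10 new entry objects rather than 20,000.

```python
import xml.etree.ElementTree as ET


class EntryCache:
    """Hypothetical helper: reuse deserialized entries across inventories."""

    def __init__(self):
        self._entries = {}
        self.built = 0  # number of entry objects actually constructed

    def unpack(self, xml_text):
        """Turn one toy inventory XML document into {file_id: entry}."""
        inventory = {}
        for elt in ET.fromstring(xml_text):
            # Assumption: (file_id, revision) uniquely identifies an entry.
            key = (elt.get('file_id'), elt.get('revision'))
            entry = self._entries.get(key)
            if entry is None:
                # Only construct entries we have not seen before.
                entry = dict(elt.attrib, kind=elt.tag)
                self._entries[key] = entry
                self.built += 1
            inventory[key[0]] = entry
        return inventory


def make_inventory_xml(revisions):
    """Build a toy inventory; maps file_id -> last-changed revision."""
    root = ET.Element('inventory')
    for file_id, revision in sorted(revisions.items()):
        ET.SubElement(root, 'file', file_id=file_id,
                      revision=revision, name=file_id)
    return ET.tostring(root, encoding='unicode')


# A 20k-entry tree, then a second inventory with 10 changed entries.
first = {'f%d' % i: 'rev-1' for i in range(20000)}
second = dict(first)
for i in range(10):
    second['f%d' % i] = 'rev-2'

cache = EntryCache()
inv1 = cache.unpack(make_inventory_xml(first))
built_after_first = cache.built
inv2 = cache.unpack(make_inventory_xml(second))
built_for_second = cache.built - built_after_first
```

Note that the sketch still parses every XML element; only the object
construction is skipped. Sharing entries between inventories like this
would also only be safe if nothing mutates them afterwards.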

I don't have a great answer for this yet, but fixing it would certainly
make things like 'bzr reconcile' a lot faster. I know Robert is mulling
over how to improve parts of inventory, so I figured I should throw this
data point out.

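from bzrlib import repository, tsort

r = repository.Repository.open('.')
r.lock_read()
try:
    all_revs = r.all_revision_ids()
    pm = r.get_parent_map(all_revs)
    ordered = tsort.topo_sort(pm)
    # Extract the inventory XML texts in chunks of 100 revisions
    for start in xrange(0, len(ordered), 100):
        list(r._iter_inventory_xmls(ordered[start:start+100]))
finally:
    r.unlock()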

With a chunk size of 100 it takes 55s here; with 200 it takes 42s.

Regardless, try changing the call inside the loop to

list(r.iter_inventories(ordered[start:start+100]))

and you'll find it is *much* slower. I'm going to try to let it finish,
but so far I'm 5 minutes in and only at 14k of 26k.

John
=:->