[rfc] lazy inventory serialization

John Arbash Meinel john at arbash-meinel.com
Mon Sep 4 23:53:34 BST 2006


As I'm looking at our performance, one thing that stands out is how
much time we spend reading and writing inventories. I know one thought
is to change the storage format so that serialization and
deserialization don't take as long. But I'm wondering whether we can
really do a whole lot better there. As I recall, cElementTree is
actually quite fast at parsing XML, and I wasn't really able to do any
better with a custom format that still went through the serialization code.

What I'm thinking is that for a large tree, when you do a partial
commit, we are actually parsing a whole lot of inventory entries, and
then serializing them back to XML form to write out the new inventory.

And in the case of the FreeBSD ports tree, that amounts to parsing 25K
entries (eventually it will be 100K entries), only to modify fewer than
10, and then writing all that data back out again.

So regardless of the final format, this is going to be inefficient. One
of the things I found with dirstate is that I don't try to create
objects, because it is a lot faster to leave everything as a list of
strings.
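
Roughly, the difference is something like this (a toy illustration, not
dirstate's actual record layout):

def parse_records(text):
    # Toy illustration only: each record stays a list of strings (the
    # fields of one line), and no per-entry object is constructed until
    # a caller actually needs one.
    return [line.split('\0') for line in text.splitlines() if line]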

I'm thinking that one thing we could do would be to make Inventory
objects lazy (possibly by making InventoryEntry objects lazy). We have
to take some care, because we don't want to tightly bind the Inventory
objects to their serialized form. So we probably need to pass in a
serializer object, and have a way to fast-path when the source and
target formats are the same, but do a full extract-and-recreate when the
formats differ.
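
Very roughly, something like this (just a sketch; LazyInventoryEntry and
the write-side helper are made-up names, not existing bzrlib API, though
_unpack_entry is the real xml5 method):

class LazyInventoryEntry(object):
    """Sketch: defer unpacking an entry from its parsed XML element."""

    def __init__(self, element, serializer):
        self._element = element        # raw cElementTree element
        self._serializer = serializer  # knows how to unpack the element
        self._real = None              # full entry, built on demand

    def _ensure_real(self):
        if self._real is None:
            self._real = self._serializer._unpack_entry(self._element)
        return self._real

    def __getattr__(self, name):
        # Only reached for attributes not set above, so touching
        # .kind, .name, .parent_id, etc. forces the full unpack.
        return getattr(self._ensure_real(), name)

    def serialize(self, target_serializer):
        if target_serializer is self._serializer and self._real is None:
            # Fast path: same format, never materialized -- reuse the
            # element (or its original text) as-is.
            return self._element
        # Slow path: fully extract and re-create in the target format
        # (entry_to_element is a hypothetical name for that step).
        return target_serializer.entry_to_element(self._ensure_real())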

I'm not sure exactly how we could make it work, because at some level
you need to know what directories exist, what files exist, and possibly
what file ids exist, so that callers can make requests based on either
paths or file ids. (If they make a request based on a path, you could
lazily load the path segments, but requests based on file ids don't have
that lazy property.)


If you profile one of these big commits, you can see that a large
portion of the time is spent in _unpack_entry (about 1/3rd of the time).
Now, I know that things are a little bit skewed under lsprof, but this
is the top of the profile:
   CallCount  Recursive  Total(ms)  Inline(ms)  module:lineno(function)
       89455          0     9.7154      3.9563  bzrlib.xml5:231(_unpack_entry)
     +572575          0     2.0235      2.0235  +<built-in method get>
      +71585          0     1.7783      0.7971  +bzrlib.inventory:659(__init__)
      +53673          0     1.1111      0.5922  +bzrlib.cache_utf8:62(get_cached_unicode)
      +89455          0     0.3204      0.3204  +bzrlib.inventory:326(versionable_kind)
      +17870          0     0.5258      0.2790  +bzrlib.inventory:549(__init__)

           2          0     4.2588      0.0148  bzrlib.xml_serializer:55(read_inventory)
          +2          0     3.9629      0.3433  +bzrlib.xml5:210(_unpack_inventory)
          +2          0     0.2811      0.0001  +bzrlib.xml_serializer:74(_read_element)

           1          0     4.0968      0.1522  bzrlib.xml5:106(write_inventory)
      +17891          0     3.4508      1.4574  +bzrlib.xml5:139(_append_entry)
      +17893          0     0.4101      0.2922  +bzrlib.inventory:898(iter_entries)
          +1          0     0.0382      0.0382  +<method 'writelines' of 'cStringIO.StringO' objects>

       17891          0     3.4508      1.4574  bzrlib.xml5:139(_append_entry)
     +253958          0     0.8032      0.8032  +<method 'append' of 'list' objects>
      +71508          0     1.0633      0.6649  +bzrlib.xml5:69(_encode_and_escape)
      +17891          0     0.0660      0.0660  +bzrlib.inventory:326(versionable_kind)
      +17835          0     0.0609      0.0609  +<isinstance>

These are snippets of lsprof output for committing a change to 1 file.
Right now we read 5 inventories at various times, so we know there is
work that can be done to improve that: we only *need* to read 2, the
working inventory and the basis inventory. Robert is working on fixing that.

So the first thing to notice is that cElementTree objects are quite
speedy. We make 572,000 calls to 'get()' and it only takes 2s. By
comparison, we create only 71,000 inventory entry objects, and that takes
almost 2s.

Looking closely, I can see that ElementTree.parse() only takes about
140ms for 18K files (this is the time spent in
Serializer._read_element). We spend 4s doing a 'write_inventory'. So
even with the new serializer, it still takes us quite a while to write
the inventory out. The old one was even worse, but I can think of a few
things to try to make it faster still.


So because the big expenses are all in the time spent extracting every
little tidbit about every file and directory, I think we could do fairly
well if we read in the whole text (140ms; dirstate doesn't do much
better for this many files), figured out what the contents are, what
file ids exist, and who their parents are. That would be 1 .tag request
and 2 .get() requests for each record, rather than the current 14 get()
requests for every record.
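
As a rough sketch of what that minimal first pass could look like (my
illustration, assuming the xml5 attribute names 'file_id' and
'parent_id'):

try:
    from xml.etree import cElementTree as ElementTree   # Python 2.5+
except ImportError:
    import cElementTree as ElementTree                   # standalone package

def skim_inventory(text):
    """Sketch: pull out only kind, file_id and parent_id per entry.

    Returns {file_id: (kind, parent_id, element)}, enough to answer
    path- and id-based lookups, while the element keeps everything
    else until somebody actually asks for it.
    """
    root = ElementTree.fromstring(text)
    by_id = {}
    for elem in root:
        kind = elem.tag                    # the one .tag request
        file_id = elem.get('file_id')      # .get() request #1
        parent_id = elem.get('parent_id')  # .get() request #2
        by_id[file_id] = (kind, parent_id, elem)
    return by_id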

This doesn't leave the raw text cached, though. We could change the
serializer so that we iterate through the lines of the file as we
iterate through the elements in the tree.
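
For instance (purely illustrative, and it assumes each entry serializes
to exactly one line, which the current xml5 output does but XML in
general does not guarantee): if we kept each entry's original line next
to its parsed element, writing the new inventory could reuse the cached
line for every untouched entry and only re-encode the handful that changed.

def write_inventory_reusing_lines(header, footer, entries, modified_ids,
                                  encode_entry):
    """Sketch only -- none of these names are existing bzrlib API.

    'entries' is an iterable of (original_line, file_id, entry) triples
    collected while parsing; 'encode_entry' re-serializes one entry.
    """
    out = [header]                           # e.g. '<inventory ...>\n'
    for line, file_id, entry in entries:
        if file_id in modified_ids:
            out.append(encode_entry(entry))  # slow path: changed entry
        else:
            out.append(line)                 # fast path: reuse raw text
    out.append(footer)                       # '</inventory>\n'
    return out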

It might be better to just wait for the next file format. But I honestly
don't think XML is as horrible as we thought it was. I think
cElementTree.parse() is quite fast. And if we do things right, we don't
have to pay as much of a penalty creating all of the inventory objects
in memory.

John
=:->
