[BUG] serializer.write_inventory() != write_inventory_to_string()

Thu Oct 20 17:18:24 BST 2005

Martin Pool wrote:
> On 20/10/05, John Arbash Meinel <john at arbash-meinel.com> wrote:
> 
>>It seems there is a small bug in how we write our xml out to a file.
> 
> 
> I saw the missing \n the other day but hadn't looked at it yet.
> 
> I agree we should make them consistent and both should have the final newline.
> 
> By the way did you reach any conclusion about writing out xml directly
> rather than through cElementTree.  I mean did you want to do it, or
> think it was a bad idea, or not important.  I know there is the
> ability to insert more newlines and so get more delta compression, but
> that may be two-edged for performance.

Short version:
I don't think it is faster to write the xml out directly. Neither is
reading it in using iterparse, though memory consumption should decrease
(slightly?).
However, with my last version of the print reformatting, I was able to
achieve 900k vs 1400k for inventory.weave, with only a 10% increase in
the total number of lines.
I was also able to get a version of a revision.weave which was just
smaller than the apparent size (< 1MB).

So I think it might indeed be worth something, just don't expect it to
be much faster. (Though in theory you would decrease loading time. And
if we switched to an indexed weave, both revision.weave and
inventory.weave could benefit from it. You download one small index
file, and then whatever chunks you are missing from the .weave)

Longer version:

For my bzr-sax branch, I do believe that performance is pretty much a
wash. I think cElementTree / ElementTree pretty much just "print" out
the strings anyway. And when reading them back in, "iterparse" does
exactly the same work; it just gives you an opportunity to clear out the
elements as you go (most of the work is done in expat anyway).
So using iterparse means you could save on memory consumption, because
it doesn't build the entire tree before returning, but it doesn't save
you much, because each little step is identical.
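To illustrate the "clear out the elements as you go" point, here is a
minimal sketch. It assumes the stdlib xml.etree.ElementTree module (the
modern home of the cElementTree API) and a made-up, heavily simplified
inventory document; the <entry> tag and the iter_entries() helper are
hypothetical, not what bzr actually uses.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified inventory document.
SAMPLE = b"""<inventory>
<entry file_id="a-1" name="a.txt" text_size="10"/>
<entry file_id="b-2" name="b.txt" text_size="20"/>
</inventory>
"""


def iter_entries(stream):
    """Yield each <entry>'s attributes, then empty the element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "entry":
            # Copy the attributes out before the element is cleared.
            yield dict(elem.attrib)
            # Drop the element's attributes and children. The emptied
            # element stays attached to the root, so the saving is partial.
            elem.clear()


entries = list(iter_entries(io.BytesIO(SAMPLE)))
print(entries)
```

The parser still does the same work for every element; the only
difference is that the caller never holds the fully populated tree.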

So from a performance standpoint, it is pretty much a wash, and not
worth changing how we serialize to xml.

On the other hand, as you mentioned, it does let us control the layout
of the xml. My first experiments exposed a weakness in the weave
format: the file grows too long when I put *every* attribute on a line
of its own.

Then I realized that, for inventories, about half of the attributes are
very consistent: file_id, name, parent_id, etc. can change, but do so
extremely rarely. The others (text_sha1, text_size, etc.) change very
frequently.

What I did was to break up each inventory entry into 2 lines. The first
one is for the attributes that don't change often, and the second one is
for the attributes that do. This gives us 99% of the improved
compression, while limiting both the total number of text lines and the
total number of control lines. Before, you might have 1 line the same,
then 1 changed, then 3 the same, then 2 changed, which would be a total
of 2*4=8 control lines ({ } [ ]): 4 for each of the 2 changed runs.
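The two-line split could be sketched roughly like this. It is only an
illustration: the STABLE / VOLATILE grouping is assumed from the
attributes named above, and format_entry() is a hypothetical helper, not
the code in the bzr-sax branch.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Assumed grouping; the real serializer may divide attributes differently.
STABLE = ("file_id", "name", "parent_id")
VOLATILE = ("text_sha1", "text_size")


def _attrs(entry, keys):
    """Render the selected attributes as key="value" pairs."""
    return " ".join(
        "%s=%s" % (key, quoteattr(str(entry[key]))) for key in keys if key in entry
    )


def format_entry(entry):
    """Return one <file> element spread over two text lines."""
    return "<file %s\n      %s />\n" % (_attrs(entry, STABLE), _attrs(entry, VOLATILE))


old = {"file_id": "a-1", "name": "a.txt", "parent_id": "ROOT",
       "text_sha1": "aaaa", "text_size": 10}
# A content change touches only the volatile attributes.
new = dict(old, text_sha1="bbbb", text_size=12)

old_lines = format_entry(old).splitlines()
new_lines = format_entry(new).splitlines()
# Indexes of the text lines that differ between the two versions.
changed = [i for i, (a, b) in enumerate(zip(old_lines, new_lines)) if a != b]

# The newline inside the tag is plain whitespace, so this is still valid xml.
parsed = ET.fromstring(format_entry(new)).attrib
```

Here only the second line of the entry differs between the two versions,
so a line-based delta stores one new line per modified file instead of a
whole entry.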

I was able to get decent compression out of it (900k vs 1.4MB for the
bzr.dev of the time). I believe the new total was 27k lines vs 25k, so
not a large increase.
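To make the control-line arithmetic from the example above concrete,
here is a small sketch. It assumes that every run of changed lines costs
4 control lines (a delete block plus an insert block); the
control_lines() helper and the flag lists are purely illustrative.

```python
from itertools import groupby

# Assumed cost: "[" and "]" for the delete, "{" and "}" for the insert.
CONTROL_LINES_PER_CHANGED_RUN = 4


def control_lines(changed_flags):
    """Count control lines for a sequence of per-line "changed" flags."""
    changed_runs = sum(1 for changed, _run in groupby(changed_flags) if changed)
    return changed_runs * CONTROL_LINES_PER_CHANGED_RUN


# One attribute per line: 1 same, 1 changed, 3 same, 2 changed.
one_attr_per_line = [False, True, False, False, False, True, True]
# Two lines per entry: the stable line is unchanged, the volatile one changed.
two_lines_per_entry = [False, True]

print(control_lines(one_attr_per_line))    # 2 changed runs
print(control_lines(two_lines_per_entry))  # 1 changed run
```

Under that assumption the scattered layout costs 8 control lines for the
entry and the two-line layout costs 4.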

I think something similar could be done for the revision texts. I was
able to get a revision.weave file to be just less than the apparent
filesystem size.

John
=:->

> 
> --
> Martin
> 
> 
