[MERGE][RFC] Add simple revision serializer based on RIO.

John Arbash Meinel john at arbash-meinel.com
Thu May 7 18:07:00 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jelmer Vernooij wrote:
> Hi Ian,
> 
> Ian Clatworthy wrote:
>> Ian Clatworthy has voted comment. Status is now: Semi-approved
>> Comment: A performance improvement would be one of the main reasons
>> for including this or otherwise. I'd like to see this patch
>> resubmitted so that this serialiser is used in a proposed new
> >> development format. (At first glance, it looks like you're changing
>> the behaviour of an existing format?) I can then more easily
>> benchmark its impact, report the results and vote accordingly.
> The attached patch adds "development7-rich-root" that is basically
> "development6-rich-root" that uses the RIO revision serializer. I
> don't have any experience adding new repository formats, so I hope
> I've updated all the right places. The testsuite passes, and I've at
> least been able to manually initialize new repositories of this kind
> and verified that they're actually using RIO (using "bzr cat-revision").

...

Looking over it:

1) You don't have an embedded 'format' field, which we have kept for all
XML formats. I think it would probably be good to continue having one.

2) My quick test was to create a conversion of bzrtools + the first 1k
revs of mysql. Interestingly, the sizes of the compressed revision texts
are identical within my resolution:
$ wbzr repository-details d6rr-ext

Commits: 2159
                      Raw    %    Compressed    %  Objects
Revisions:       4533 KiB   0%       738 KiB   7%     2159

$ wbzr repository-details d7rr/
Commits: 2159
                      Raw    %    Compressed    %  Objects
Revisions:       4255 KiB   0%       738 KiB   7%     2159

The raw size is *slightly* smaller.

3) As *I* expected, RIO deserialization seems to be slower than XML
deserialization.

Using:

  TIMEIT -s "b = branch.Branch.open('.'); b.lock_read()" \
         -s "anc = b.repository.get_ancestry(b.last_revision())" \
         -s "anc.pop(0) #None" \
         "revs = list(b.repository.get_revisions(anc))"

For bzrtools w/1116 revs, the difference is:
  XML: 10 loops, best of 3: 105 msec per loop
  RIO: 10 loops, best of 3: 205 msec per loop

For mysql w/1043 revs, but a *lot* of per-file merge info:
  XML: 10 loops, best of 3: 147 msec per loop
  RIO: 10 loops, best of 3: 3.24 sec per loop

So for small texts, RIO is 2x slower than XML; for *large* texts, I'm
looking at 22x slower.


A long while back I looked at using RIO for inventory texts, and found
that it was *really* hard to beat the cElementTree XML decoder.
Potentially, we could go to a C implementation of RIO deserialization,
and I would imagine that could be ~ the same speed as XML.

Note, I did find a relatively simple tweak to the RIO reader that
doesn't change the times for bzrtools, but makes a huge difference for
mysql. (Build up the current value into a list, rather than into a
string that must be re-allocated.)
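
To illustrate the kind of tweak I mean, here is a minimal sketch in
Python 3. It is not the actual bzrlib reader: read_stanza_pairs is a
made-up name, and the line format it assumes ('tag: value', a leading
tab for continuation lines, a blank line ending the stanza) is a
simplification. The point is only that the pieces of a value are
collected in a list and joined once, instead of growing a string.

```python
def read_stanza_pairs(lines):
    """Parse RIO-style lines into (tag, value) pairs.

    Hypothetical, simplified reader: 'tag: value' starts a field, a line
    starting with a tab continues the previous value, and a blank line
    ends the stanza.
    """
    pairs = []
    tag = None
    parts = []  # pieces of the current value, joined once per field
    for line in lines:
        text = line.rstrip("\n")
        if text == "":
            break  # blank line: end of stanza
        if text.startswith("\t"):
            if tag is None:
                raise ValueError("continuation line before any field")
            parts.append(text[1:])  # drop the tab, keep the rest
        else:
            if tag is not None:
                pairs.append((tag, "\n".join(parts)))
            tag, sep, first = text.partition(": ")
            if not sep:
                raise ValueError("malformed line: %r" % (line,))
            parts = [first]
    if tag is not None:
        pairs.append((tag, "\n".join(parts)))
    return pairs
```

With a value that spans many continuation lines, the join happens once
per field, so the cost is linear in the size of the value rather than
depending on how often the growing string has to be copied.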

So for mysql, that gets us to:
  XML: 10 loops, best of 3: 147 msec per loop
  RIO: 10 loops, best of 3: 370 msec per loop

A fairly flat 2x to 2.5x slower than our current XML formats.

4) I'm pretty sure I could write a decent C parser for RIO, though it
wouldn't quite conform to the current api. The current api expects an
'iterable of lines', which is ok, but means that all multi-line
constructs end up getting parsed into strs, then decoded from UTF-8,
then merged back together into a large unicode string.

Actually, come to think of it, because you have to explicitly *remove*
bytes from multi-line entries, you can't really do it particularly
efficiently. I suppose you could find the next '\n' then "peek" to see
if the next char is '\t' to know that this was actually a continuation,
then keep going until you've found the whole value, then come back and
allocate your final array, and copy in the various bits.
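
The scan described above could look roughly like this. It is a Python 3
sketch of the idea only, working on a single bytes buffer; read_value is
a hypothetical helper, not part of the current api, and a real version
would be written in C. The first pass finds the pieces of the value by
peeking for a tab after each newline, the second pass allocates the
result once and copies the pieces in.

```python
def read_value(buf, start):
    """Return (value, next_pos) for the value that begins at buf[start].

    Hypothetical helper: the value runs up to the first newline that is
    not followed by a tab; each newline+tab pair marks a continuation.
    """
    # Pass 1: locate every segment and add up the final size.
    segments = []
    total = 0
    pos = start
    while True:
        nl = buf.find(b"\n", pos)
        if nl == -1:
            nl = len(buf)  # tolerate a missing trailing newline
        segments.append((pos, nl))
        total += nl - pos
        if buf[nl + 1:nl + 2] == b"\t":  # peek: is this a continuation?
            pos = nl + 2
        else:
            next_pos = min(nl + 1, len(buf))
            break
    total += len(segments) - 1  # one newline between adjacent segments

    # Pass 2: allocate the final array once and copy the pieces in.
    out = bytearray(total)
    offset = 0
    for i, (seg_start, seg_end) in enumerate(segments):
        if i:
            out[offset] = 0x0A  # the newline that joins two segments
            offset += 1
        size = seg_end - seg_start
        out[offset:offset + size] = buf[seg_start:seg_end]
        offset += size
    return bytes(out), next_pos
```

Even in this form the value has to be walked twice, which is the
inefficiency I'm complaining about.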

Certainly less efficient than something like length-prefixed strings...
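
For comparison, a length-prefixed field needs no scanning of the value
at all. The layout below ("tag length\n" followed by the raw bytes and a
newline) is invented purely for illustration, as are the encode_field
and decode_field names; it is not a proposed format.

```python
def encode_field(tag, value):
    """Encode one field as b'<tag> <length>\\n<utf-8 bytes>\\n'."""
    data = value.encode("utf-8")
    return b"%s %d\n%s\n" % (tag.encode("ascii"), len(data), data)


def decode_field(buf, pos):
    """Decode the field at buf[pos]; return (tag, value, next_pos)."""
    header_end = buf.index(b"\n", pos)
    tag, length = buf[pos:header_end].split(b" ")
    size = int(length)
    start = header_end + 1
    # The value is one slice; embedded newlines need no special handling.
    value = buf[start:start + size]
    return tag.decode("ascii"), value.decode("utf-8"), start + size + 1
```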

Also, handling properties as a nested Stanza works, but is a bit odd
when you start getting into multi-line properties.
I wonder if we would want to do something like JSON instead...
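
As a rough idea of what that might look like, here is a Python 3 sketch
using the standard json module. The field names in the example dict are
illustrative assumptions, not the layout of any existing serializer, and
serialize_revision / deserialize_revision are hypothetical names.
Properties become a plain nested object, and newlines inside values are
escaped, so multi-line properties need no continuation handling.

```python
import json


def serialize_revision(rev):
    """Serialize a dict describing a revision to UTF-8 encoded JSON."""
    return json.dumps(rev, sort_keys=True, ensure_ascii=False).encode("utf-8")


def deserialize_revision(data):
    """Inverse of serialize_revision."""
    return json.loads(data.decode("utf-8"))


# Example with made-up field names and a multi-line property value.
example = {
    "revision-id": "rev-1",
    "message": "first line\nsecond line",
    "properties": {"branch-nick": "trunk", "notes": "a\nb\nc"},
}
```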

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoDFTQACgkQJdeBCYSNAAP3sQCfY/kL0AveBq/qX8YPE0aKcLim
nw4AoIsFcup3IgxOd0InimSHMg8CQ/fP
=BZfo
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: rio-serializer3.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090507/5d22009f/attachment-0001.diff 

