[MERGE][RFC] Add simple revision serializer based on RIO.
John Arbash Meinel
john at arbash-meinel.com
Thu May 7 18:07:00 BST 2009
Jelmer Vernooij wrote:
> Hi Ian,
>
> Ian Clatworthy wrote:
>> Ian Clatworthy has voted comment. Status is now: Semi-approved
>> Comment: A performance improvement would be one of the main reasons
>> for including this or otherwise. I'd like to see this patch
>> resubmitted so that this serialiser is used in a proposed new
>> development format. (At first glance, it looks like you're changing
>> the behaviour of an existing format?) I can then more easily
>> benchmark its impact, report the results and vote accordingly.
> The attached patch adds "development7-rich-root" that is basically
> "development6-rich-root" that uses the RIO revision serializer. I
> don't have any experience adding new repository formats, so I hope
> I've updated all the right places. The testsuite passes, and I've at
> least been able to manually initialize new repositories of this kind
> and verified that they're actually using RIO (using "bzr cat-revision").
...
Looking over it:
1) You don't have an embedded 'format' field, which we have kept for all
XML formats. I think it would probably be good to continue having one.
2) My quick test was to create a conversion of bzrtools + the first 1k
revs of mysql. Interestingly, the size of the compressed revision texts
is identical within my resolution:
$ wbzr repository-details d6rr-ext
Commits: 2159
                  Raw    %   Compressed    %   Objects
Revisions:   4533 KiB   0%      738 KiB   7%      2159
$ wbzr repository-details d7rr/
Commits: 2159
                  Raw    %   Compressed    %   Objects
Revisions:   4255 KiB   0%      738 KiB   7%      2159
The raw size is *slightly* smaller (4255 KiB vs 4533 KiB, about 6%).
3) As *I* expected, RIO deserialization seems to be slower than XML
deserialization.
Using:
TIMEIT -s "b = branch.Branch.open('.'); b.lock_read()" \
-s "anc = b.repository.get_ancestry(b.last_revision())" \
-s "anc.pop(0) #None" \
"revs = list(b.repository.get_revisions(anc))"
For bzrtools w/1116 revs, the difference is:
XML: 10 loops, best of 3: 105 msec per loop
RIO: 10 loops, best of 3: 205 msec per loop
For mysql w/1043 revs, but with a *lot* of per-file merge info:
XML: 10 loops, best of 3: 147 msec per loop
RIO: 10 loops, best of 3: 3.24 sec per loop
So for small texts, RIO is about 2x slower than XML; for *large* texts,
I'm looking at 22x slower.
A long while back I looked at using RIO for inventory texts, and found
that it was *really* hard to beat the cElementTree XML decoder.
Potentially, we could go to a C implementation of RIO deserialization,
and I would imagine that could be ~ the same speed as XML.
Note, I did find a relatively simple tweak to the RIO reader that
doesn't change the times for bzrtools, but makes a huge difference for
mysql. (Build up the current value into a list, rather than into a
string that must be re-allocated.)
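
To illustrate the kind of tweak I mean, here is a minimal sketch. It is
not the actual bzrlib RIO reader: read_stanza_pairs and _chomp are
made-up names, and the only format assumptions are "tag: value" lines,
continuation lines that start with a tab, and a blank line ending the
stanza. The point is just that the pieces of a multi-line value are
appended to a list and joined once, instead of growing a string with +=
on every continuation line.

```python
def _chomp(line):
    """Drop a single trailing newline, if present."""
    return line[:-1] if line.endswith("\n") else line


def read_stanza_pairs(lines):
    """Parse RIO-style lines into a list of (tag, value) pairs.

    Hypothetical helper for illustration only, not the bzrlib API.
    """
    pairs = []
    tag = None
    parts = []  # pieces of the current value, joined once when it ends
    for line in lines:
        if line in ("", "\n"):
            break  # a blank line terminates the stanza
        if line.startswith("\t"):
            if tag is None:
                raise ValueError("continuation line with no preceding tag")
            # Continuation: record the pieces, do not rebuild a string.
            parts.append("\n")
            parts.append(_chomp(line[1:]))
        else:
            if tag is not None:
                pairs.append((tag, "".join(parts)))
            tag, sep, rest = line.partition(": ")
            if not sep:
                raise ValueError("bad stanza line: %r" % (line,))
            parts = [_chomp(rest)]
    if tag is not None:
        pairs.append((tag, "".join(parts)))
    return pairs
```

With only a few continuation lines the two approaches cost about the
same, which would fit bzrtools not moving; the join-once form should
matter most when a single value spans very many lines.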
So for mysql that gets us to:
XML: 10 loops, best of 3: 147 msec per loop
RIO: 10 loops, best of 3: 370 msec per loop
That is about 2.5x here, so a fairly flat 2-2.5x slower than our current
XML formats.
4) I'm pretty sure I could write a decent C parser for RIO, though it
wouldn't quite conform to the current api. The current api expects an
'iterable of lines', which is ok, but means that all multi-line
constructs end up getting parsed into strs, then decoded from UTF-8,
then merged back together into a large unicode string.
Actually, come to think of it, because you have to explicitly *remove*
bytes from multi-line entries, you can't really do it particularly
efficiently. I suppose you could find the next '\n' then "peek" to see
if the next char is '\t' to know that this was actually a continuation,
then keep going until you've found the whole value, then come back and
allocate your final array, and copy in the various bits.
Certainly less efficient than something like length-prefixed strings...
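
A rough sketch of the scan-then-copy approach described above, written
in Python rather than C just to make the steps concrete. The names
read_value and read_length_prefixed are hypothetical, and the
"<decimal length>\n<payload>" encoding in the second function is only an
assumed stand-in for "something like length-prefixed strings", not a
proposed format.

```python
def read_value(buf, start):
    """Read a tab-continued value from bytes `buf` beginning at `start`.

    Returns (value, next_pos). Illustration only, not the bzrlib API.
    """
    # Pass 1: find each physical line of the value, peeking after every
    # newline to see whether a tab marks a continuation.
    segments = []
    pos = start
    while True:
        nl = buf.find(b"\n", pos)
        if nl == -1:
            segments.append((pos, len(buf)))
            pos = len(buf)
            break
        segments.append((pos, nl))
        pos = nl + 1
        if buf[pos:pos + 1] == b"\t":
            pos += 1  # skip the tab; the value continues
        else:
            break
    # Pass 2: allocate the final buffer once and copy the pieces in,
    # re-inserting the newlines but not the tabs.
    total = sum(e - s for s, e in segments) + len(segments) - 1
    out = bytearray(total)
    offset = 0
    for i, (s, e) in enumerate(segments):
        if i:
            out[offset] = 0x0A  # "\n" between joined lines
            offset += 1
        out[offset:offset + (e - s)] = buf[s:e]
        offset += e - s
    return bytes(out), pos


def read_length_prefixed(buf, start):
    """Read "<decimal length>\\n<payload>": one slice, no peeking."""
    nl = buf.index(b"\n", start)
    length = int(buf[start:nl])
    end = nl + 1 + length
    return buf[nl + 1:end], end
```

The first function has to touch every newline in the value twice (once
to find it, once to copy around it); the second only has to find the end
of the length header.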
Also, handling properties as a nested Stanza works, but is a bit odd
when you start getting into multi-line properties.
I wonder if we would want to do something like JSON instead...
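
For comparison, a minimal sketch of what a JSON-based revision text
might look like, using only the standard json module. The field names
and the "format" value are made up for illustration and are not a
proposed schema; the point is that nested properties and multi-line
values need no continuation-line handling, since newlines and tabs are
escaped inside JSON strings.

```python
import json


def encode_revision(rev):
    """Serialize a revision dict to a single-line JSON text."""
    return json.dumps(rev, sort_keys=True)


def decode_revision(text):
    """Inverse of encode_revision."""
    return json.loads(text)


# Hypothetical revision; every key and value here is an assumption.
example = {
    "format": 1,
    "revision-id": "joe@example.com-20090507-abc",
    "committer": "Joe <joe@example.com>",
    "message": "first line\nsecond line",
    "properties": {
        "branch-nick": "trunk",
        "notes": "line one\nline two\n\tindented line",
    },
}
text = encode_revision(example)
```

Whether this would actually be faster than cElementTree is something I
have not measured.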
John
=:->
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: rio-serializer3.diff
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090507/5d22009f/attachment-0001.diff