[MERGE][RFC] further add performance improvements
John A Meinel
john at arbash-meinel.com
Wed May 24 05:16:59 BST 2006
John A Meinel wrote:
...
>> Well, I don't know what rio format you were choosing to use, but in my
>> test, I found that rio is slower than cElementTree at reading, but can
>> be faster at writing.
>>
>> I have a plugin available from here:
>> http://bzr.arbash-meinel.com/plugins/rio_inventory
>
> ...
>
>> I'm also going to be looking into the effect on Knits of the new format.
>>
>
> I tested creating a Knit with my rio inventory format. With the new
> format, these are the basic results (this involves re-adding all of my
> inventories in my bzr repository):
>
> processed 5619 inventories
>
> extracted in 26.71s (4.8ms avg)
> Times   read (s)  read avg (ms)  write (s)  write avg (ms)
> xml:       71.82           12.8     270.07            48.1
> rio:      252.53           44.9     117.99            21.0
> ratio:    3.5160 (read)  0.4369 (write)  1.0837 (read + write)
>
> Times   add (s)  extract (s)
> xml:      98.38        18.55
> rio:     272.68        33.37
> ratio:     2.77         1.80
>
> sizes:  xml.knit  rio.knit  ratio
>          6880558   5693021  0.827
>
> The add and extract times are so much longer because knit (and weave)
> are line oriented, and now every attribute is another line, rather than
> being more information on the same line.
>
> But one very nice thing is the compression: rio.knit is only 82.7% of
> the size of xml.knit.
>
> I'm going to look into a format like my sax work, to see about a 2-line
> inventory style.
>
> John
> =:->
I was very surprised to see that switching to a 2-line format
drastically improved the performance.

$ time bzr rio-test
processed 5621 inventories
extracted in 27.70s (4.9ms avg)
Times   read (s)  read avg (ms)  write (s)  write avg (ms)
xml:       71.36           12.7     267.69            47.6
rio:      182.37           32.4     123.59            22.0
ratio:    2.5555 (read)  0.4617 (write)  0.9024 (read + write)

Times   add (s)  extract (s)
xml:      98.53        17.63
rio:     112.32        21.47
ratio:     1.14         1.22

sizes:  xml.knit  rio.knit  ratio
         6881208   5813188  0.845

This change actually dropped the test's run time by 5 minutes. We lose a
little bit of compression, but not much. And our read time is within a
factor of 3 of a C XML implementation. (Our write time is about the same
as before.)

The 2-line inventory is very similar to rio, only using '\t' instead of
'\n' as the delimiter. (And thus not allowing \n or \t in the value.)

A little bit more performance can be extracted by delaying the decode to
unicode. Part of the reason is that we aren't decoding the tags or any
meta characters.
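
A minimal sketch of the delayed decode, written for Python 3 rather than
taken from the plugin. The names parse_eager and parse_lazy are
hypothetical, and the sketch assumes tags are plain ASCII, so they can
stay as byte strings while only the values are decoded:

```python
def parse_eager(raw):
    """Decode the whole line to text first, then split it."""
    text = raw.decode('utf-8')
    info = {}
    for group in text.split('\t'):
        key, value = group.split(': ', 1)
        info[key] = value
    return info


def parse_lazy(raw):
    """Split the raw bytes first and decode only the values.

    Tags and delimiters are never decoded (assumption: tags are ASCII).
    Splitting UTF-8 bytes on b'\t' is safe because multi-byte sequences
    never contain ASCII bytes.
    """
    info = {}
    for group in raw.split(b'\t'):
        key, value = group.split(b': ', 1)
        info[key] = value.decode('utf-8')
    return info


line = 'file_id: f\u00e9-1\tname: r\u00e9sum\u00e9.txt'.encode('utf-8')
lazy = parse_lazy(line)
eager = parse_eager(line)
# Same values either way; the lazy version just keeps byte-string tags.
assert {k.decode('ascii'): v for k, v in lazy.items()} == eager
```

Whether this is faster in a given interpreter would need measuring; the
sketch only shows where the decode moves to.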

$ time bzr rio-test
processed 5623 inventories
extracted in 26.99s (4.8ms avg)
Times   read (s)  read avg (ms)  write (s)  write avg (ms)
xml:       70.77           12.6     267.36            47.5
rio:      167.32           29.8     122.23            21.7
ratio:    2.3643 (read)  0.4572 (write)  0.8563 (read + write)

Times   add (s)  extract (s)
xml:      98.05        17.54
rio:     112.11        21.24
ratio:     1.14         1.21

sizes:  xml.knit  rio.knit  ratio
         6881748   5813586  0.845

Just as a simple test, I went ahead and tried an inventory as a
single-line entry. And performance jumped again. Naturally, the file size
is almost identical to the XML version, the read time drops to almost
exactly 2x that of XML, the write time drops to about 0.4x, and the add &
extract times are essentially the same as for XML.
So I'm guessing we might want to revisit the implementation of rio.
Having continuation lines really hurts its performance. Or at least, I
was able to double its performance by using this:
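
    def unpack_line(line):
        # Reading side; the function name is assumed, mirroring pack_line.
        info = {}
        for group in line.split('\t'):
            # Split on the first ': ' only, so values may contain ': '.
            key, value = group.split(': ', 1)
            info[key] = value
        return info

    def pack_line(info):
        # info is an iterable of (tag, value) pairs with unicode values.
        out = []
        for tag, value in info:
            out.append(tag + ': ' + value.encode('utf-8'))
        return '\t'.join(out)

Now, my version doesn't support all characters in 'value'. It doesn't
support multiple lines, etc. But it seems to be faster.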
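
One possible way to lift that restriction, sketched here as an
assumption and not something that was tested or timed above, is to
backslash-escape tabs and newlines in values. The helper names
(escape_value, unescape_value, pack_line_escaped, unpack_line_escaped)
are hypothetical, and the code is Python 3 working on text strings:

```python
import re

# Hypothetical escaping scheme: backslash, tab and newline are escaped.
_ESCAPES = {'\\': '\\\\', '\t': '\\t', '\n': '\\n'}
_UNESCAPES = {'\\': '\\', 't': '\t', 'n': '\n'}


def escape_value(value):
    """Replace backslash, tab and newline with two-character escapes."""
    return re.sub(r'[\\\t\n]', lambda m: _ESCAPES[m.group(0)], value)


def unescape_value(value):
    """Reverse escape_value in a single left-to-right pass."""
    return re.sub(r'\\([\\tn])', lambda m: _UNESCAPES[m.group(1)], value)


def pack_line_escaped(pairs):
    """Join (tag, value) pairs into one tab-delimited line."""
    return '\t'.join(tag + ': ' + escape_value(value)
                     for tag, value in pairs)


def unpack_line_escaped(line):
    """Split a tab-delimited line back into a dict of tag -> value."""
    info = {}
    for group in line.split('\t'):
        key, value = group.split(': ', 1)
        info[key] = unescape_value(value)
    return info


pairs = [('name', 'a\tb'), ('note', 'line1\nline2'), ('path', 'C:\\temp')]
packed = pack_line_escaped(pairs)
assert '\n' not in packed
assert unpack_line_escaped(packed) == dict(pairs)
```

The extra pass over every value costs time, so it would presumably give
back some of the speed gained by the simple split.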

$ time bzr rio-test
processed 5623 inventories
extracted in 27.99s (5.0ms avg)
Times   read (s)  read avg (ms)  write (s)  write avg (ms)
xml:       70.17           12.5     266.97            47.5
rio:      143.86           25.6     108.48            19.3
ratio:    2.0501 (read)  0.4063 (write)  0.7485 (read + write)

Times   add (s)  extract (s)
xml:      97.53        17.47
rio:      98.59        17.52
ratio:     1.01         1.00

sizes:  xml.knit  rio.knit  ratio
         6881748   6844340  0.995

(As a reference, the plain rio format with one piece of info per line was
3.5x slower than cElementTree at reading. The 1-line format using
'\t'.join() in the inner loop is 2x slower than cElementTree, or almost
2x faster than plain rio.)
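
The ratios quoted here can be rechecked from the timings in the tables.
A small Python 3 sketch (the function name ratios is hypothetical; the
numbers are copied from the first and last runs above):

```python
def ratios(xml_read, xml_write, rio_read, rio_write):
    """Return rio/xml ratios for read, write, and read + write."""
    read = rio_read / xml_read
    write = rio_write / xml_write
    total = (rio_read + rio_write) / (xml_read + xml_write)
    return read, write, total


# Plain rio, one piece of info per line (first run).
plain = ratios(71.82, 270.07, 252.53, 117.99)
# Single-line entries (last run).
one_line = ratios(70.17, 266.97, 143.86, 108.48)

print('plain rio: read %.2fx, write %.2fx, total %.2fx' % plain)
print('1-line:    read %.2fx, write %.2fx, total %.2fx' % one_line)
# Read speedup of the 1-line format over plain rio.
print('1-line vs plain rio read: %.2fx faster' % (252.53 / 143.86))
```

The third number in each "ratio:" row of the output is the combined
read + write ratio, which is why it does not line up with any one column.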
John
=:->