[MERGE][RFC] further add performance improvements

Wed May 24 05:16:59 BST 2006

John A Meinel wrote:

...

>> Well, I don't know what rio format you were choosing to use, but in my
>> test, I found that rio is slower than cElementTree at reading, but can
>> be faster at writing.
>>
>> I have a plugin available from here:
>> http://bzr.arbash-meinel.com/plugins/rio_inventory
> 
> ...
> 
>> I'm also going to be looking into the effect on Knits of the new format.
>>
> 
> I tested creating a Knit with my rio inventory format. With the new
> format, these are the basic results (this involves re-adding all of my
> inventories in my bzr repository):
> 
> processed 5619 inventories
> 
> extracted in 26.71s (4.8ms avg)
> Times   read (s)        read avg (ms)   write (s)       write avg (ms)
> xml:       71.82            12.8          270.07            48.1
> rio:      252.53            44.9          117.99            21.0
> ratio:    3.5160                          0.4369          1.0837
> Times   add (s) extract(s)
> xml:      98.38    18.55
> rio:     272.68    33.37
> ratio:     2.77     1.80
> 
> sizes:  xml.knit  rio.knit  ratio
>          6880558   5693021  0.827
> 
> The add and extract times are so much longer because knit (and weave)
> are line oriented, and now every attribute is another line, rather than
> being more information on the same line.
> 
> But one very nice thing is the extra compression: the rio knit is 82.7%
> of the size of the xml one.
> 
> I'm going to look into a format like my sax work, to see about a 2-line
> inventory style.
> 
> John
> =:->

I was very surprised to see that switching to a 2-line format
drastically improved the performance.

$ time bzr rio-test
processed 5621 inventories

extracted in 27.70s (4.9ms avg)
Times   read (s)        read avg (ms)   write (s)       write avg (ms)
xml:       71.36            12.7          267.69            47.6
rio:      182.37            32.4          123.59            22.0
ratio:    2.5555                          0.4617          0.9024

Times  add (s) extract(s)
xml:     98.53    17.63
rio:    112.32    21.47
ratio:    1.14     1.22

sizes:  xml.knit  rio.knit  ratio
         6881208   5813188  0.845

This change actually dropped the time by 5 minutes. We lose a little bit
of compression (a size ratio of 0.845 instead of 0.827), but not much.
And our read time is within a factor of 3 of a C XML implementation.
(Our write time is about the same as with plain rio.)

The 2-line inventory is very similar to rio, only using '\t' instead of
'\n' as the delimiter. (And thus not allowing \n or \t in the value).
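
To make the delimiter swap concrete, here is a minimal Python 3 sketch that
serialises the same key/value pairs both ways. The field names and values
are made up for illustration and are not bzr's real inventory schema, and
the exact layout of the 2-line format is not spelled out in this mail, so
this only shows the '\n' versus '\t' difference.

```python
# Hypothetical entry: illustrative field names, not bzr's actual schema.
ENTRY = [("file_id", "foo-20060524"), ("name", "foo.txt"), ("kind", "file")]


def rio_style(pairs):
    """One 'key: value' pair per line, as in plain rio."""
    return "".join("%s: %s\n" % (key, value) for key, value in pairs)


def tab_style(pairs):
    """The same pairs on a single line, separated by tabs."""
    return "\t".join("%s: %s" % (key, value) for key, value in pairs) + "\n"


rio_text = rio_style(ENTRY)
tab_text = tab_style(ENTRY)

# A line-oriented store such as a knit sees one line per attribute in the
# first form and one line per entry in the second.
print(rio_text.count("\n"), tab_text.count("\n"))
```

Fewer lines per entry means fewer lines for the knit to track, which is
presumably why the add and extract times move so much.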

A little bit more performance can be extracted by delaying the decode to
unicode. Part of the reason is that we aren't decoding the tags or any
meta characters.
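
A minimal Python 3 sketch of what delaying the decode could look like,
assuming UTF-8 input and the tab-delimited 'key: value' layout. The function
names are hypothetical and this is not the plugin's actual code. The late
version splits the raw bytes first and decodes only the values, so the tags
and delimiters are never decoded.

```python
def parse_decode_first(raw):
    """Decode the whole line to text, then split it."""
    info = {}
    for group in raw.decode("utf-8").split("\t"):
        key, value = group.split(": ", 1)
        info[key] = value
    return info


def parse_decode_late(raw):
    """Split the raw bytes, then decode only the values.

    Keys stay as bytes. Splitting on b'\\t' and b': ' is safe because
    those ASCII bytes never occur inside a multi-byte UTF-8 sequence.
    """
    info = {}
    for group in raw.split(b"\t"):
        key, value = group.split(b": ", 1)
        info[key] = value.decode("utf-8")
    return info


raw_line = "name: caf\u00e9.txt\tkind: file".encode("utf-8")
early = parse_decode_first(raw_line)
late = parse_decode_late(raw_line)
```

Both versions yield the same values; only the key type differs (text in the
first, bytes in the second).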

$ time bzr rio-test
processed 5623 inventories

extracted in 26.99s (4.8ms avg)
Times   read (s)        read avg (ms)   write (s)       write avg (ms)
xml:       70.77            12.6          267.36            47.5
rio:      167.32            29.8          122.23            21.7
ratio:    2.3643                          0.4572          0.8563

Times  add (s) extract(s)
xml:     98.05    17.54
rio:    112.11    21.24
ratio:    1.14     1.21

sizes:  xml.knit  rio.knit  ratio
         6881748   5813586  0.845

Just as a simple test, I went ahead and tried an inventory as a
single-line entry, and performance jumped again. Naturally, the file size
is almost identical to the XML version. The read time drops to almost
exactly 2x XML, the write time drops to about 0.4x, and the add & extract
times now match the XML ones.

So I'm guessing we might want to revisit the implementation of rio.
Having continuation lines really hurts its performance. Or at least, I
was able to double its performance by using this:

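def unpack_line(line):
    # Parse one tab-delimited 'key: value' line into a dict.
    # (The function name is illustrative.)
    info = {}
    for group in line.split('\t'):
        key, value = group.split(': ', 1)
        info[key] = value
    return info

def pack_line(info):
    # info is a sequence of (tag, value) pairs with unicode values.
    out = []
    for tag, value in info:
        out.append(tag + ': ' + value.encode('utf-8'))
    return '\t'.join(out)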

Now, my version doesn't support all characters in 'value'. It doesn't
support multiple lines, etc. But it seems to be faster.
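
Since a tab or newline inside a value would silently corrupt the line, a
writer could refuse such values up front. The following is a minimal Python 3
sketch of that guard under the same 'key: value' layout. The function names
are hypothetical, and rejecting (rather than escaping) bad values is an
assumption, not something the plugin is known to do.

```python
def pack_checked(pairs):
    """Join (tag, value) pairs with tabs, rejecting unsupported values."""
    out = []
    for tag, value in pairs:
        if "\t" in value or "\n" in value:
            raise ValueError("value for %r contains a tab or newline" % (tag,))
        out.append(tag + ": " + value)
    return "\t".join(out).encode("utf-8")


def unpack(raw):
    """Inverse of pack_checked: bytes in, dict of text values out."""
    info = {}
    for group in raw.decode("utf-8").split("\t"):
        # maxsplit=1 keeps any ': ' inside the value intact.
        key, value = group.split(": ", 1)
        info[key] = value
    return info


pairs = [("name", "a: b.txt"), ("kind", "file")]
round_tripped = unpack(pack_checked(pairs))

try:
    pack_checked([("name", "bad\tname")])
    rejected = False
except ValueError:
    rejected = True
```

Values containing ': ' survive the round trip because only the first
occurrence in each group is treated as the separator.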

$ time bzr rio-test
processed 5623 inventories

extracted in 27.99s (5.0ms avg)
Times   read (s)        read avg (ms)   write (s)       write avg (ms)
xml:       70.17            12.5          266.97            47.5
rio:      143.86            25.6          108.48            19.3
ratio:    2.0501                          0.4063          0.7485

Times   add (s) extract(s)
xml:      97.53    17.47
rio:      98.59    17.52
ratio:     1.01     1.00

sizes:  xml.knit  rio.knit  ratio
         6881748   6844340  0.995
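
The read ratios in these tables are simply rio time divided by xml time.
A small Python 3 check, using only the read totals printed in this thread
(the dictionary and its labels are just for illustration):

```python
# (xml read seconds, rio read seconds) copied from the tables in this thread.
READ_TIMES = {
    "plain rio": (71.82, 252.53),
    "2-line": (71.36, 182.37),
    "1-line": (70.17, 143.86),
}

# Ratio > 1 means the rio variant reads slower than cElementTree.
ratios = {name: rio / xml for name, (xml, rio) in READ_TIMES.items()}

# How much faster the 1-line variant reads compared with plain rio.
speedup = READ_TIMES["plain rio"][1] / READ_TIMES["1-line"][1]

for name, ratio in ratios.items():
    print("%-10s %.4f" % (name, ratio))
print("1-line vs plain rio: %.2fx" % speedup)
```

The computed ratios differ from the printed ones only in the last digit or
so, since the tables show rounded totals. The speedup works out to roughly
1.75x.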

(As a reference, the plain rio format with one piece of info per line was
3.5x slower than cElementTree at reading. The 1-line format using
'\t'.join() in the inner loop is 2x slower, which makes it roughly 1.75x
faster than plain rio.)

John
=:->
