[merge] cache encoding

John Arbash Meinel john at arbash-meinel.com
Thu Aug 10 15:07:45 BST 2006


Attached is a bundle that caches encoding and decoding to and from
utf-8. The biggest application for this is that when you commit a new
kernel tree, it has to annotate every line in the tree with the current
revision. The specific hot spot I saw was this line in knit:

return ['%s %s' % (o.encode('utf-8'), t) for o, t in content._lines]

So basically, it was doing a fresh encode for *every* line, and a new
kernel tree has 7.7M lines. This doesn't account for a huge portion of
the overall time (only about 45s out of a ~10 minute total), but it
doesn't hurt to do it faster.

The attached patch also uses the same caches so that file ids and
revision ids are shared in memory. It's rather like using intern(),
except that intern() doesn't work on unicode strings.
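
To give a feel for it, here is a minimal sketch of the caching idea
(simplified; the actual patch differs in the details). Two module-level
dicts map between the two forms, so a given id is encoded or decoded
once and both forms are shared afterwards, intern()-style:

_unicode_to_utf8 = {}
_utf8_to_unicode = {}

def encode_utf8(unicode_str):
    """Encode unicode => utf-8, caching and sharing both forms."""
    try:
        return _unicode_to_utf8[unicode_str]
    except KeyError:
        utf8_str = unicode_str.encode('utf-8')
        _unicode_to_utf8[unicode_str] = utf8_str
        _utf8_to_unicode[utf8_str] = unicode_str
        return utf8_str

def decode_utf8(utf8_str):
    """Decode utf-8 => unicode, caching and sharing both forms."""
    try:
        return _utf8_to_unicode[utf8_str]
    except KeyError:
        unicode_str = utf8_str.decode('utf-8')
        _utf8_to_unicode[utf8_str] = unicode_str
        _unicode_to_utf8[unicode_str] = utf8_str
        return unicode_str

With that in place, the knit line above becomes:

return ['%s %s' % (encode_utf8(o), t) for o, t in content._lines]

and repeated revision ids hit the dict instead of being re-encoded.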

I also added some benchmarks. On my machine, 1M foo.decode() calls take
about 4s, while 1M decode_utf8() lookups take about 800ms. foo.encode()
is faster, taking only 1.5-2.5s, but encode_utf8() still takes only
about 800ms.
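
For the curious, those numbers came from a loop of roughly this shape
(a sketch using the functions above, not the exact benchmark code in
the patch):

import time

def bench(func, arg, count=1000000):
    # Time count calls of func(arg); crude, but enough to compare.
    start = time.time()
    for _ in xrange(count):
        func(arg)
    return time.time() - start

u = u'example-revision-id'
s = u.encode('utf-8')
print 'plain decode:  %.2fs' % bench(lambda x: x.decode('utf-8'), s)
print 'cached decode: %.2fs' % bench(decode_utf8, s)
print 'plain encode:  %.2fs' % bench(lambda x: x.encode('utf-8'), u)
print 'cached encode: %.2fs' % bench(encode_utf8, u)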

I originally planned to look into having Knit objects remember
everything in utf-8 form, rather than storing it in memory as Unicode.
But that required a lot more changes, especially since we don't want
the external view to change. It means really going through the api and
making sure things are what they seem.
(I also seem to recall finding bugs, such as KnitIndex not really
encoding/decoding properly. So if we ever had real unicode revision
ids, it wouldn't work correctly.)

I have also considered that we might want to switch a revision_id from
being a Unicode string to being a utf-8 bytestring. We do a lot of
encode/decode operations just to massage it between the on-disk form
and the in-memory form, when we never really need to operate on it as
a Unicode object. (We do need Unicode filenames, because of Windows and
Mac API issues, though I would consider trying to use unicode only as
the last step, and use utf-8 internally.)
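
As a purely hypothetical illustration of what "unicode only as the last
step" would mean (the function name here is made up), ids would stay
utf-8 bytestrings through the whole call chain, with one decode at the
OS boundary:

import os

def open_versioned_file(base_dir, file_id_utf8):
    # The one decode, right at the edge, where the Windows and Mac
    # filesystem APIs want a unicode name.
    unicode_name = decode_utf8(file_id_utf8)
    return open(os.path.join(base_dir, unicode_name), 'rb')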

I also think this can help a future serializer, because it could use
the cached encode/decode when creating InventoryEntries, etc.
Theoretically, it could save memory when dealing with large trees,
since it doesn't have to store a separate string for every object. In
my limited 'branch' testing on the Samba tree, it actually cost a
little bit of memory (about 5MB or so), I assume because it has to
store both the utf8 and unicode forms. And 'branch' doesn't really
hang on to things anyway, not as much as, say, 'commit' (which reads 4
inventories and creates a 5th).

Anyway, I figure I can get some feedback before I do any more work on
it. I'm sure it helps for Knits, but the cache could certainly be done
more locally, rather than as a global encoding cache, for 99% of the
same performance. I just thought it could be generally useful
throughout the code.

John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cache-encode.patch
Type: text/x-patch
Size: 83236 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060810/770024f4/attachment.bin 

