[merge] cache encoding

Sun Aug 13 01:48:46 BST 2006

holger krekel wrote:
> On Thu, Aug 10, 2006 at 09:07 -0500, John Arbash Meinel wrote:
>> Attached is a bundle that caches encode/decode from and to utf8. The
>> biggest application for this is the fact that when you commit a new
>> kernel tree, it has to annotate every line in the tree with the current
>> revision. The specific location that I saw was this line in knit:
>>
>> return ['%s %s' % (o.encode('utf-8'), t) for o, t in content._lines]
>>
>> So basically, it was doing a new encode for *every* line, and a new
>> kernel tree has 7.7M lines. This doesn't account for a huge portion of
>> the overall time (only about 45s out of 10min), but it doesn't hurt
>> to do it faster.
> 
> Ouch.  Btw, is there documentation on the general strategy for how
> bzr deals with Unicode?  It does not use the somewhat common scheme
> of "always use unicode, only convert at specified barriers", does it?
> 
> best,
> 
>     holger

Well, that is the specific barrier. When we write the lines out to save
them, we annotate each line with what version it came from. In the case
of a freshly added kernel tree, that is 7.7M lines to add.

In general, revision ids, file ids, and all bzr messages are Unicode
internally. When they are read they need to be decoded from UTF-8, and
when they are written they need to be encoded into UTF-8.

John
=:->
