[merge] cache encoding
John Arbash Meinel
john at arbash-meinel.com
Sun Aug 13 01:48:46 BST 2006
holger krekel wrote:
> On Thu, Aug 10, 2006 at 09:07 -0500, John Arbash Meinel wrote:
>> Attached is a bundle that caches encode/decode from and to utf8. The
>> biggest application for this is the fact that when you commit a new
>> kernel tree, it has to annotate every line in the tree with the current
>> revision. The specific location that I saw was this line in knit:
>>     return ['%s %s' % (o.encode('utf-8'), t) for o, t in content._lines]
>> So basically, it was doing a new encode for *every* line, and a new
>> kernel tree has 7.7M lines. This doesn't account for a huge
>> portion of the overall time (only about 45s/10min). But it doesn't hurt
>> to do it faster.
> Ouch. Btw, is there documentation on the general strategy for how
> bzr deals with unicode? It does not use the somewhat common scheme
> of "always use unicode, only convert at specified barriers", does it?
Well, that is the specific barrier. When we write the lines out to save
them, we annotate each line with what version it came from. In the case
of a freshly added kernel tree, that is 7.7M lines to add.
In general, revision ids, file ids, and all bzr messages are Unicode
internally. And then when they are read/written they need to be encoded