[merge] cache encoding

Sun Aug 13 01:59:09 BST 2006

Martin Pool wrote:
> On 10 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> Attached is a bundle that caches encode/decode from and to utf8. The
>> biggest application for this is the fact that when you commit a new
>> kernel tree, it has to annotate every line in the tree with the current
>> revision. The specific location that I saw was this line in knit
>>
>> return ['%s %s' % (o.encode('utf-8'), t) for o, t in content._lines]
>>
>> So basically, it was doing a new encode for *every* line, and a new
>> kernel tree has 7.7M lines. This doesn't account for a huge portion of
>> the overall time (only about 45s/10min), but it doesn't hurt to do it
>> faster.
>>
>> The attached patch also uses the same caches so that file ids and
>> revision ids are cached in memory. Sort of like using intern(), only
>> intern() doesn't work on unicode strings.
>>
>> I also added some benchmarks. On my machine, 1M foo.decode() calls
>> take 4s, but 1M decode_utf8() lookups take about 800ms. foo.encode()
>> is faster, taking only 1.5-2.5s, but 1M encode_utf8() lookups also
>> take only 800ms.
> 
> I haven't read the patch in detail but it sounds broadly plausible to
> cache those transformations.
> 
> I rather wish I'd insisted that revision ids be just restricted ascii in
> the first place so we could avoid more of these issues altogether.
> Would it be too late to at least require they're ascii?  Perhaps we can
> transition to doing that.  It's kind of pointless to spend time
> converting something which is likely not going to be Unicode at all.

Well, bzr itself has not been able to create anything but restricted
ascii revision ids and file ids for a while.

I guess we could get non-ascii revision ids if the user's email
contained a non-ascii character. The file-id generator removes
everything that matches '[^\w.]', which I believe leaves only something
like "a-zA-Z0-9_." (It should even remove '-'.)
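
As a rough illustration of what that pattern does (a sketch only, not
the actual bzrlib code; sanitize_for_file_id is a made-up name, and I
am assuming an ASCII-only \w, which Python 3 needs re.ASCII for):

```python
import re

# Assumption: \w should match only a-z, A-Z, 0-9 and '_'.
# In Python 3 that requires the re.ASCII flag for str patterns.
_UNSAFE_CHARS = re.compile(r'[^\w.]', re.ASCII)


def sanitize_for_file_id(name):
    """Drop every character that is not a word character or '.'.

    Hypothetical helper, named here for illustration only.
    """
    return _UNSAFE_CHARS.sub('', name)


if __name__ == '__main__':
    # '-' and ' ' are stripped along with any non-ascii characters.
    print(sanitize_for_file_id('my-file name.txt'))
    print(sanitize_for_file_id('caf\u00e9.c'))
```

With a Unicode-aware \w the accented character would survive, so the
flag (or locale) in effect decides whether the result is really ascii.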

For revision ids we use:
        s = '%s-%s-' % (self._config.user_email(),
                        compact_date(self._timestamp))
        s += hexlify(rand_bytes(8))

Obviously timestamp and hexlify won't generate non-ascii.
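
To make that concrete, here is a standalone sketch of the same recipe.
compact_date and the random bytes are stand-ins I wrote for this
example (time.strftime and os.urandom), not the bzrlib helpers, and
make_revision_id and is_ascii are hypothetical names:

```python
import time
from binascii import hexlify
from os import urandom


def compact_date(timestamp):
    # Stand-in for the bzrlib helper: digits only, in UTC (assumption).
    return time.strftime('%Y%m%d%H%M%S', time.gmtime(timestamp))


def make_revision_id(user_email, timestamp):
    """Build '<email>-<date>-<16 hex digits>' as in the snippet above."""
    s = '%s-%s-' % (user_email, compact_date(timestamp))
    s += hexlify(urandom(8)).decode('ascii')
    return s


def is_ascii(text):
    """Return True if text can be encoded as plain ascii."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


if __name__ == '__main__':
    print(is_ascii(make_revision_id('john@example.com', time.time())))
    print(is_ascii(make_revision_id('j\u00f6hn@example.com', time.time())))
```

The only part of the id that can carry non-ascii characters is the
email, which is why that is the case to worry about.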

The problem, though, is code like 'Tailor', et al. I know we are safe for
baz=>bzr, because Arch never supported anything other than ASCII (at
least not officially).

> 
> Even leaving aside the costs of conversion we should consider more
> generally what the costs are of storing this at commit time.
> 

The other thing to realize is we use cElementTree (right now) to read
them. So even if we serialize to utf-8, the read() step is going to
turn them back into Unicode (even if they are plain ascii).

We could always add another step to turn them back into ascii, though.
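
Something along these lines, as a sketch. It uses the standard library
xml.etree.ElementTree rather than cElementTree, and the element and
attribute names are invented for the example, not the real
serialization format:

```python
import xml.etree.ElementTree as ET


def read_revision_id(xml_text):
    """Parse one element and return its id as ascii bytes.

    Hypothetical helper: the 'revision_id' attribute name is an
    assumption made for this sketch.
    """
    elem = ET.fromstring(xml_text)
    rev_id = elem.get('revision_id')  # the parser hands back text
    # The extra step: force the id back down to plain ascii.
    # A non-ascii id raises UnicodeEncodeError here.
    return rev_id.encode('ascii')


if __name__ == '__main__':
    doc = '<revision revision_id="john@example.com-20060813-0123abcd"/>'
    print(read_revision_id(doc))
```

The encode step doubles as a check: an id that is not plain ascii fails
loudly at read time instead of slipping through.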

Unicode revision ids wouldn't have worked at all in v5 or v6 formats
(because we did a hash of the revision id to determine where it would be
put on the filesystem, and I don't think we checked for unicode).
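
To show why an unchecked id matters there, here is a sketch of a
hashed-directory lookup. The adler32-based two-hex-digit prefix is my
assumption for illustration and should not be read as the exact v5/v6
scheme; hash_prefix and store_path are hypothetical names:

```python
import zlib


def hash_prefix(revision_id_bytes):
    """Map an id to a two-hex-digit directory such as '27/'.

    Assumption: the low byte of adler32 picks the bucket.
    """
    return '%02x/' % (zlib.adler32(revision_id_bytes) & 0xff)


def store_path(revision_id):
    """Return the relative path for a text revision id."""
    # The hash works on bytes, so the id must be encodable first.
    # A non-ascii id raises UnicodeEncodeError at this point.
    raw = revision_id.encode('ascii')
    return hash_prefix(raw) + revision_id


if __name__ == '__main__':
    print(store_path('john@example.com-20060813-0123abcd'))
```

Any code path that hashes the id without encoding it explicitly would
fail, or hash inconsistently, as soon as a non-ascii id showed up.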

For file ids, it would currently break in bzr.dev. So I'm guessing we
don't have any revision ids or file ids that are non-ascii out in the
wild. I can't promise that for revision ids, but I can promise it for
file ids.

I realize that revision ids are mostly arbitrary. But people are
increasingly assigning meaning to them, mostly as part of the
conversion process.

I don't know how SVN handles non-ascii paths, but I know bzr-svn uses
path-from-root-to-branch as part of its revision id. So if you have a
non-ascii path, you need a non-ascii revision id.

I do believe the utf8=>unicode conversion is down in the noise right
now, but that doesn't mean it won't become a bigger deal as we do more work.

(Like with dirstate, we can shave off a bit of time if we leave things
as utf8.)
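
For anyone who hasn't opened the bundle, the caching idea from the
quoted message is roughly the following. This is a minimal sketch, not
the patch itself: encode_utf8 and decode_utf8 are the names used above,
but the two-dictionary layout is my assumption for the example.

```python
# Two module-level dictionaries, one per direction (assumed layout).
_unicode_to_utf8 = {}
_utf8_to_unicode = {}


def encode_utf8(text):
    """Return the utf-8 bytes for text, reusing a cached result."""
    try:
        return _unicode_to_utf8[text]
    except KeyError:
        encoded = text.encode('utf-8')
        # Fill both directions so the reverse lookup is also a hit.
        _unicode_to_utf8[text] = encoded
        _utf8_to_unicode[encoded] = text
        return encoded


def decode_utf8(data):
    """Return the text for utf-8 bytes, reusing a cached result."""
    try:
        return _utf8_to_unicode[data]
    except KeyError:
        decoded = data.decode('utf-8')
        _utf8_to_unicode[data] = decoded
        _unicode_to_utf8[decoded] = data
        return decoded


if __name__ == '__main__':
    first = encode_utf8('john@example.com-20060813-0123abcd')
    second = encode_utf8('john@example.com-20060813-0123abcd')
    # Same object both times, which is the intern()-like effect.
    print(first is second)
```

The caches grow without bound in this form, so it only makes sense for
values that repeat a lot, like revision ids and file ids.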

John
=:->
