[merge] cache encoding

Martin Pool mbp at canonical.com
Fri Aug 11 02:03:25 BST 2006


On 10 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Attached is a bundle that caches encode/decode from and to utf8. The
> biggest application for this is the fact that when you commit a new
> kernel tree, it has to annotate every line in the tree with the current
> revision. The specific location that I saw was this line in knit
> 
> return ['%s %s' % (o.encode('utf-8'), t) for o, t in content._lines]
> 
> So basically, it was doing a new encode for *every* line, and a new
> kernel tree has 7.7M lines. This doesn't account for a huge portion
> of the overall time (only about 45s/10min), but it doesn't hurt to
> do it faster.
> 
> The attached patch also uses the same caches so that file ids and
> revision ids are cached in memory. Sort of like using intern(), only
> intern() doesn't work on unicode strings.
> 
> I also added some benchmarks. On my machine, 1M foo.decode() calls
> take 4s, but 1M decode_utf8() lookups take about 800ms.
> foo.encode() is faster, taking only 1.5-2.5s, but encode_utf8()
> likewise takes only about 800ms.
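The caching described in the quoted patch could be sketched roughly like this (the names mirror those mentioned above, but the details are illustrative rather than the actual bzrlib code, and it is written in modern Python, where the cache maps str to UTF-8 bytes and back):

```python
# Illustrative sketch of the encode/decode caching idea, not the
# actual patch. Two dicts remember each string's UTF-8 encoding and
# each byte string's decoding, so repeated revision ids and file ids
# skip the conversion and also share one cached object, much like
# intern() does for plain strings.

_encode_cache = {}
_decode_cache = {}

def encode_utf8(text):
    """Return the UTF-8 encoding of text, cached across calls."""
    try:
        return _encode_cache[text]
    except KeyError:
        encoded = _encode_cache[text] = text.encode('utf-8')
        return encoded

def decode_utf8(data):
    """Return the decoded form of the UTF-8 bytes data, cached."""
    try:
        return _decode_cache[data]
    except KeyError:
        decoded = _decode_cache[data] = data.decode('utf-8')
        return decoded
```

On a hot path like annotating 7.7M lines, most ids repeat many times, so a dict lookup replaces nearly every encode() call.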

I haven't read the patch in detail, but it sounds broadly plausible to
cache those transformations.

I rather wish I'd insisted that revision ids be restricted to plain
ascii in the first place, so we could have avoided more of these issues
altogether.  Would it be too late to at least require that they're
ascii?  Perhaps we can transition to doing that.  It's rather pointless
to spend time converting something which is likely not going to contain
non-ascii characters at all.
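If revision ids were required to be ascii, the check could be a cheap one-off validation when an id is created, rather than a conversion on every use. A hypothetical helper (not something from bzrlib) might look like:

```python
# Hypothetical validator, not bzrlib code: if revision ids were
# constrained to ASCII at creation time, no per-use UTF-8
# conversion (or conversion cache) would be needed for them later.

def require_ascii(revision_id):
    """Reject revision ids containing non-ASCII characters."""
    try:
        revision_id.encode('ascii')
    except UnicodeEncodeError:
        raise ValueError('revision id is not ASCII: %r' % (revision_id,))
    return revision_id
```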

Even leaving aside the costs of conversion we should consider more
generally what the costs are of storing this at commit time.

-- 
Martin

More information about the bazaar mailing list