[RFC] Use utf-8 revision ids

Wed Jan 31 20:27:51 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This patch is not ready to be completely merged, but I think the
performance results show that it is worth being considered.

Right now all of our natively generated bzr revision ids and file ids
are ASCII only, because we explicitly strip out the other characters.
We've made the statement that they are Unicode, but we can pretty easily
change that to saying they must be UTF-8, or even ASCII only (you would
have to encode your information somehow).

The attached patch changes the Knit reading code so that it does not
decode revision ids, both when parsing the line deltas and fulltexts
and when parsing KnitIndex files.
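
As a rough illustration of the idea (this is not the code in the attached
patch), a reader can either decode each revision id into a Unicode string
or keep the UTF-8 bytes exactly as they came off disk. The line layout and
the function names below are hypothetical, and the sketch uses Python 3
byte strings:

```python
# Hypothetical line layout: "<revision-id> <option> <parent-id>" in UTF-8.
# This is NOT the real KnitIndex format, only a stand-in for illustration.

def parse_decoding(line):
    """Decode every field into a Unicode string (the current behaviour)."""
    return [field.decode('utf-8') for field in line.split()]


def parse_raw(line):
    """Keep every field as the UTF-8 bytes that were read (the proposal)."""
    return line.split()


line = b"joe@example.com-20070131-abcdef fulltext jane@example.com-20070130-123456"
decoded = parse_decoding(line)
raw = parse_raw(line)
```

Both functions return the same fields; only the type differs, so callers
would have to compare ids as UTF-8 bytes rather than as Unicode strings.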

On my machine, this has a pretty big impact on the time to
'bzr checkout --lightweight' a copy of bzr.dev.

Specifically, the checkout takes an average of 4.2s (+-0.05s) of user
time with bzr.dev, versus 3.7s (+-0.05s) with the patch. That is a
savings of 500ms (about 10% of our build-tree time).

Now, this is CPU time, not actual time, but I've found actual time to be
very unstable. To give an example:

bzr.dev
user  sys   total
4.19s 0.51s 4.836
4.20s 0.48s 9.061
4.29s 0.41s 6.594
4.15s 0.50s 8.655
4.24s 0.44s 6.914
4.23s 0.46s 8.136
4.22s 0.41s 7.035
4.18s 0.48s 6.614
4.29s 0.43s 8.423
4.18s 0.49s 6.577
4.19s 0.48s 8.317
4.21s 0.44s 6.707
4.30s 0.45s 6.766
4.18s 0.48s 6.840
4.17s 0.48s 6.373

no-decode
user  sys   total
3.73s 0.44s 4.179
3.70s 0.46s 8.243
3.81s 0.47s 6.351
3.79s 0.43s 6.209
3.72s 0.52s 8.279
3.68s 0.52s 5.851
3.80s 0.41s 6.361
3.69s 0.50s 6.675
3.71s 0.46s 4.738
3.71s 0.44s 7.315
3.76s 0.48s 6.367
3.83s 0.46s 6.366
3.70s 0.54s 5.548
3.66s 0.52s 8.022
3.71s 0.46s 6.459

You can see that user and sys time are actually fairly stable (they vary
by ~0.1s, which is a few percent of user time or maybe 20% of sys time).
However, the 'total' time varies dramatically, by as much as 100%
(8.2s versus 4.2s).
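
For anyone who wants to check the spread themselves, here is a small
sketch that computes the mean and the min-to-max range for the bzr.dev
runs. The numbers are copied from the bzr.dev table above; the helper
name `spread` is just something made up for this example:

```python
from statistics import mean

# Times in seconds, copied from the bzr.dev table above.
user = [4.19, 4.20, 4.29, 4.15, 4.24, 4.23, 4.22, 4.18,
        4.29, 4.18, 4.19, 4.21, 4.30, 4.18, 4.17]
total = [4.836, 9.061, 6.594, 8.655, 6.914, 8.136, 7.035, 6.614,
         8.423, 6.577, 8.317, 6.707, 6.766, 6.840, 6.373]


def spread(values):
    """Return (mean, max - min, range as a fraction of the mean)."""
    m = mean(values)
    r = max(values) - min(values)
    return m, r, r / m


for name, values in (('user', user), ('total', total)):
    m, r, frac = spread(values)
    print('%-5s mean=%.2fs range=%.2fs (%.0f%% of mean)'
          % (name, m, r, 100 * frac))
```

The same helper can be pointed at the no-decode columns to compare the
two sets of runs.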

Now this could be what hg was talking about when they were dealing with
'seek' time issues. This is on a fairly quiet machine with lots of RAM
to cache disk content, though, so disk seek time shouldn't really be a
factor.

I think the big wins with the new code are:

1) Avoid calling decode entirely.
2) Can use a list comprehension that only does 'string.split()', so
there is only a C function call.
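
The two points above can be sketched as a micro-benchmark. This is only
an approximation under stated assumptions: the line format is invented,
the function names are hypothetical, it uses Python 3 byte strings, and
the timings it prints will not match the checkout numbers in this mail:

```python
import timeit

# Hypothetical data: 1000 lines of "<revision-id> <text>" as UTF-8 bytes.
lines = [b"rev-%d some line content" % i for i in range(1000)]


def with_decode(lines):
    # Splits each line, then makes an extra decode call per revision id.
    return [line.split(b' ', 1)[0].decode('utf-8') for line in lines]


def split_only(lines):
    # The only per-line call is bytes.split(), which is implemented in C.
    return [line.split(b' ', 1)[0] for line in lines]


if __name__ == '__main__':
    print('decode:', timeit.timeit(lambda: with_decode(lines), number=200))
    print('split :', timeit.timeit(lambda: split_only(lines), number=200))
```

Both functions pull out the same ids; the second one simply leaves them
as bytes, which is where any saving would come from.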

Now, I know we've talked about switching before, but I thought it was
worthwhile to show specifically how it would affect a given operation.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFwPvHJdeBCYSNAAMRAveHAKCpW2tE7AFF/LucZwC1Ya5BMHYyUwCgrf0b
+wE402TpDJTM1J8066rZFTU=
=awYW
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: knit_no_decode.diff
Type: text/x-patch
Size: 2142 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070131/f241da65/attachment.bin