[MERGE] switch to using utf-8 revision ids

John Arbash Meinel john at arbash-meinel.com
Tue Feb 13 14:27:20 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The attached patch changes the internals to assume that revision ids are
utf-8 strings, rather than being Unicode strings.

It changes the apis for WorkingTree, Branch, Repository, and
VersionedFile so that they expect and return utf-8 revision ids. Most
functions have an osutils.safe_revision_id() call, which will encode to
utf-8 if the revision-id is Unicode, so we maintain compatibility.
Eventually, I would like safe_revision_id() to issue a deprecation
warning if it gets a Unicode revision id, but I think it is best to take
it a step at a time.
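
A minimal sketch of the compatibility shim described above, written in
modern Python, where `str` is the Unicode type and `bytes` holds the utf-8
form. This is not the code from the patch: the function body and the `warn`
flag are assumptions for illustration, and only the name
safe_revision_id() comes from the message.

```python
import warnings


def safe_revision_id(revision_id, warn=False):
    """Return revision_id as utf-8 bytes (illustrative sketch only)."""
    if revision_id is None:
        # Pass "no revision" through untouched.
        return None
    if isinstance(revision_id, str):
        # Unicode input: optionally warn, then encode to utf-8.
        if warn:
            warnings.warn(
                "Unicode revision ids are deprecated; pass utf-8 bytes",
                DeprecationWarning,
                stacklevel=2,
            )
        return revision_id.encode("utf-8")
    # Already a utf-8 byte string: return it unchanged.
    return revision_id
```

Callers that already pass utf-8 byte strings pay only for a type check,
while older callers passing Unicode keep working. The hypothetical `warn`
flag stands in for the later deprecation step.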

The big advantage of this patch is that we can improve our read times by
not needing to decode all of the 'revision-history', and '.kndx' files.
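
To illustrate where the saving comes from, here is a rough sketch that
assumes a history file holding one revision id per line. The two helper
names and the sample ids are hypothetical, and this is not the bzrlib
parsing code.

```python
def parse_history_decoded(data):
    """Old approach: decode every line to a Unicode string."""
    return [line.decode("utf-8") for line in data.splitlines()]


def parse_history_utf8(data):
    """New approach: keep each id as the utf-8 bytes read from disk."""
    return data.splitlines()


# Hypothetical file contents, one revision id per line.
sample = b"joe@example.com-20070101-aaaa\njoe@example.com-20070102-bbbb\n"

unicode_ids = parse_history_decoded(sample)
utf8_ids = parse_history_utf8(sample)
```

Both functions return the same ids in the same order. The second skips one
decode call per line, which adds up across large history and index files.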

This code is actually really safe right now, because bzr only generates
ascii revision ids. (gen_revision_id and gen_file_id both have a very
long history of generating ascii-only ids). The only place we can get
Unicode ones right now might be from bzr-svn. Or maybe one of the other
converters. (But most of them use bzr to generate new revision ids;
baz-import and bzr-svn are the only ones I know which use revision ids
from another source, and Arch never really supported non-ascii archives,
AFAIK).

I'd like to get this sort of thing merged into bzr.dev early, since it
will give it a little bit more time for finding edge cases that expect a
Unicode string versus utf-8 string, etc. (I found some small problems
with TextStore and unicode versus utf-8).


I would also like to end up switching our file-ids to also be utf-8 only
(rather than unicode). I don't think it will help performance as much,
but I think it can help a lot when we switch to dirstate or other
formats that don't go through XML. (cElementTree will upcast to Unicode
for us if it holds non-ascii characters, so converting back to utf-8 is
actually more work for the XML serializers).

This patch drops the cpu time from 'bzr checkout --lightweight bzr.dev'
from 4.30s down to around 3.75s. So approximately 500ms improvement. The
total

For 'bzr checkout --lightweight mozilla/HEAD' which has 50,000 files, it
drops the user time from 345s to 332s (approx 13s).

I wish the performance difference were larger, but it does open up the
ability for us to have a serialized inventory format that doesn't need
to decode everything back to unicode.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF0crIJdeBCYSNAAMRArHBAJ4tzYFajYTf4pQM4sMeRIWAYF220ACgsXIj
bQIXlTyJ8jTMRhVKy9A3QqI=
=v2fG
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: knit_utf8_revision_ids.patch
Type: text/x-patch
Size: 268528 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070213/399ccb60/attachment-0001.bin 

