[RFC] Use utf-8 revision ids
John Arbash Meinel
john at arbash-meinel.com
Thu Feb 1 15:55:57 GMT 2007
Martin Pool wrote:
> On 31 Jan 2007, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> This patch is not ready to be completely merged, but I think the
>> performance results show that it is worth being considered.
>>
>> Right now all of our natively generated bzr revision ids and file ids
>> are ascii only, because we explicitly strip out the other characters.
>> We've made the statement that they are Unicode, but we can pretty easily
>> change that to saying they must be utf-8. Or even ascii only (you would
>> have to encode your information somehow).
>>
>> The attached patch changes the Knit reading code, so that it does not
>> decode revision ids. Both as part of parsing the line deltas and
>> fulltexts, and as part of parsing KnitIndex files.
>
> This patch is ok with me. It's a nice saving.
>
> It does seem there may be some breakage if people have somehow generated
> non-ascii ids, and their default encoding is not utf-8 -- they may get
> an exception when trying to compare the str value to a unicode revision
> id they got from somewhere else. But such data should be rare, and if
> we generally go towards treating these ids just as strs then it should
> be ok.
>
>
Well, I would like to update the XML Serializer to re-encode the
unicode strings back into utf-8 (it already does a get_cached_unicode
call, so this won't add any overhead; in fact it should decrease
it slightly).
And then I need to change the serializer to expect them to be utf-8 so
it doesn't try to encode them again.
And then audit some of the rest of the code base to make sure that we
aren't generating unicode strings. For example, update gen_revision_id()
to create a utf-8 string, and maybe some other things, like having
commit() take a 'rev_id' and then do:
  rev_id = safe_utf8(rev_id)
This is sort of like "safe_unicode", and is basically:
  def safe_utf8(unicode_or_utf8, deprecated=True):
      # Encode unicode ids to utf-8; utf-8 strs pass through untouched.
      if isinstance(unicode_or_utf8, unicode):
          if deprecated:
              warn('Unicode revision ids were deprecated in bzr-0.15, '
                   'use a utf-8 string instead')
          return unicode_or_utf8.encode('utf8')
      return unicode_or_utf8
Or something like that. And then things which are clearly user-facing
functions can be re-defined to include safe_utf8.
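
For anyone who wants to try the idea outside of bzrlib, here is a
minimal runnable sketch of the same helper. It is an assumption-laden
analogue, not bzrlib code: it targets Python 3, where str plays the
role of the old unicode type and bytes plays the role of a utf-8 str,
and it uses the standard warnings module with DeprecationWarning in
place of whatever warn() bzrlib would use.

```python
import warnings


def safe_utf8(text_or_utf8, deprecated=True):
    """Return a revision id as utf-8 bytes.

    Python 3 analogue of the sketch above: text (str) input is encoded,
    bytes input is assumed to be utf-8 already and is returned as-is.
    """
    if isinstance(text_or_utf8, str):
        if deprecated:
            # Point the warning at the caller, not at this helper.
            warnings.warn(
                'Unicode revision ids are deprecated, '
                'use a utf-8 byte string instead',
                DeprecationWarning,
                stacklevel=2,
            )
        return text_or_utf8.encode('utf-8')
    return text_or_utf8
```

A user-facing function would call this once at its boundary (for
example on a rev_id argument) so that everything below it only ever
sees utf-8 byte strings.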
Also, this would change Branch.revision_history() so that it doesn't
decode utf-8, etc.
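
To make the revision_history() point concrete, here is a rough sketch
of the difference, assuming the history is stored as newline-separated
utf-8 revision ids. The function names and the sample ids are
hypothetical and are not bzrlib API; Python 3 bytes stand in for utf-8
strs.

```python
def parse_history_decoded(raw):
    """Decode every revision id into a text string (one decode per id)."""
    return [line.decode('utf-8') for line in raw.split(b'\n') if line]


def parse_history_utf8(raw):
    """Keep each revision id as utf-8 bytes, with no per-id decode."""
    return [line for line in raw.split(b'\n') if line]


# Illustrative ids only; real ids would come from the branch.
raw = b'joe@example.com-20070131-aaaa\njoe@example.com-20070201-bbbb\n'
utf8_ids = parse_history_utf8(raw)

# utf-8 ids can be compared and used as dict keys without decoding.
positions = {rev_id: index for index, rev_id in enumerate(utf8_ids)}
```

The second form does strictly less work per id, which is where the
saving would come from when a history has many entries.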
I think there are a lot of places this could be used, and it would have
a positive performance impact across the board. I don't know if I'll
have time to do it all right now, as I'd like to work more on stuff like
dirstate.
But I think it would be worthwhile for someone to spend some time on it.
John
=:->