[RFC] Use utf-8 revision ids

Thu Feb 1 15:55:57 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 31 Jan 2007, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> This patch is not ready to be completely merged, but I think the
>> performance results show that it is worth being considered.
>>
>> Right now all of our natively generated bzr revision ids and file ids
>> are ascii only, because we explicitly strip out the other characters.
>> We've made the statement that they are Unicode, but we can pretty easily
>> change that to saying they must be utf-8. Or even ascii only (you would
>> have to encode your information somehow).
>>
>> The attached patch changes the Knit reading code so that it does not
>> decode revision ids, both when parsing the line deltas and fulltexts
>> and when parsing KnitIndex files.
> 
> This patch is ok with me.  It's a nice saving.
> 
> It does seem there may be some breakage if people have somehow generated
> non-ascii ids, and their default encoding is not utf-8 -- they may get
> an exception when trying to compare the str value to a unicode revision
> id they got from somewhere else.  But such data should be rare, and if
> we generally go towards treating these ids just as strs then it should
> be ok.
> 
> 

Well, I would like to update the XML Serializer to re-encode the
unicode strings back into utf-8 (it already does a get_cached_unicode
call, so this won't add any overhead; in fact it should decrease it
slightly).

And then I need to change the serializer to expect them to be utf-8 so
it doesn't try to encode them again.

And then audit the rest of the code base to make sure that we aren't
generating unicode strings anywhere. For example, update
gen_revision_id() to create a utf-8 string, and have functions like
commit(), which take a 'rev_id', do:

rev_id = safe_utf8(rev_id)

where safe_utf8 is sort of like "safe_unicode", and is basically:

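from warnings import warn

def safe_utf8(unicode_or_utf8, deprecated=True):
    # Accept either a unicode object or an already-encoded utf-8 str.
    if isinstance(unicode_or_utf8, unicode):
        if deprecated:
            warn('Unicode revision ids were deprecated in bzr-0.15, '
                 'use a utf-8 string instead')
        return unicode_or_utf8.encode('utf8')
    # Already a byte string; assume it is utf-8 and pass it through.
    return unicode_or_utf8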

Or something like that. And then things which are clearly user-facing
functions can be re-defined to include safe_utf8.
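
As a rough sketch of how a user-facing function could wrap its argument,
here is the same idea written for Python 3, where 'str' plays the role
of 'unicode' and 'bytes' the role of a utf-8 'str'. The commit() below
is a hypothetical stand-in, not the real bzrlib signature, and the
DeprecationWarning category is my assumption:

```python
import warnings


def safe_utf8(rev_id, deprecated=True):
    """Return rev_id as utf-8 bytes, encoding it first if it is text."""
    if isinstance(rev_id, str):
        if deprecated:
            warnings.warn('Unicode revision ids are deprecated, '
                          'use a utf-8 byte string instead',
                          DeprecationWarning, stacklevel=2)
        return rev_id.encode('utf-8')
    # Already bytes: assumed to be valid utf-8, passed through untouched.
    return rev_id


def commit(message, rev_id=None):
    # Hypothetical user-facing entry point: normalize the id once at the
    # boundary so everything below it only ever sees utf-8 bytes.
    if rev_id is not None:
        rev_id = safe_utf8(rev_id)
    return rev_id
```

The point of doing it at the boundary is that the internals never have
to check which kind of string they were handed.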

Also, this would change Branch.revision_history() so that it doesn't
decode utf-8, etc.
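
To illustrate what "doesn't decode" would mean there, a minimal sketch
in Python 3 terms. read_revision_history() is a made-up helper, and the
one-id-per-line layout is an assumption for the example:

```python
def read_revision_history(raw):
    """Split raw file content into revision ids, leaving them as bytes."""
    # raw is the undecoded content of the history file; no .decode('utf-8')
    # step is performed, so each id stays a utf-8 byte string.
    return raw.splitlines()


history = read_revision_history(b'joe@example.com-1\njoe@example.com-2\n')
# Lookups work as long as the caller also passes utf-8 bytes.
found = b'joe@example.com-2' in history
```

Callers that still hold unicode ids would need to encode them first
before comparing against this list.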

I think there are a lot of places this could be used, and it would have
a positive performance impact across the board. I don't know if I'll
have time to do it all right now, as I'd like to work more on stuff like
dirstate.

But I think it would be worthwhile for someone to spend some time on it.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFwg2NJdeBCYSNAAMRAqyoAJ9nR1lW2DISQhEFLWpMumXaUt7mRACcDvv/
xQy/xCi8k56zGwaDwOD1m0c=
=Yr4l
-----END PGP SIGNATURE-----