[merge] cache encoding

Mon Aug 14 15:04:41 BST 2006

Martin Pool wrote:
> On 12 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> 
>> Well, bzr itself has not been able to create anything but restricted
>> ascii revision ids and file ids for a while.
>>
>> I guess we could get non-ascii revision ids if the user's email
>> contained a non-ascii character. The file-id generator removes
>> everything that matches '[^\w.]' which I believe expands to something
>> like "a-zA-Z0-9_." (It should even remove '-'.)
> 
> It's fairly common to have nonascii characters in the 'real name' part
> of the email address but I don't think they're valid, or at least
> they're very rare, in the actual address itself, which is all we want.

Sure. But we don't check that, so we should update the email-address
parser to require it. (Or at least the email => revision_id code).
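
A minimal sketch of what such a check might look like, in modern Python.
The helper name address_for_revision_id is hypothetical and this is not
bzrlib's actual parser. It assumes the user id has the usual
"Real Name <address>" form and only the bare address is wanted:

```python
from email.utils import parseaddr


def address_for_revision_id(user_id):
    """Return the bare address from 'Real Name <addr>'.

    Hypothetical helper: raises ValueError if no address is found or if
    the address part contains non-ASCII characters. The real-name part is
    ignored, so non-ASCII characters there are allowed.
    """
    _real_name, address = parseaddr(user_id)
    if not address:
        raise ValueError("no email address found in %r" % (user_id,))
    try:
        address.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("email address %r is not ASCII" % (address,))
    return address
```

With this, a non-ASCII real name passes through untouched because it is
never used, while a non-ASCII address is rejected before it can reach the
revision id.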

There are also punycode domain names, which might be nicer as a real
unicode string internally.
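
To illustrate the two forms, here is a small sketch using Python's
built-in "idna" codec (which implements the older IDNA 2003 rules). The
function names and the example domain are made up for illustration:

```python
def domain_to_ascii(domain):
    """Convert a unicode domain name to its ASCII (punycode) form."""
    return domain.encode("idna").decode("ascii")


def domain_to_unicode(domain):
    """Convert an ASCII (punycode) domain name back to unicode."""
    return domain.encode("ascii").decode("idna")


ascii_form = domain_to_ascii("b\u00fccher.example")
print(ascii_form)                     # xn--bcher-kva.example
print(domain_to_unicode(ascii_form))  # the original unicode name
```

So the ASCII form could be what goes into a revision id, while the
unicode form is what a user would expect to see.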

> 
>> For revision ids we use:
>>         s = '%s-%s-' % (self._config.user_email(),
>>                         compact_date(self._timestamp))
>>         s += hexlify(rand_bytes(8))
>>
>> Obviously timestamp and hexlify won't generate non-ascii.
>>
>> The problem, though, is code like 'Tailor', et al. I know we are safe for
>> baz=>bzr, because Arch never supported anything other than ASCII (at
>> least not officially).
> 
> I don't know of any other system which does, so I don't see why Tailor
> would be trying to support them.
> 
> We can't be constrained to continue permitting everything which was just
> not prohibited before.

I'm thinking about supporting existing conversions. And bzr-svn might
have already created non-ascii revision ids. (Because they include the
branch path as part of the revision id).

> 
>> I realize that revision ids are mostly arbitrary. But people are more
>> and more assigning meaning to them. Mostly as part of the conversion
>> process.
> 
> Right, but if someone is converting from a source which has nonascii
> bits, they can always do the escaping themselves, in the code which
> specifically has to support it.

Sure. But the only Unicode => ascii escaping I know of is url escaping,
though maybe UTF-7 would fall under that category as well. The problem
with url escaping is that it has to be escaped again at the next layer.
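
A short sketch of the double-escaping problem, using Python's standard
urllib.parse. The id fragment "café/trunk" is invented for the example:

```python
from urllib.parse import quote, unquote


def escape_id(text):
    """URL-escape text so the result is pure ASCII with no '/' left."""
    return quote(text, safe="")


original = "caf\u00e9/trunk"

# First layer: the converter escapes the non-ASCII id.
once = escape_id(original)    # 'caf%C3%A9%2Ftrunk'

# Next layer: anything that puts the id into a URL must escape the '%'
# signs again, otherwise unquoting there would change the id itself.
twice = escape_id(once)       # 'caf%25C3%25A9%252Ftrunk'

# UTF-7 is another unicode => ASCII encoding; it round-trips as well.
utf7 = original.encode("utf-7")

print(once)
print(twice)
print(unquote(unquote(twice)) == original)
print(utf7.decode("utf-7") == original)
```

Each layer has to know exactly how many times the id has been escaped,
which is where this tends to go wrong.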

> 
>> I do believe the utf8=>unicode conversion is down in the noise right
>> now, but that doesn't mean it won't become a bigger deal as we do more work.
> 
> Making sure they get handled correctly is one more thing to get right,
> and one more thing that can take time.  Unless we really need it, and I
> don't think we do, then we might as well restrict it.
> 
> So I would say, just require them to be ascii.  If we can agree on that,
> we should put it in developer documentation somewhere (bzrlib/doc/api or
> HACKING.)
> 

At this point, I don't think we can do filesystem-safe ascii (no :, /,
", <, >, etc.), and I think it would be overly restrictive to require
them to be lowercase. (Especially thinking of SVN conversions, which
then have to escape all sorts of things.)

I would be a little happier just having them be utf-8. But I'm okay with
them being ASCII. I'm probably +0.25 on it, though. Being unicode gives
us flexibility, but it might be reasonable for a while to assert that
they are ascii.
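
As a rough sketch of what asserting ASCII could look like, here is the
revision id scheme quoted earlier rewritten as standalone modern Python,
plus a check. This is not bzrlib code: the function names are
hypothetical, os.urandom stands in for rand_bytes, and the compact_date
format shown is an assumption:

```python
import binascii
import os
import time


def compact_date(when):
    """Format a timestamp compactly (assumed format: YYYYMMDDHHMMSS, UTC)."""
    return time.strftime("%Y%m%d%H%M%S", time.gmtime(when))


def gen_revision_id(email, timestamp):
    """Build 'email-date-randomhex', following the quoted snippet."""
    s = "%s-%s-" % (email, compact_date(timestamp))
    s += binascii.hexlify(os.urandom(8)).decode("ascii")
    return s


def check_ascii_id(some_id):
    """Return some_id unchanged if it is pure ASCII, else raise ValueError."""
    try:
        some_id.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("id %r is not ASCII" % (some_id,))
    return some_id


rev_id = check_ascii_id(gen_revision_id("john@example.com", 0))
print(rev_id)
```

The date and random parts are always ASCII, so the only way the check
can fail for ids generated this way is through the email part. Ids
coming from converters would need the same check at the point where they
are accepted.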

John
=:->
