newformat format change

John A Meinel john at arbash-meinel.com
Mon Oct 17 03:45:53 BST 2005


Martin Pool wrote:
> On 07/10/05, John A Meinel <john at arbash-meinel.com> wrote:
>
>>Martin Pool wrote:
>>
>>>On 05/10/05, John A Meinel <john at arbash-meinel.com> wrote:
>>>
>>>
>>>
>>>>There is one problem with the current trapping (where you use a regular
>>>>expression to strip everything that isn't a word character):
>>>>name = re.sub(r'[^\w.]', '', name)
>>>>
>>>>I believe this will catch newlines and tabs, but it also seems to catch
>>>>too much in the way of international characters.
>>>>
>>>>Back when I was testing with Arabic characters, it was essentially
>>>>generating file-ids with just the last portion (no filename part).
>>>>Now, maybe you feel that your unique identifier is sufficient (it could be).
>>>
>>>
>>>I thought it would consider all unicode word characters to match \w.
>>>We should fix that; it seems reasonable to treat the id as unicode
>>>since filenames can be.
>>>
>>
>>Well, I just checked with this bit of code:
>>
>> >>> import re
>> >>> x = u'Juju <\u062c\u0648\u062c\u0648>'
>> >>> r = re.sub(u'[^\\w.]', '', x)
>> >>> r
>>u'Juju'
>
>
> I was thinking about this a bit more, and it seemed to me that there
> are some advantages to making file and revision ids always be ascii:
> we can then safely pass them over http and similar transports without
> worrying about encoding or escaping.  I think if we had a file in
> the repository whose name contained those characters then a fair
> fraction of users would have trouble using it for reasons beyond our
> control.

Well, I'm not sure about revision ids, but for file ids, if the id is
unicode, then it is associated with a file which has a unicode path,
so you have to worry about the escaping anyway.

>
> Another way to say this is that escaping ought to be performed once at
> the time the ID is assigned, by e.g. changing non-ascii characters to
> their '\u' forms.  But then using backslash may cause other troubles;
> perhaps we should use _.

I think as long as you use a deterministic mapping you would be okay. If
you decide to use _, then it needs to be disallowed otherwise in the id,
so that it doesn't accidentally get unescaped.
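
To make the "deterministic mapping" point concrete, here is a minimal
sketch, not anything that exists in bzr. The names escape_id and
unescape_id, and the fixed-width "_" plus six hex digits form, are all
assumptions made up for illustration. The key detail is that "_" itself
is escaped, so a literal underscore can never be mistaken for the start
of an escape sequence:

```python
import re

ESCAPE = "_"  # hypothetical choice of escape character


def escape_id(raw):
    """Map an arbitrary unicode string to a pure-ASCII id, reversibly."""
    out = []
    for ch in raw:
        code = ord(ch)
        if ch == ESCAPE or code > 0x7F:
            # Escape the escape character itself and every non-ASCII
            # character as "_" followed by exactly six hex digits.
            out.append("%s%06x" % (ESCAPE, code))
        else:
            out.append(ch)
    return "".join(out)


# Every "_" in an escaped id starts a six-digit escape, so this pattern
# cannot match a literal underscore from the original string.
_ESCAPED = re.compile(re.escape(ESCAPE) + r"([0-9a-f]{6})")


def unescape_id(escaped):
    """Invert escape_id."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), escaped)
```

With this scheme an underscore in an email address survives the round
trip, at the cost of becoming "_00005f" inside the id.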

And for revision ids, there are a lot of valid characters in an email
address (underscore being one of them).

I would tend to just leave characters the way they are, and just work
out how to transport them.
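
As a rough sketch of what I mean (written for Python 3, where \w on a
str pattern already matches unicode word characters; in Python 2 the
re.UNICODE flag is what changes this). The function names sanitize_name,
to_wire and from_wire are hypothetical, and percent-encoding is just one
possible transport encoding, not something we have decided on:

```python
import re
from urllib.parse import quote, unquote


def sanitize_name(name):
    # Strip everything that is not a word character or a dot. On a str
    # pattern in Python 3, \w includes non-ASCII letters such as Arabic.
    return re.sub(r"[^\w.]", "", name)


def to_wire(file_id):
    # Percent-encode the UTF-8 bytes only at the transport boundary.
    # safe="" means "/" is encoded as well.
    return quote(file_id, safe="")


def from_wire(wire):
    # Decode the percent-encoded form back to the original unicode id.
    return unquote(wire)
```

The id itself stays unicode everywhere except on the wire, so there is
only one place that has to know about the encoding.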

John
=:->

>
> --
> Martin
>
>
