newformat format change

Mon Oct 17 03:25:01 BST 2005

On 07/10/05, John A Meinel <john at arbash-meinel.com> wrote:
> Martin Pool wrote:
> > On 05/10/05, John A Meinel <john at arbash-meinel.com> wrote:
> >
> >
> >>One problem with the current trapping (where you use a regular
> >>expression to strip everything that isn't a word character) is this:
> >>
> >>    name = re.sub(r'[^\w.]', '', name)
> >>
> >>I believe that will catch newlines and tabs, but it also seems to catch
> >>too much in the way of international characters.
> >>
> >>Back when I was testing with Arabic characters, it was essentially
> >>generating file-ids with just the last portion (no filename part).
> >>Now, maybe you feel that your unique identifier is sufficient (it could be).
> >
> >
> > I thought it would consider all unicode word characters to match \w.
> > We should fix that; it seems reasonable to treat the id as unicode
> > since filenames can be.
> >
>
> Well, I just checked with this bit of code:
>
>  >>> import re
>  >>> x = u'Juju <\u062c\u0648\u062c\u0648>'
>  >>> r = re.sub(u'[^\\w.]', '', x)
>  >>> r
>  u'Juju'
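
For comparison, here is a minimal sketch of the same substitution with an
ASCII-only and a Unicode-aware \w. It assumes a current Python 3
interpreter, where str patterns are Unicode-aware by default and re.ASCII
restores the behaviour shown in the session above; in Python 2 the
Unicode-aware behaviour has to be requested explicitly with re.UNICODE.
The variable names are illustrative only.

```python
import re

# Same test string as in the quoted session: ASCII name plus Arabic letters.
name = 'Juju <\u062c\u0648\u062c\u0648>'

# ASCII-only \w: the Arabic letters are stripped along with ' ', '<' and '>'.
ascii_only = re.sub(r'[^\w.]', '', name, flags=re.ASCII)

# Unicode-aware \w: the Arabic letters count as word characters and survive.
unicode_aware = re.sub(r'[^\w.]', '', name, flags=re.UNICODE)

print(ascii(ascii_only))      # 'Juju'
print(ascii(unicode_aware))   # 'Juju\u062c\u0648\u062c\u0648'
```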

I was thinking about this a bit more, and it seems to me that there
are some advantages to making file and revision ids always be ASCII:
we can then safely pass them over HTTP and similar transports without
worrying about encoding or escaping. I also think that if we had a file
in the repository whose name contained such non-ASCII characters, a fair
fraction of users would have trouble using it for reasons beyond our
control.

Another way to say this is that escaping ought to be performed once, at
the time the id is assigned, e.g. by changing non-ASCII characters to
their '\u' forms. But then using a backslash may cause other trouble;
perhaps we should use '_' instead.
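
A rough sketch of that idea, under some assumptions: escape_id_part is a
hypothetical name, not an existing function; '_' is the escape character;
a literal '_' is doubled so the result stays unambiguous; and characters
outside the Basic Multilingual Plane get a longer '_U' form. Other
unsafe ASCII characters (spaces, '<', '>') are left alone here and would
still need the existing stripping.

```python
def escape_id_part(name):
    """Return an ASCII-only form of name, escaped once at assignment time.

    Hypothetical helper: non-ASCII characters become _uXXXX (or _UXXXXXXXX
    above U+FFFF), and a literal '_' becomes '__'.
    """
    out = []
    for ch in name:
        code = ord(ch)
        if ch == '_':
            out.append('__')              # keep literal underscores unambiguous
        elif code < 128:
            out.append(ch)                # plain ASCII passes through unchanged
        elif code <= 0xFFFF:
            out.append('_u%04x' % code)   # BMP character, four hex digits
        else:
            out.append('_U%08x' % code)   # astral character, eight hex digits
    return ''.join(out)


print(escape_id_part('Juju\u062c\u0648\u062c\u0648'))  # Juju_u062c_u0648_u062c_u0648
print(escape_id_part('my_file.txt'))                   # my__file.txt
```

Because the escaping happens only when the id is created, everything that
later handles the id can treat it as an opaque ASCII string.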

--
Martin