[merge] cache encoding

Tue Aug 15 05:49:33 BST 2006

On 14 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:

> Sure. But we don't check that, so we should update the email-address
> parser to require it. (Or at least the email => revision_id code).
> 
> There are also punycode domain names, which might be nicer as a real
> unicode string internally.

I can see presenting them to the user as unicode, but I don't see why
we'd want them internally as unicode.  They're just ids; the computer
doesn't care.
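
As a rough illustration of keeping the ASCII form internally and converting
only for display, here is a minimal sketch using Python's built-in "idna"
codec. The function names to_internal and to_display are hypothetical, not
anything in bzr, and the codec's handling of unusual domains is not
explored here.

```python
def to_internal(domain):
    """Return the ASCII (punycode) form of a domain, as a str.

    Hypothetical helper: this is the form that would be stored in ids.
    """
    return domain.encode("idna").decode("ascii")


def to_display(ascii_domain):
    """Return the unicode form of an ASCII (punycode) domain for the user."""
    return ascii_domain.encode("ascii").decode("idna")


internal = to_internal("bücher.example")
shown = to_display(internal)
```

Only to_display ever produces non-ASCII text, so everything that compares or
stores ids can stay ASCII-only.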

> > Right, but if someone is converting from a source which has nonascii
> > bits, they can always do the escaping themselves, in the code which
> > specifically has to support it.
> 
> Sure. But the only Unicode => ascii escaping I know of is urlescaping,
> though maybe UTF-7 would fall under that category as well. The problem
> with url escaping is that it has to be escaped again at the next layer.

I'd suggest something like '_%04x' applied to the code point of each
non-ascii character: it is concise and does not need double escaping.
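
A minimal sketch of what that could look like, under some assumptions of
mine: the names escape_id and unescape_id are hypothetical, '_' itself is
also escaped so that the mapping can be reversed, and code points above
U+FFFF are rejected because '%04x' would no longer be fixed-width for them.

```python
import re
from urllib.parse import quote


def escape_id(text):
    """Replace non-ASCII characters (and '_' itself) with '_%04x' escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Assumption: four hex digits only cover U+0000..U+FFFF.
            raise ValueError("code point above U+FFFF is not handled here")
        if cp < 128 and ch != "_":
            out.append(ch)
        else:
            out.append("_%04x" % cp)
    return "".join(out)


def unescape_id(escaped):
    """Reverse escape_id by decoding every '_' plus four hex digits."""
    return re.sub(
        r"_([0-9a-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        escaped,
    )


escaped = escape_id("jörg")
# The escape uses only '_', digits and letters, which URL quoting leaves
# alone, so the next layer does not have to escape it again.
same_after_quoting = quote(escaped) == escaped
```

Because every literal '_' is escaped too, any '_' in the output always
starts an escape, which is what makes unescape_id unambiguous.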

> At this point, I don't think we can do filesystem-safe ascii. (no :, /,
> ", <, >, etc), and I think it would be overly restrictive to require
> them to be lowercase. (Especially thinking of SVN conversions, which
> then have to escape all sorts of things).
> 
> I would be a little happier just having them be utf-8. But I'm okay with
> them being ASCII. I'm probably +0.25 on it, though. Being unicode gives
> us flexibility, but it might be reasonable for a while to assert that
> they are ascii.

I would be a little happier only supporting ascii, just to be
conservative and avoid trouble later on.  But I won't insist on it.

-- 
Martin