newformat format change

Mon Oct 17 04:01:00 BST 2005

On 17/10/05, John A Meinel <john at arbash-meinel.com> wrote:
> Martin Pool wrote:

> Well, I'm not sure about revision ids, but for file ids, if the id is
> unicode, then you are associated with a file which has a unicode path,
> so you have to worry about the escaping anyway.

Well, we don't currently access working copies over http, only history
files.  And we have no choice about the working copy files, but we do
have a choice about the names of the history files.

> I think as long as you use a deterministic mapping you would be okay. If
> you decide to use _, then it needs to be disallowed otherwise in the id,
> so that it doesn't accidentally get unescaped.
>
> And for revision ids, there are a lot of valid characters in an email
> address (underscore being one of them).

I'd suggest we map underscore to _5f.  Or we could use %, as a
character that occurs less often in filenames or email addresses. 
This would then be doubly escaped over http - so a file called % would
be stored in %%.weave, and requested as http://......./%%%%.weave
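
A minimal sketch of such a deterministic mapping, in Python.  The
function name escape_id, the SAFE set, and the choice of escaping each
UTF-8 byte as the escape character plus two hex digits are all
assumptions made for illustration; none of this is existing bzr code.
Under that hex assumption, a file called % comes out as %25.weave on
disk and %2525.weave in the URL, which is one concrete spelling of the
double escaping described above.

```python
from urllib.parse import quote

# Hypothetical safe set: characters assumed harmless in both filenames
# and URLs.  Neither candidate escape character is in it.
SAFE = frozenset(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-."
)


def escape_id(ident, esc="_"):
    """Map an id to an ASCII-only name, deterministically.

    The escape character itself is always escaped, so two different
    ids can never produce the same name.
    """
    out = []
    for ch in ident:
        if ch in SAFE and ch != esc:
            out.append(ch)
        else:
            # Escape every UTF-8 byte as esc followed by two hex digits.
            out.extend("%s%02x" % (esc, b) for b in ch.encode("utf-8"))
    return "".join(out)


# With "_" as the escape character the name is already URL-safe.
underscore_name = escape_id("a_b") + ".weave"
underscore_url = quote(underscore_name)

# With "%" as the escape character the name is escaped again over http.
percent_name = escape_id("%", esc="%") + ".weave"
percent_url = quote(percent_name)
```

With _ as the escape character the stored name and the requested URL
path are identical; with % the URL layer escapes the name a second
time.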

Once file or revision ids are generated we never unescape them (or at
least we shouldn't).  Keeping them readable is just a convenience for
debugging.

> I would tend to just leave characters the way they are, and just work
> out how to transport them.

Suppose someone has a branch on their ISP's webserver (where they
cannot reconfigure it).  If we send it a request for a unicode url it
may try to interpret it as 8859-1, utf-8, ascii, or anything else. 
Now in theory we can fix this by teaching the client what encoding to
use for urls, or telling people how to reconfigure their servers, but
it seems much better to just avoid it.
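
To illustrate the ambiguity, here is a small Python sketch using the
standard library's urllib.parse.quote.  The file name is made up for
the example.  The same non-ASCII name turns into different request
paths depending on which encoding the client assumes, while an
ASCII-only name is the same either way.

```python
from urllib.parse import quote

# Hypothetical history file whose name contains a non-ASCII character.
name = "caf\u00e9.weave"

# The client has to pick an encoding before percent-escaping the path.
as_utf8 = quote(name, encoding="utf-8")
as_latin1 = quote(name, encoding="latin-1")

# An ASCII-only name needs no such choice.
ascii_name = "cafe.weave"
ascii_utf8 = quote(ascii_name, encoding="utf-8")
ascii_latin1 = quote(ascii_name, encoding="latin-1")
```

If the server decodes with the other encoding, it looks for a file that
does not exist.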

--
Martin