New format checklist

Martin Pool mbp at sourcefrog.net
Tue Jan 10 01:25:59 GMT 2006


On Tue, 2006-01-03 at 10:28 +1100, Robert Collins wrote:

> > 3) Encoding revision & file ids
> >    The final decision seemed to be that we should allow most unicode
> >    characters in revision and file ids. Which means we need a mapping
> >    between ids and valid filesystem names.

Frankly I would still prefer we simply constrain the characters that can
be used in file-ids to something believed safe in relevant contexts, and
so avoid a whole encoding/decoding stage.  Are other people still
strongly opposed?

As I recall the arguments are:

 * We might guess wrong, and allow a character which turns out not to be
   permitted in some relevant context, so we need to add escaping after
   all.

 * There is existing data from baz2bzr that uses ':' in file-ids.
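
As a rough sketch of the constrained approach, validation could look
something like the following.  The safe set here (ASCII letters, digits
and a few punctuation characters) is purely an assumption for
illustration, and SAFE_ID_RE and is_safe_file_id are hypothetical names,
not existing code.

```python
import re

# Hypothetical safe set, chosen only for illustration; the real set
# would have to be agreed on.
SAFE_ID_RE = re.compile(r'[A-Za-z0-9_.@-]+')


def is_safe_file_id(file_id):
    """Return True if file_id uses only characters from the assumed safe set."""
    return SAFE_ID_RE.fullmatch(file_id) is not None
```

Under this assumed set, an id containing ':' is rejected, which is
exactly the problem the existing baz2bzr data raises.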

> >    I'm not sure how to map unicode into filenames. I know there is
> >    urlencoding which should handle a lot of the bad characters. But
> >    does it map from unicode? A lot of filesystems have their own
> >    encoding (on windows it is UTF-16/mbcs, a lot of other systems use
> >    utf-8). Do we want to do something like:
> >       path = urlencode(revision_id.encode('utf-8'))

> To get from unicode to URL formats, a URL producer SHOULD do:
> urlencode(unicode.encode('utf8'))

I'll just stress that although we might use %-escaping as for URLs, this
is a different operation.  %-escaping in a URL invites the web server
to decode the escapes into bytes, and then to map those bytes to the
filesystem layer.  But I don't think there's any way to tell what
encoding the server expects for those bytes.  So while we may %-escape
the utf-8 form, the server might decode it as 8859-1.  The server is not
even necessarily going to use its filesystem's encoding.
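
To make the ambiguity concrete, here is a small sketch using
urllib.parse from current Python (the successor to the urllib.quote
mentioned in this thread).  The id 'café' is just an example value; the
point is that the same escaped bytes decode to different strings
depending on which encoding the receiving side assumes.

```python
from urllib.parse import quote, unquote_to_bytes

revision_id = 'caf\u00e9'                      # example id: 'café'
escaped = quote(revision_id.encode('utf-8'))   # %-escape the utf-8 bytes
raw = unquote_to_bytes(escaped)                # bytes after unescaping

as_utf8 = raw.decode('utf-8')         # round-trips to the original id
as_latin1 = raw.decode('iso-8859-1')  # a different, 5-character string
```

Nothing in the escaped form tells the receiver which of the two
decodings was intended.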

If this is done, it's probably a storage format thing: the store
specifies that an id is actually written to the transport escaped in a
particular way.

Note also that urllib.quote doesn't escape all of the characters which
need to be escaped for the Windows filesystem - so it'd need to be
something different anyhow.
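
A quick way to check this, sketched with urllib.parse.quote from current
Python: list the characters Windows reserves in file names and see which
ones the default quoting leaves alone.  escape_id is a hypothetical
store-level helper, shown only as one possible "something different";
it is an assumption, not an existing function.

```python
from urllib.parse import quote

# Characters Windows does not allow in a file name component.
WINDOWS_RESERVED = '<>:"/\\|?*'

# Reserved characters that quote() leaves untouched with its defaults.
unescaped = [c for c in WINDOWS_RESERVED if quote(c) == c]


def escape_id(file_id):
    """Hypothetical escaping: %-escape every byte that is not always safe."""
    return quote(file_id.encode('utf-8'), safe='')
```

With the defaults, '/' passes through unescaped; passing safe='' closes
that gap.  Per-character escaping still does nothing about names that
differ only in case, which collide on a case-insensitive filesystem.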

If we did use %-escaping then those characters would need to be
doubly-escaped when sent over http.  So a Unicode character can expand
to 3 utf-8 bytes, each of which is quoted as 3 bytes, like '%ab'.  Then
to send that over http requires the % characters to be quoted again,
expanding each '%' to the 3 bytes '%25'.  So each unicode character can
expand to 15 bytes, which is faintly ridiculous.
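
The arithmetic can be checked with a short sketch, again using
urllib.parse.quote from current Python.  The euro sign is just an
example of a character that takes 3 bytes in utf-8.

```python
from urllib.parse import quote

char = '\u20ac'                      # EURO SIGN, 3 bytes in utf-8
once = quote(char.encode('utf-8'))   # escaped for storage
twice = quote(once)                  # escaped again for the URL

sizes = (len(char.encode('utf-8')), len(once), len(twice))
```

Here once is '%E2%82%AC' and twice is '%25E2%2582%25AC', so sizes comes
out as (3, 9, 15).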

> > 4) Refactoring of metadata. We would like to split things under .bzr
> >    into:
> > 	.bzr/checkout/
> > 	.bzr/branch/
> > 	.bzr/repository/
> >    Which will help ease the transition when we start separating where
> >    these objects reside.

That sounds good.

> 
> I'd like us to get into the habit of small, frequent format changes
> rather than big painful experiences. I figure the first one will be
> traumatic *code wise* as we put in place the needed facilities to select
> formats for new branches, convert remote branches etc. And for that
> reason I think the first format change should be a No-op change - that
> is, a format number bump with no actual changes to the disk format - to
> let us get the infrastructure right.

And, for example, test that we can create and read both formats from
within the code and the test suite.

-- 
Martin