New format checklist
Martin Pool
mbp at sourcefrog.net
Tue Jan 10 01:25:59 GMT 2006
On Tue, 2006-01-03 at 10:28 +1100, Robert Collins wrote:
> > 3) Encoding revision & file ids
> > The final decision seemed to be that we should allow most unicode
> > characters in revision and file ids. Which means we need a mapping
> > between ids and valid filesystem names.
Frankly I would still prefer we simply constrain the characters that can
be used in file-ids to something believed safe in relevant contexts, and
so avoid a whole encoding/decoding stage. Are other people still
strongly opposed?
As I recall the arguments are:
* we might guess wrong, and allow a character which turns out not to be
permitted in some relevant context, so we need to add escaping after all
* There is existing data from baz2bzr that uses ':' in file-ids.
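To make the "constrain the characters" option concrete, here is a minimal sketch of such a check. The allowed alphabet and the names SAFE_ID_RE and is_safe_id are assumptions made up for illustration, not anything bzr defines.

```python
import re

# Hypothetical conservative alphabet: ASCII letters, digits and a few
# punctuation characters. This is an assumption, not bzr's actual rule.
SAFE_ID_RE = re.compile(r'[A-Za-z0-9._@+-]+')


def is_safe_id(file_id):
    """Return True if file_id is non-empty and uses only the assumed-safe set."""
    return SAFE_ID_RE.fullmatch(file_id) is not None
```

Under this particular alphabet an id containing ':' is rejected, which is exactly the baz2bzr objection above.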
> > I'm not sure how to map unicode into filenames. I know there is
> > urlencoding which should handle a lot of the bad characters. But
> > does it map from unicode? A lot of filesystems have their own
> > encoding (on windows it is UTF-16/mbcs, a lot of other systems use
> > utf-8). Do we want to do something like:
> > path = urlencode(revision_id.encode('utf-8'))
> To get from unicode to URL formats, a URL producer SHOULD do:
> urlencode(unicode.encode('utf8'))
I'll just stress that although we might use %-escaping as for URLs, this
is a different operation. %-escaping in the URL invites the web server
to decode them into bytes, then to map those bytes to the filesystem
layer. But I don't think there's any way to tell what encoding the
server expects for those characters. So while we may %-escape the utf-8
form, the server might decode it as 8859-1. The server is not even
necessarily going to use its filesystem's encoding.
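A small sketch of that mismatch, using Python 3's urllib.parse (the modern home of urllib.quote). The revision id is a made-up example, and the two decodes stand in for two servers that assume different encodings.

```python
from urllib.parse import quote, unquote

revision_id = 'caf\u00e9'  # hypothetical id with one non-ASCII character

# %-escape the utf-8 bytes of the id.
escaped = quote(revision_id.encode('utf-8'))

# A server assuming utf-8 recovers the original id...
as_utf8 = unquote(escaped, encoding='utf-8')
# ...while one assuming ISO 8859-1 sees two different characters.
as_latin1 = unquote(escaped, encoding='iso-8859-1')
```

Both decodes succeed without error, so nothing signals that the second server is looking for a different name.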
If this is done, it's probably a storage format thing: the store
specifies that an id is actually written to the transport escaped in a
particular way.
Note also that urllib.quote doesn't escape all of the characters which
need to be escaped for the Windows filesystem - so it'd need to be
something different anyhow.
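As a rough illustration, the sketch below checks which Windows-reserved file name characters survive quoting with the default arguments. It uses Python 3's urllib.parse.quote; the helper name and the reserved-character list are mine, written for this example.

```python
from urllib.parse import quote

# Characters that may not appear in a Windows file name.
WINDOWS_RESERVED = '<>:"/\\|?*'


def unescaped_reserved(text, safe='/'):
    """Return the reserved characters still present after quote()."""
    escaped = quote(text, safe=safe)
    return [c for c in WINDOWS_RESERVED if c in escaped]


# With the default safe='/' the slash is passed through untouched.
default_leftovers = unescaped_reserved(WINDOWS_RESERVED)
# Passing safe='' escapes it as well.
strict_leftovers = unescaped_reserved(WINDOWS_RESERVED, safe='')
```

Even with safe='', quoting does nothing about ids that differ only in case, which would collide on a case-insensitive filesystem.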
If we did use %-escaping then those characters would need to be
doubly-escaped when sent over http. A Unicode character can expand
to 3 utf-8 bytes, each of which becomes 3 quoted bytes like '%ab'. To
send that over http, each '%' must be quoted again as '%25', turning
every '%ab' into the 5 bytes '%25ab'. So each unicode character can
expand to 15 bytes, which is faintly ridiculous.
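The arithmetic can be checked with a short sketch using Python 3's urllib.parse.quote. The euro sign is just an arbitrary character that takes 3 bytes in utf-8.

```python
from urllib.parse import quote

char = '\u20ac'              # EURO SIGN, 3 bytes in utf-8
utf8 = char.encode('utf-8')  # the raw utf-8 bytes
once = quote(utf8)           # escaped for storage
twice = quote(once)          # escaped again for http

sizes = (len(utf8), len(once), len(twice))
```

Here sizes comes out as (3, 9, 15): one character, fifteen bytes on the wire.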
> > 4) Refactoring of metadata. We would like to split things under .bzr
> > into:
> > .bzr/checkout/
> > .bzr/branch/
> > .bzr/repository/
> > Which will help ease the transition when we start separating where
> > these objects reside.
That sounds good.
>
> I'd like us to get into the habit of small, frequent format changes
> rather than big painful experiences. I figure the first one will be
> traumatic *code wise* as we put in place the needed facilities to select
> formats for new branches, convert remote branches etc. And for that
> reason I think the first format change should be a No-op change - that
> is, a format number bump with no actual changes to the disk format - to
> let us get the infrastructure right.
And, for example, test that we can create and read both formats from
within the code and the test suite.
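A toy sketch of what a no-op format bump plus such a test might look like. The Format class, the marker strings and open_any are all invented for illustration and do not reflect bzr's real format handling.

```python
# Hypothetical format registry: names and marker strings are invented
# for illustration and are not bzr's real ones.
class Format:
    def __init__(self, marker):
        self.marker = marker

    def write(self, data):
        # The first line records the format, the rest is the payload.
        return self.marker + '\n' + data

    def read(self, text):
        marker, _, data = text.partition('\n')
        if marker != self.marker:
            raise ValueError('unexpected format: %r' % marker)
        return data


# Two formats with identical behaviour: the second is a pure number bump.
FORMATS = {f.marker: f for f in (Format('example format 1'),
                                 Format('example format 2'))}


def open_any(text):
    """Pick the reader from the marker on the first line."""
    marker = text.partition('\n')[0]
    return FORMATS[marker].read(text)


# Round-trip a payload through every registered format.
results = [open_any(fmt.write('payload')) for fmt in FORMATS.values()]
```

Since the two formats differ only in their marker, any failure in a test like this would point at the selection machinery rather than at the disk layout.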
--
Martin