New format checklist

John A Meinel john at arbash-meinel.com
Tue Jan 10 02:03:42 GMT 2006


Martin Pool wrote:
> On Tue, 2006-01-03 at 10:28 +1100, Robert Collins wrote:
> 
>>> 3) Encoding revision & file ids
>>>    The final decision seemed to be that we should allow most unicode
>>>    characters in revision and file ids. Which means we need a mapping
>>>    between ids and valid filesystem names.
> 
> Frankly I would still prefer we simply constrain the characters that can
> be used in file-ids to something believed safe in relevant contexts, and
> so avoid a whole encoding/decoding stage.  Are other people still
> strongly opposed?
> 
> As I recall the arguments are:
> 
>  * we might guess wrong, and allow a character which turns out not to be
> permitted in some relevant context, so we need to add escaping after all
> 
>  * There is existing data from baz2bzr that uses ':' in file-ids.
> 
>>>    I'm not sure how to map unicode into filenames. I know there is
>>>    urlencoding which should handle a lot of the bad characters. But
>>>    does it map from unicode? A lot of filesystems have their own
>>>    encoding (on windows it is UTF-16/mbcs, a lot of other systems use
>>>    utf-8). Do we want to do something like:
>>>       path = urlencode(revision_id.encode('utf-8'))
> 
>> To get from unicode to URL formats, a URL producer SHOULD do:
>> urlencode(unicode.encode('utf8'))
> 
> I'll just stress that although we might use %-escaping as for URLs, this
> is a different operation.  %-escaping in the URL invites the web server
> to decode them into bytes, then to map those bytes to the filesystem
> layer.  But I don't think there's any way to tell what encoding the
> server expects for those characters.  So while we may %-escape the utf-8
> form, the server might decode it as 8859-1.  The server is not even
> necessarily going to use its filesystem's encoding.
> 
> If this is done, it's probably a storage format thing: the store
> specifies that an id is actually written to the transport escaped in a
> particular way.
> 
> Note also that urllib.quote doesn't escape all of the characters which
> need to be escaped for the Windows filesystem - so it'd need to be
> something different anyhow.
> 
> If we did use %-escaping then those characters will need to be
> doubly-escaped when sent over http.  So a Unicode character can expand
> to 3 utf-8 bytes, each of which is 3 quoted bytes '%ab'.  Then to send
> that over http requires the % characters to be quoted again, expanding
> each to 3 bytes.  So each unicode character can expand to 15 bytes,
> which is faintly ridiculous.

Well, actually, utf-8 can go beyond 3 bytes per character: RFC 3629
allows up to 4, and the original definition allowed up to 6. So you could
theoretically have one unicode character expand to 20 characters (30
under the old definition).
I wouldn't mind a different escaping, simply because that is really ugly.
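
The expansion is easy to check with a short sketch. This assumes Python 3,
where urllib.parse.quote stands in for the urllib.quote mentioned above; the
helper names escape_id and escape_for_http are made up for illustration and
are not part of bzr.

```python
from urllib.parse import quote


def escape_id(identifier):
    """%-escape the utf-8 bytes of an id (hypothetical helper).

    safe='' is passed because quote() leaves '/' unescaped by default.
    """
    return quote(identifier.encode('utf-8'), safe='')


def escape_for_http(escaped):
    """Quote an already-escaped id again, so each '%' becomes '%25'."""
    return quote(escaped, safe='')


# One ASCII letter, one 3-byte character, one 4-byte character.
for ch in ('a', '\u20ac', '\U0001F600'):
    stored = escape_id(ch)
    sent = escape_for_http(stored)
    print(repr(ch), len(ch.encode('utf-8')), stored, sent, len(sent))
```

The euro sign (3 utf-8 bytes) is stored as '%E2%82%AC' and sent as
'%25E2%2582%25AC', which is the 15 characters Martin counted; every utf-8
byte costs 5 characters once it has been quoted twice.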

But I agree with Aaron, that for some systems it can be nice to give
real meaning to the revision identifiers.

> 
>>> 4) Refactoring of metadata. We would like to split things under .bzr
>>>    into:
>>> 	.bzr/checkout/
>>> 	.bzr/branch/
>>> 	.bzr/repository/
>>>    Which will help ease the transition when we start separating where
>>>    these objects reside.
> 
> That sounds good.
> 
>> I'd like us to get into the habit of small, frequent format changes
>> rather than big painful experiences. I figure the first one will be
>> traumatic *code wise* as we put in place the needed facilities to select
>> formats for new branches, convert remote branches etc. And for that
>> reason I think the first format change should be a No-op change - that
>> is, a format number bump with no actual changes to the disk format - to
>> let us get the infrastructure right.
> 
> And, for example, test that we can create and read both formats from
> within the code and the test suite.
> 

Are you wanting to have more 'adapt' patterns, to provide multiple
branch formats?
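
To make the question concrete, here is a rough sketch of what a no-op format
bump could look like. Everything in it is hypothetical: the marker strings,
the registry, and the class and function names are invented for illustration
and do not describe how bzrlib actually does it.

```python
# Hypothetical sketch: none of these names come from bzrlib.
FORMAT_REGISTRY = {}


def register_format(marker):
    """Class decorator that maps a format marker string to a format class."""
    def decorator(cls):
        FORMAT_REGISTRY[marker] = cls
        return cls
    return decorator


@register_format("hypothetical branch format 5\n")
class FormatFive:
    def read_revision_history(self, text):
        # One revision id per line.
        return text.splitlines()


@register_format("hypothetical branch format 6\n")
class FormatSix(FormatFive):
    """No-op bump: a new marker with an identical on-disk layout."""


def open_format(marker):
    """Look up the format object for the marker read from disk."""
    try:
        return FORMAT_REGISTRY[marker]()
    except KeyError:
        raise ValueError("unknown branch format: %r" % marker) from None
```

With something like this, the test suite could open the same data under both
markers and check that the results agree, which exercises the selection
machinery before any real layout change lands.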

John
=:->

