New format checklist
John A Meinel
john at arbash-meinel.com
Tue Jan 10 02:03:42 GMT 2006
Martin Pool wrote:
> On Tue, 2006-01-03 at 10:28 +1100, Robert Collins wrote:
>
>>> 3) Encoding revision & file ids
>>> The final decision seemed to be that we should allow most unicode
>>> characters in revision and file ids. Which means we need a mapping
>>> between ids and valid filesystem names.
>
> Frankly I would still prefer we simply constrain the characters that can
> be used in file-ids to something believed safe in relevant contexts, and
> so avoid a whole encoding/decoding stage. Are other people still
> strongly opposed?
>
> As I recall the arguments are:
>
> * we might guess wrong, and allow a character which turns out not to be
> permitted in some relevant context, so we need to add escaping after all
>
> * There is existing data from baz2bzr that uses ':' in file-ids.
>
>>> I'm not sure how to map unicode into filenames. I know there is
>>> urlencoding which should handle a lot of the bad characters. But
>>> does it map from unicode? A lot of filesystems have their own
>>> encoding (on windows it is UTF-16/mbcs, a lot of other systems use
>>> utf-8). Do we want to do something like:
>>> path = urlencode(revision_id.encode('utf-8'))
>
>> To get from unicode to URL formats, a URL producer SHOULD do:
>> urlencode(unicode.encode('utf8'))
>
> I'll just stress that although we might use %-escaping as for URLs, this
> is a different operation. %-escaping in the URL invites the web server
> to decode them into bytes, then to map those bytes to the filesystem
> layer. But I don't think there's any way to tell what encoding the
> server expects for those characters. So while we may %-escape the utf-8
> form, the server might decode it as 8859-1. The server is not even
> necessarily going to use its filesystem's encoding.
>
> If this is done, it's probably a storage format thing: the store
> specifies that an id is actually written to the transport escaped in a
> particular way.
>
> Note also that urllib.quote doesn't escape all of the characters which
> need to be escaped for the Windows filesystem - so it'd need to be
> something different anyhow.
>
> If we did use %-escaping then those characters will need to be
> doubly-escaped when sent over http. So a Unicode character can expand
> to 3 utf-8 bytes, each of which is 3 quoted bytes '%ab'. Then to send
> that over http requires the % characters to be quoted again, expanding
> each to 3 bytes. So each unicode character can expand to 15 bytes,
> which is faintly ridiculous.
Well, actually, utf-8 at least specifies the possibility for characters
to go out to 5 bytes. So you could theoretically have one unicode
character expand to 25 characters.
I wouldn't mind a different escaping, simply because that is really ugly.
But I agree with Aaron, that for some systems it can be nice to give
real meaning to the revision identifiers.
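
To make the expansion arithmetic concrete, here is a minimal sketch, assuming
Python 3's urllib.parse.quote (the successor to the urllib.quote mentioned
above). The helper names and the sample character are made up for
illustration; this is not what bzr does. Note that quote() leaves "/"
unescaped by default, so the sketch passes safe="" to escape it as well.

```python
from urllib.parse import quote


def escape_once(text):
    """%-escape the UTF-8 bytes of an id (hypothetical helper)."""
    # safe="" so that "/" is escaped too; the default would leave it alone.
    return quote(text.encode("utf-8"), safe="")


def escape_twice(text):
    """Escape again, as would happen when the escaped name is put in a URL."""
    # Each "%" from the first pass becomes "%25" here.
    return quote(escape_once(text), safe="")


if __name__ == "__main__":
    euro = "\u20ac"  # EURO SIGN, 3 bytes in UTF-8
    once = escape_once(euro)
    twice = escape_twice(euro)
    print(once, len(once))    # 3 bytes, 3 characters each
    print(twice, len(twice))  # 3 bytes, 5 characters each
```

Each UTF-8 byte costs 3 characters after one pass and 5 after two, so the
3-byte case gives the 15 quoted above. Current UTF-8 (RFC 3629) caps a
character at 4 bytes, which would give 20; the older definition allowed
longer sequences, hence the larger figure.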
>
>>> 4) Refactoring of metadata. We would like to split things under .bzr
>>> into:
>>> .bzr/checkout/
>>> .bzr/branch/
>>> .bzr/repository/
>>> Which will help ease the transition when we start separating where
>>> these objects reside.
>
> That sounds good.
>
>> I'd like us to get into the habit of small, frequent format changes
>> rather than big painful experiences. I figure the first one will be
>> traumatic *code wise* as we put in place the needed facilities to select
>> formats for new branches, convert remote branches etc. And for that
>> reason I think the first format change should be a No-op change - that
>> is, a format number bump with no actual changes to the disk format - to
>> let us get the infrastructure right.
>
> And, for example, test that we can create and read both formats from
> within the code and the test suite.
>
Are you wanting to have more 'adapt' patterns, to provide multiple
branch formats?
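
Something like the following is roughly what I picture for the no-op bump.
It is only a sketch: every name here (the registry, the classes, the format
strings) is hypothetical and none of it is existing bzrlib API. The idea is
that the second format differs from the first only in its format string, so
the selection code can be exercised without touching the disk layout.

```python
class UnknownFormatError(Exception):
    """Raised when a format string is not in the registry (hypothetical)."""


# Hypothetical registry mapping a format string to the class that handles it.
_FORMATS = {}


def register_format(cls):
    """Class decorator that records a format under its format string."""
    _FORMATS[cls.format_string] = cls
    return cls


@register_format
class ExampleFormat1:
    format_string = "Example branch format 1\n"

    def describe(self):
        return "example format 1"


@register_format
class ExampleFormat2(ExampleFormat1):
    # No-op bump: a new format string, same behaviour inherited from format 1.
    format_string = "Example branch format 2\n"

    def describe(self):
        return "example format 2"


def open_format(format_file_contents):
    """Pick the handler for the text read from a format marker file."""
    try:
        return _FORMATS[format_file_contents]()
    except KeyError:
        raise UnknownFormatError(format_file_contents)
```

A test suite could then loop over the registered formats and check that each
one can be both created and read back.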
John
=:->