[merge] cache encoding

John Arbash Meinel john at arbash-meinel.com
Thu Aug 17 19:42:16 BST 2006


Martin Pool wrote:
> On 14 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> 
>> Sure. But we don't check that, so we should update the email-address
>> parser to require it. (Or at least the email => revision_id code).
>>
>> There is also punycode domain names. Which might be nicer as a real
>> unicode string internally.
> 
> I can see presenting them to the user as unicode, but I don't see why
> we'd want them internally as unicode.  They're just ids; the computer
> doesn't care.

Actually, while digging into it I found that Testament assumes revision
ids are ascii (or at least that they are not Unicode).

The first line in __init__ is:

    self.revision_id = str(rev.revision_id)

Now, this will only be triggered if people are using signatures or
bundles, because I don't believe we generate testaments otherwise.
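
For illustration: in Python 2, str() on a unicode object encodes it with
the default ascii codec, so it raises UnicodeEncodeError on any non-ascii
character. Below is a rough sketch of the same check made explicit. The
helper name require_ascii_id is hypothetical and not part of bzrlib, and
the sample ids are made up.

```python
def require_ascii_id(revision_id):
    """Return the id as an ascii bytestring (hypothetical helper).

    .encode("ascii") raises UnicodeEncodeError, a ValueError subclass,
    if the id contains any non-ascii character.
    """
    return revision_id.encode("ascii")


# A plain ascii id passes through unchanged, as bytes.
ok = require_ascii_id(u"john@arbash-meinel.com-20060817-example")

# A non-ascii id is rejected.
try:
    require_ascii_id(u"jos\u00e9@example.com-20060817-example")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```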

> 
>>> Right, but if someone is converting from a source which has nonascii
>>> bits, they can always do the escaping themselves, in the code which
>>> specifically has to support it.
>> Sure. But the only Unicode => ascii escaping I know of is urlescaping,
>> though maybe UTF-7 would fall under that category as well. The problem
>> with url escaping is that it has to be escaped again at the next layer.
> 
> I'd suggest something like '_%04x' on the unicode values - concise and
> not needing double escaping.

Well, the % still needs double escaping.
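
For reference, here is a minimal sketch of that kind of escaping, reading
'_%04x' as a Python format string, so that the output is '_' followed by
four hex digits. The function names are hypothetical. Escaping '_' itself
is my own assumption, added so the mapping can be reversed, and code
points above U+FFFF are simply rejected.

```python
import re


def escape_id(text):
    """Escape non-ascii characters, and '_' itself, as '_' + 4 hex digits."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # '%04x' would emit 5+ digits here, which is ambiguous to decode.
            raise ValueError("code point above U+FFFF: %r" % (ch,))
        if ch == "_" or code > 0x7F:
            out.append("_%04x" % code)
        else:
            out.append(ch)
    return "".join(out)


def unescape_id(escaped):
    """Reverse escape_id by decoding each '_' + 4 hex digit group."""
    return re.sub(
        r"_([0-9a-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        escaped,
    )


escaped = escape_id("jos\u00e9_x")  # 'jos_00e9_005fx'
round_trip = unescape_id(escaped)
```

The escaped form contains only ascii letters, digits and '_'.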

> I would be a little happier only supporting ascii, just to be
> conservative and avoid trouble later on.  But I won't insist on it.

I think we could happily support just bytestrings, rather than having
them be unicode. As you say, it could even be reasonable to go to ascii
only. There isn't really a good reason to encode/decode from unicode for
something that is just a handle. It doesn't have to have physical
meaning the way file paths do.

So the only real issues are that we might have unicode entries in the
wild (though the probability is low), and that our current serialization
form requires them to be either Unicode or ascii.

John
=:->
