[merge] cache encoding

Martin Pool mbp at canonical.com
Fri Aug 18 07:54:08 BST 2006


On 17 Aug 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:

> Actually, while digging into it Testament asserts that revision ids are
> ascii. (or at least they are not Unicode).
> 
> The first line in __init__ is:
> self.revision_id = str(rev.revision_id)
> 
> Now, this will only be triggered if people are using signatures or
> bundles, because I don't believe we generate testaments otherwise.

Given Robert's statement that people have Unicode ids in existing
repositories, our only feasible choice seems to be to define that
Unicode is allowed, unless we want to require people to discard that
history.  It follows that we should add a test that makes a testament
with non-ascii revision and file ids, and fix the code to pass.
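
A rough sketch of the failure and of the kind of check such a test would
make.  It is written for current Python rather than the Python 2 of this
thread, so to_ascii_id() only mimics what str() did to a unicode object
back then (an implicit ASCII encode).  to_utf8_id() and
ascii_conversion_fails() are hypothetical names for illustration, not
bzrlib code.

```python
def to_ascii_id(revision_id):
    # Mimics Python 2's str(unicode_obj): an implicit ASCII encode,
    # which raises UnicodeEncodeError on any non-ASCII character.
    return revision_id.encode('ascii')


def to_utf8_id(revision_id):
    # Hypothetical fix: encode explicitly as UTF-8 so non-ASCII ids survive.
    return revision_id.encode('utf-8')


def ascii_conversion_fails(revision_id):
    # True if the ASCII-only conversion rejects this id.
    try:
        to_ascii_id(revision_id)
    except UnicodeEncodeError:
        return True
    return False
```

A test along these lines would feed an id such as u'rev-\u00e5-1' through
the testament code and assert that it no longer raises.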

> >>> Right, but if someone is converting from a source which has nonascii
> >>> bits, they can always do the escaping themselves, in the code which
> >>> specifically has to support it.
> >> Sure. But the only Unicode => ascii escaping I know of is urlescaping,
> >> though maybe UTF-7 would fall under that category as well. The problem
> >> with url escaping is that it has to be escaped again at the next layer.
> > 
> > I'd suggest something like '_%04x' on the unicode values - concise and
> > not needing double escaping.
> 
> Well, the % still needs double escaping.

I meant that as a Python format string - so unrepresentable characters
would be turned into an underscore followed by four hex chars.  This
should be safe almost anywhere and causes relatively little expansion.
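
To make that concrete, here is a minimal sketch of the escaping.  The
function name escape_id is made up, and treating "unrepresentable" as
"anything outside ASCII" is an assumption on my part.

```python
def escape_id(text):
    """Replace each non-ASCII character with '_' plus its code point in hex."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Plain ASCII passes through untouched.
            out.append(ch)
        else:
            # '%04x' pads to at least four hex digits; code points above
            # U+FFFF would come out longer than four.
            out.append('_%04x' % ord(ch))
    return ''.join(out)
```

With this, escape_id(u'caf\u00e9') gives 'caf_00e9'.  As written, literal
underscores are left alone, so the mapping is not reversible; a real
scheme would have to settle that too.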

> > I would be a little happier only supporting ascii, just to be
> > conservative and avoid trouble later on.  But I won't insist on it.
> 
> I think we could happily support just bytestreams, rather than having
> them be unicode. As you say, it could even be reasonable to go to ascii
> only. There isn't a really good reason to encode/decode from unicode for
> something that is just a handle. It doesn't have to have physical
> meaning like file paths.
> 
> So the only real issues are that we might have unicode entries in the
> wild, though the probability is low. And that our current serialization
> form requires them to be either Unicode or ascii.

I'd really rather not allow arbitrary non-utf-8 binary, just because it
will cause trouble if we ever do need to decode them.  And the general
policy is that strings are Unicode, so defining some strings to be
8-bit binary is just asking for trouble.

Of course having them stay in utf-8 and be treated by the program as
byte strings, as an optimization, is totally fine.  But the interface
requirement is that they're utf-8.
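
One way to picture that interface requirement: ids travel as bytes, but
are checked to be valid utf-8 at the boundary.  This is only a sketch;
check_utf8_id is a hypothetical helper, not an existing bzrlib function.

```python
def check_utf8_id(id_bytes):
    """Return id_bytes unchanged if it is valid UTF-8, else raise ValueError."""
    try:
        # Decode only to validate; the decoded value is thrown away.
        id_bytes.decode('utf-8')
    except UnicodeDecodeError as e:
        raise ValueError('id is not valid UTF-8: %r' % (id_bytes,)) from e
    return id_bytes
```

The id stays a byte string throughout, so the optimization is kept, while
arbitrary 8-bit binary such as b'\xff' is rejected up front.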

-- 
Martin



