Bazaar internal metadata encoding

Martin Pool mbp at sourcefrog.net
Fri Jul 6 00:55:33 BST 2007


On 7/6/07, Goffredo Baroncelli <kreijack at tiscalinet.it> wrote:
> Hi all,
>
> What encoding does bazaar use for the metadata it stores? Is it
> correct to assume that the metadata is stored as utf8, or can that
> depend on the locale encoding?
> Finally, how are paths and the other information stored?

In every case that I am aware of they are stored as utf-8.  The
general plan is that we interpret input from the user as being in
their locale encoding, decode it into a python unicode() object in
memory (i.e. UCS-2 or UCS-4, depending on how Python was built), and
then write it as utf-8 into our files.  The same is done for paths,
except that they are expected to be in the filesystem encoding, which
may be different from the user locale encoding.  As far as I know we
never interpret the encoding of file contents.
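
As a rough illustration of that plan (not bzr's actual code), the
sketch below decodes bytes from an assumed source encoding and stores
them as utf-8.  The function names are made up for this example, and
it is written for Python 3, where str plays the role that unicode()
plays in Python 2.

```python
import sys


def to_stored_bytes(raw_input, user_encoding):
    """Decode user-supplied bytes from their encoding, then store as UTF-8."""
    text = raw_input.decode(user_encoding)   # in-memory unicode string
    return text.encode("utf-8")              # on-disk representation


def path_to_stored_bytes(raw_path, fs_encoding=None):
    """Same idea for paths, but decoding with the filesystem encoding."""
    if fs_encoding is None:
        # Assumption: fall back to what Python reports for this machine.
        fs_encoding = sys.getfilesystemencoding()
    return raw_path.decode(fs_encoding).encode("utf-8")


def from_stored_bytes(stored):
    """Reading back needs no knowledge of the original encoding."""
    return stored.decode("utf-8")


# A commit message typed in a Latin-1 locale.
latin1_message = "caf\u00e9".encode("latin-1")
stored_message = to_stored_bytes(latin1_message, "latin-1")
stored_path = path_to_stored_bytes("caf\u00e9.txt".encode("latin-1"), "latin-1")
```

Once stored, from_stored_bytes(stored_message) gives back the same
text on any machine, whatever its locale.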

I think generally this is a good approach because then you can read
the data back out without needing to know what encoding was originally
used or how to read it.

This works well except for two points:

- Some users have files whose names are not actually valid in their
filesystem encoding -- particularly on Linux, where there really is no
standardized filesystem encoding and it is easy to have, for example,
files in different directories with inconsistent name encodings.  In
those situations just doing no interpretation at all, as most unix
tools do, would be better in the short term.  The drawback is that it
may be hard to know how to correctly extract the data on another
machine, but in the short term people may claim they don't care.

- In the common case of a utf-8 environment it is inefficient to go
from utf-8 to ucs-2/4 and back to utf-8 every time data goes in or
out, and furthermore for typical (mostly ASCII) strings ucs-2 is twice
the size of utf-8, and ucs-4 four times.  So it might be good to have
fast paths where, when we know something is utf-8, we just keep it as
a byte string.
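
Both points can be sketched in a few lines of Python 3.  This is only
an illustration under assumptions: store_text is a hypothetical
helper, not an existing bzr function, and utf-16-le / utf-32-le stand
in for the in-memory ucs-2 / ucs-4 representations.

```python
import codecs

# Point 1: a file name written in Latin-1 cannot be decoded when the
# filesystem encoding is assumed to be utf-8.
latin1_name = "caf\u00e9.txt".encode("latin-1")
try:
    latin1_name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Point 2: size of a typical ASCII name in each representation.
ascii_name = "README.txt"
size_utf8 = len(ascii_name.encode("utf-8"))
size_ucs2 = len(ascii_name.encode("utf-16-le"))
size_ucs4 = len(ascii_name.encode("utf-32-le"))


def store_text(raw, source_encoding):
    """Return the bytes to store, skipping the round trip for utf-8 input."""
    if codecs.lookup(source_encoding).name == "utf-8":
        # Fast path: already in the storage encoding, keep the byte string.
        return raw
    # Slow path: decode to unicode, then re-encode as utf-8.
    return raw.decode(source_encoding).encode("utf-8")


fast = store_text(b"README.txt", "UTF8")
slow = store_text(latin1_name, "latin-1")
```

Note that the fast path here trusts the caller: it does not check
that the bytes really are valid utf-8, so whether to validate is a
separate trade-off.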

-- 
Martin


