Robert Collins robert.collins at
Tue Jun 2 07:43:19 BST 2009

On Tue, 2009-06-02 at 15:27 +0900, Stephen J. Turnbull wrote:
> Robert Collins writes:
>  > Unicode in python 2.x is terribly slow: it's four times the memory size
>  > (for linux builds at least),
> Er, what are you putting in memory, that this matters?

Off the top of my head:
 - paths of various sorts - utf8 (what's in the dirstate),
"native"-disk-encoding (utf8-assumed for linux, unicode for windows).
Unicode-normalising file systems get into fun territory here.
   + for these the size isn't hugely critical, even a 100K path tree is
only going to be what, 20ish MB with 4-byte codepoints.
   + converting these to and from native encoding has proven to be a
significant bottleneck. Your theories about performance notwithstanding,
bzr status would be significantly slower if we didn't keep conversions
to a minimum (and on linux with a utf8 locale we avoid converting at
 - user file content, which ranges in size from 100-byte files to
hundreds of MB.
   + we provide colourisation and syntax highlighting in various tools
built on the core bzrlib; for these we can reasonably expect the higher
level libraries to start wanting 'text' rather than 'basestr' to work
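
As a rough check of the "20ish MB" figure for a 100K path tree, here is a back-of-the-envelope sketch. The average path length of 50 characters is an assumption made for illustration, not a number from this thread; the 4 bytes per codepoint is the wide-unicode build size mentioned above.

```python
# Back-of-the-envelope estimate of path storage in a 4-byte-per-codepoint build.
PATH_COUNT = 100_000        # "a 100K path tree"
AVG_PATH_CHARS = 50         # assumed average path length, not from the thread
BYTES_PER_CODEPOINT = 4     # wide unicode build

total_bytes = PATH_COUNT * AVG_PATH_CHARS * BYTES_PER_CODEPOINT
total_mb = total_bytes / 1e6

print("approx. %.0f MB of character data" % total_mb)
```

This counts character data only; per-object overhead would add to it.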

> But there are other reasons for converting exactly once on the way in,
> and exactly once on the way out.

Sure, and I think you'll find we do that. We just avoid converting if we
don't have to. Internally, in the inventory, we're utf8 all the way.

>   Correctness reasons.  Specifically,
> you can't be sure that users are going to type the same thing on the
> command line that is in the directory on disk, especially in a Unicode
> environment.  So you want to normalize as well as convert.  It's true
> that in most cases file systems like HFS+ will DTRT for you, but
> occasionally they can get confused.  This is most frequent in my
> experience when you're doing cross-platform work (and I use VCSes for
> that a lot...).

HFS+ is pure evil from bzr's perspective, because 'open()' + 'listdir()'
doesn't round trip. John has all the details in his head, if you want
more info :).
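
A minimal sketch of why a normalising file system can break the round trip, assuming the file system hands names back in a decomposed form roughly like NFD (HFS+ uses its own variant, so this is only an approximation). The file name is a made-up example, and the code only compares strings; it does not touch the disk.

```python
import unicodedata

# What a user might pass to open(): "é" as one precomposed code point (NFC).
typed = "caf\u00e9.txt"

# What a decomposing file system might return from listdir():
# "e" followed by a combining acute accent (NFD).
listed = unicodedata.normalize("NFD", typed)

# The two names render identically but are different code point sequences,
# so their UTF-8 bytes differ as well.
same_bytes = typed.encode("utf-8") == listed.encode("utf-8")

# Normalising both sides to one form makes the comparison succeed again.
same_text = unicodedata.normalize("NFC", listed) == typed

print(same_bytes, same_text)
```

A tool that stores the bytes it was given and later compares them with what the directory listing returns sees two different names unless it normalises both sides first.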

>  > The clear separation of bytes and text is useful for us, because it
>  > lets us stay in bytes all the way until we actually want to render
>  > strings for users.
> It's not clear to me whether or not that's a good idea.  The thing is,
> although POSIX defines a file name basically as a sequence of octets
> containing no NULs, people think of them as character strings.
> Treating them as str (py3k-style, ie, Unicode) internally is going to
> buy a bit of intuitiveness now, and I expect over the life of Python 3
> they'll be moving more and more toward full Unicode correctness (as an
> option; GvR insists that a Python str is and will remain a sequence of
> 2-byte units, to be interpreted by the stdlib as UTF-16 characters,
> but otherwise not to be encumbered by Unicode correctness at the
> language and stdlib levels).
> For example, PEP 383 (Python 3 only) provides for roundtripping
> non-decodable sequences of bytes in system interfaces (specifically,
> directory listings, command line items, and environment variables).
> Of course this will work out in Python 2 because you're doing bytes,
> but it's quite painful on Windows because of the deficiencies in the
> mbcs interface.  PEP 383 provides a standard workable interface on all
> systems currently capable of running Python 3.

Nondecodable sequences are interesting, but I'm not convinced by PEP 383
in the context of a VCS. The problem is that the nondecodable name may
be more decodable somewhere else; and that leads to a lovely
roundtripping problem. A second related problem is that filenames may be
referenced as bytes by other software, so a VCS that does what bzr does,
which is honour local encoding for file names, may actually *cause*
interoperability issues [until that software, like 'make', gets fixed].
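
The "more decodable somewhere else" concern can be sketched with Python 3's "surrogateescape" error handler, which is the mechanism PEP 383 defines. The file name and the choice of latin-1 as the "other" machine's encoding are assumptions for illustration only.

```python
# A file name written on a latin-1 system; these bytes are not valid UTF-8.
raw = "caf\u00e9".encode("latin-1")          # b'caf\xe9'

# On a UTF-8 system, PEP 383 smuggles the bad byte through as a lone surrogate.
here = raw.decode("utf-8", "surrogateescape")

# On the same system the original bytes come back unchanged.
round_tripped = here.encode("utf-8", "surrogateescape")

# On a latin-1 system the very same bytes decode to ordinary text.
elsewhere = raw.decode("latin-1")

print(round_tripped == raw, here == elsewhere)
```

The bytes survive locally, but the two machines end up with different text for the same file name, which is the round-tripping problem a VCS has to care about.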

>  > I don't know if we have the willpower to support two languages for any
>  > length of time; I suspect our changeover will be driven by the
>  > availability of python 3.x (or backports thereof) in long term releases
>  > of distributions.
> "Support", of course you should avoid that.  That's what everybody is
> saying about their commercial/supported releases, from Twisted to
> Django to Zope to eGenix.
> On the other hand, from the point of view of an open source
> development community, the Python 3 developers really hammered on the
> "(and preferably only one)" in "TOOWTDI".  Python 2.7 has most of the
> features of 3.0, but it has an awful lot of deprecated cruft, too.
> Although the docs aren't quite as good as Python 2.6 or Python 2.7
> yet, the language is a lot simpler to understand for the new hacker
> IMO.  And it's much nicer to program for interfaces to humans (ie,
> using text from the get-go instead of having str be the encoded
> version) in Python 3.

So far, I've seen few things that would really make our life
dramatically easier. 'with' would be great. And a 70% or so decrease in
runtime would be nice too.
