robert.collins at canonical.com
Tue Jun 2 07:43:19 BST 2009
On Tue, 2009-06-02 at 15:27 +0900, Stephen J. Turnbull wrote:
> Robert Collins writes:
> > Unicode in python 2.x is terribly slow: it's four times the memory size
> > (for linux builds at least),
> Er, what are you putting in memory, that this matters?
Off the top of my head:
- paths of various sorts - utf8 (what's in the dirstate),
"native" disk encoding (utf8-assumed for linux, unicode for windows).
Unicode-normalising file systems get into fun territory here.
+ for these the size isn't hugely critical, even a 100K path tree is
only going to be what, 20ish MB with 4-byte codepoints.
+ converting these to and from native encoding has proven to be a
significant bottleneck. Your theories about performance notwithstanding,
bzr status would be significantly slower if we didn't keep conversions
to a minimum (and on linux with a utf8 locale we avoid converting at
- user file content, which ranges in size from 100-byte files to
hundreds of MB.
+ we provide colourisation and syntax highlighting in various tools
built on the core bzrlib; for these we can reasonably expect the higher
level libraries to start wanting 'text' rather than 'basestr' to work
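To put a number on the path-size estimate above, here is a rough back-of-envelope sketch. The average path length of 50 characters is my assumption (it is what makes 100K paths come out at about 20 MB with 4-byte codepoints); per-object overhead is ignored.

```python
# Back-of-envelope payload size for an in-memory path tree.
NUM_PATHS = 100_000      # the "100K path tree" mentioned above
AVG_PATH_CHARS = 50      # ASSUMED average path length, not a measured figure


def tree_size_bytes(num_paths, avg_chars, bytes_per_char):
    """Character payload only; ignores per-string object overhead."""
    return num_paths * avg_chars * bytes_per_char


ucs4_bytes = tree_size_bytes(NUM_PATHS, AVG_PATH_CHARS, 4)  # 4-byte codepoints
utf8_bytes = tree_size_bytes(NUM_PATHS, AVG_PATH_CHARS, 1)  # ASCII-only paths as utf8

# Cross-check the per-path figures against the real codecs.
sample = "x" * AVG_PATH_CHARS
sample_utf32_len = len(sample.encode("utf-32-le"))  # no BOM with the -le variant
sample_utf8_len = len(sample.encode("utf-8"))

print("4-byte codepoints: %.1f MB" % (ucs4_bytes / 1e6))
print("utf8 (ASCII only): %.1f MB" % (utf8_bytes / 1e6))
```

For mostly-ASCII paths that is the factor of four from the start of the thread; paths with many non-ASCII characters narrow the gap, since utf8 then needs 2 to 4 bytes per character.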
> But there are other reasons for converting exactly once on the way in,
> and exactly once on the way out.
Sure, and I think you'll find we do that. We just avoid converting if we
don't have to. Internally, in the inventory, we're utf8 all the way.
> Correctness reasons. Specifically,
> you can't be sure that users are going to type the same thing on the
> command line that is in the directory on disk, especially in a Unicode
> environment. So you want to normalize as well as convert. It's true
> that in most cases file systems like HFS+ will DTRT for you, but
> occasionally they can get confused. This is most frequent in my
> experience when you're doing cross-platform work (and I use VCSes for
> that a lot...).
HFS+ is pure evil from bzr's perspective, because 'open()' + 'listdir()'
doesn't round trip. John has all the details in his head, if you want
more info :).
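A minimal sketch of the round-trip problem, using only the stdlib and no real file system: it assumes the file system hands back a decomposed (NFD-style) form of the name that was passed to 'open()', which is roughly what HFS+ does. The file name is made up for illustration.

```python
import unicodedata


def nfc(name):
    """Normalise a name to composed form (NFC) before comparing."""
    return unicodedata.normalize("NFC", name)


# Name as the user typed it: precomposed e-acute (U+00E9).
typed = "caf\u00e9.txt"

# ASSUMPTION: simulate what a normalising file system returns from
# listdir(): the decomposed form, 'e' followed by U+0301.
listed = unicodedata.normalize("NFD", typed)

naive_match = (typed == listed)                 # False: code points differ
normalised_match = (nfc(typed) == nfc(listed))  # True once both are NFC

print(naive_match, normalised_match)
print(len(typed), len(listed))  # one extra code point in the decomposed form
```

Comparing normalised forms papers over the mismatch in memory, but the bytes stored in the inventory and the bytes on disk still differ, which is where the trouble starts.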
> > The clear separation of bytes and text is useful for us, because it
> > lets us stay in bytes all the way until we actually want to render
> > strings for users.
> It's not clear to me whether or not that's a good idea. The thing is,
> although POSIX defines a file name basically as a sequence of octets
> containing no NULs, people think of them as character strings.
> Treating them as str (py3k-style, ie, Unicode) internally is going to
> buy a bit of intuitiveness now, and I expect over the life of Python 3
> they'll be moving more and more toward full Unicode correctness (as an
> option; GvR insists that a Python str is and will remain a sequence of
> 2-byte units, to be interpreted by the stdlib as UTF-16 characters,
> but otherwise not to be encumbered by Unicode correctness at the
> language and stdlib levels).
> For example, PEP 383 (Python 3 only) provides for roundtripping
> non-decodable sequences of bytes in system interfaces (specifically,
> directory listings, command line items, and environment variables).
> Of course this will work out in Python 2 because you're doing bytes,
> but it's quite painful on Windows because of the deficiencies in the
> mbcs interface. PEP 383 provides a standard workable interface on all
> systems currently capable of running Python 3.
Nondecodable sequences are interesting, but I'm not convinced by PEP 383
in the context of a VCS. The problem is that the nondecodable name may
be more decodable somewhere else; and that leads to a lovely
roundtripping problem. A second related problem is that filenames may be
referenced as bytes by other software, so a VCS that does what bzr does,
which is honour local encoding for file names, may actually *cause*
interoperability issues [until that software, like 'make', gets fixed].
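To illustrate the first problem, here is a small Python 3 sketch using the 'surrogateescape' error handler that PEP 383 defines. The file name and the choice of Latin-1 as the "somewhere else" encoding are made up for the example.

```python
# A name written on a Latin-1 system: 0xE9 is e-acute there,
# but the byte sequence is not valid UTF-8.
raw = b"caf\xe9.txt"

# On a UTF-8 system, PEP 383 smuggles the bad byte through as a
# lone surrogate (U+DCE9) so the name can round-trip to bytes.
escaped = raw.decode("utf-8", "surrogateescape")
round_tripped = escaped.encode("utf-8", "surrogateescape")

# On a Latin-1 system the very same bytes decode cleanly.
decoded_elsewhere = raw.decode("latin-1")

print(ascii(escaped))            # shows the escaped surrogate
print(round_tripped == raw)      # bytes survive the round trip
print(escaped == decoded_elsewhere)
```

The bytes survive, but the two systems end up with different text for the same file, so anything that stores or compares the decoded name no longer agrees across machines.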
> > I don't know if we have the willpower to support two languages for any
> > length of time; I suspect our changeover will be driven by the
> > availability of python 3.x (or backports thereof) in long term releases
> > of distributions.
> "Support", of course you should avoid that. That's what everybody is
> saying about their commercial/supported releases, from Twisted to
> Django to Zope to eGenix.
> On the other hand, from the point of view of an open source
> development community, the Python 3 developers really hammered on the
> "(and preferably only one)" in "TOOWTDI". Python 2.7 has most of the
> features of 3.0, but it has an awful lot of deprecated cruft, too.
> Although the Python 3 docs aren't quite as good as Python 2.6's or Python 2.7's
> yet, the language is a lot simpler to understand for the new hacker
> IMO. And it's much nicer to program for interfaces to humans (ie,
> using text from the get-go instead of having str be the encoded
> version) in Python 3.
So far, I've seen few things that would really make our life
dramatically easier. 'with' would be great. And a 70% or so decrease in
runtime would be nice too.