Stephen J. Turnbull
stephen at xemacs.org
Tue Jun 2 07:27:26 BST 2009
Robert Collins writes:
> Unicode in python 2.x is terribly slow: it's four times the memory size
> (for linux builds at least),
Er, what are you putting in memory, that this matters?
Anyway, although Python has (correctly IMO) settled on widechars
internally, not UTF-8, the actual choice of width is up to the
builder. By popular demand Python must provide a 2-byte version (for
Windows) and a 4-byte version (for Uli Drepper-afflicted systems), and
that's not going to change. I doubt that distros are going to buck
glibc on this issue, though, so it's UCS-4 for Linux, I'm sure.
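If you want to check which width a given interpreter was built with, a minimal sketch follows. The helper name build_kind is mine, not anything in the stdlib; sys.maxunicode is the real attribute. Note that on Python 3.3 and later the value is always 0x10FFFF, so the narrow/wide distinction only means something on older builds.

```python
import sys

def build_kind():
    """Classify this interpreter by its largest supported code point.

    Hypothetical helper for illustration. sys.maxunicode is 0xFFFF on a
    narrow (2-byte) build and 0x10FFFF on a wide (4-byte) build.
    """
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(hex(sys.maxunicode), build_kind())
```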
> and conversion is painfully slow.
In what context? Maybe it's going to be slower than libc's iconv
functions, but it's certainly comparable to XEmacs coding systems (at
least for ASCII, ISO 8859-1, Unicode, and CJK encodings; I'm not sure
about Latin-X for X != 1 and other banana-flavored variants). For
Unicode, XEmacs can just about keep up with disk I/O on all the
machines I've ever had; for CJK it's somewhat slower than disk I/O,
but still not really a factor. I haven't benchmarked Python codecs
that way, but this is the first time I've *ever* heard them called
"painfully slow". There are no bugs for "codec slow" on
bugs.python.org (except issue2857, which is clearly something else).
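For anyone who wants numbers rather than impressions, here is a rough sketch of how one might time the codecs with the stdlib timeit module. The function name time_decode, the sample buffer, and the repeat counts are all my own assumptions, and this measures only bulk decoding of one large buffer, not the piecemeal case.

```python
import timeit

def time_decode(data, encoding, repeat=5, number=20):
    """Best total seconds for `number` decodes of `data` (hypothetical helper)."""
    return min(timeit.repeat(lambda: data.decode(encoding),
                             repeat=repeat, number=number))

# Assumed sample: about 1.3 MB of plain ASCII, valid in all three encodings.
sample = ("hello, world " * 100000).encode("ascii")
for enc in ("ascii", "latin-1", "utf-8"):
    secs = max(time_decode(sample, enc), 1e-9)  # guard against a zero reading
    print(enc, round(len(sample) * 20 / secs / 1e6), "MB/s")
```

Comparing the printed throughput against your disk's sequential read rate gives the same kind of comparison described above for XEmacs.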
> We've accommodated that in our code by being very careful about
> whether we need to convert, and trying to only ever convert
> once. Do you happen to know if this is addressed in python 3?
I don't understand what you want addressed. The speed? Speed is not
going to change dramatically for the Unicode and CJK codecs; they're
written in C (and AFAIK always have been). The code is better than I could write
(which is not saying all that much, but I do know most of the common
optimizations used in writing codecs). If you're converting bits and
pieces at a time (as happens in email, for example) you're going to
take the function call hit, etc. I suppose that would be the case for
bzr, too, but I'm not familiar with the internals.
But there are other reasons for converting exactly once on the way in,
and exactly once on the way out. Correctness reasons. Specifically,
you can't be sure that users are going to type the same thing on the
command line that is in the directory on disk, especially in a Unicode
environment. So you want to normalize as well as convert. It's true
that in most cases file systems like HFS+ will DTRT for you, but
occasionally they can get confused. This is most frequent in my
experience when you're doing cross-platform work (and I use VCSes for
that a lot...).
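To make the normalization point concrete, here is a minimal Python 3 sketch. The helper decode_once and the choice of NFC are my assumptions for illustration; unicodedata.normalize is the real stdlib call. The two byte strings spell the same visible name, once with a precomposed e-acute and once with a combining accent (the decomposed style that HFS+ tends to store).

```python
import unicodedata

def decode_once(raw, encoding="utf-8", form="NFC"):
    """Decode bytes and normalize in one step on the way in (hypothetical helper)."""
    return unicodedata.normalize(form, raw.decode(encoding))

composed = "caf\u00e9".encode("utf-8")     # precomposed U+00E9
decomposed = "cafe\u0301".encode("utf-8")  # 'e' followed by combining acute U+0301

print(composed == decomposed)                            # byte strings differ
print(decode_once(composed) == decode_once(decomposed))  # normalized text matches
```

Comparing the raw bytes would treat these as two different files, which is exactly the cross-platform confusion I mean.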
> The clear separation of bytes and text is useful for us, because it
> lets us stay in bytes all the way until we actually want to render
> strings for users.
It's not clear to me whether or not that's a good idea. The thing is,
although POSIX defines a file name basically as a sequence of octets
containing no NULs, people think of them as character strings.
Treating them as str (py3k-style, ie, Unicode) internally is going to
buy a bit of intuitiveness now, and I expect over the life of Python 3
they'll be moving more and more toward full Unicode correctness (as an
option; GvR insists that a Python str is and will remain a sequence of
2-byte units, to be interpreted by the stdlib as UTF-16 characters,
but otherwise not to be encumbered by Unicode correctness at the
language and stdlib levels).
For example, PEP 383 (Python 3 only) provides for roundtripping
non-decodable sequences of bytes in system interfaces (specifically,
directory listings, command line items, and environment variables).
Of course this will work out in Python 2 because you're doing bytes,
but it's quite painful on Windows because of the deficiencies in the
mbcs interface. PEP 383 provides a standard workable interface on all
systems currently capable of running Python 3.
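A small sketch of what that roundtripping looks like, assuming Python 3.1 or later, where PEP 383's "surrogateescape" error handler is available. The function name roundtrip and the sample file name are made up for the example.

```python
def roundtrip(raw, encoding="utf-8"):
    """Decode possibly-invalid bytes to str and encode them back unchanged."""
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF in the str.
    text = raw.decode(encoding, "surrogateescape")
    # The same handler turns those surrogates back into the original bytes.
    return text.encode(encoding, "surrogateescape")

raw = b"report-\xff.txt"  # 0xff can never appear in valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))            # the bad byte shows up as an escaped surrogate
print(roundtrip(raw) == raw)  # the original bytes survive the trip
```

The str you get is fine for passing back to the OS, but it is not valid Unicode text, so encoding it strictly (for display, say) will raise an error unless you handle the surrogates.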
> I don't know if we have the willpower to support two languages for any
> length of time; I suspect our changeover will be driven by the
> availability of python 3.x (or backports thereof) in long term releases
> of distributions.
"Support", of course you should avoid that. That's what everybody is
saying about their commercial/supported releases, from Twisted to
Django to Zope to eGenix.
On the other hand, from the point of view of an open source
development community, the Python 3 developers really hammered on the
"(and preferably only one)" in "TOOWTDI". Python 2.7 has most of the
features of 3.0, but it has an awful lot of deprecated cruft, too.
Although the Python 3 docs aren't quite as good as those for Python 2.6
or Python 2.7 yet, Python 3 is a lot simpler to understand for the new hacker
IMO. And it's much nicer to program for interfaces to humans (ie,
using text from the get-go instead of having str be the encoded
version) in Python 3.