Bazaar on IronPython
gzlist at googlemail.com
Mon Jun 29 22:59:15 BST 2009
Thank you for looking over this. Filling in a couple of bits not
addressed in the other reply.
2009/6/29 Andrew Bennetts <andrew.bennetts at canonical.com>:
> I disagree. Python 3, like Python 2, has a type designed to hold 8-bit byte
> strings, and mostly Bazaar prefers to work in byte strings rather than
> needlessly decoding then re-encoding them. Obviously user input is
> typically text, not bytes, but much of the data Bazaar works with is bytes
> from disk or the network. And while some of our data like revision IDs are
> defined as being serialised as UTF-8, we almost never display them so it's
> much more efficient to handle them as bytestrings (less memory consumption,
> and no computation wasted on decoding/encoding). So I'd expect that Bazaar
> implemented on Python 3 would make heavy use of the “bytes” type, but Bazaar
> is implemented on Python 2, so that means “str”.
> IronPython is broken here, IMO. Python 2 (and 1!) clearly defines “str” as
> 8-bit bytestrings, and always has. By choosing to implement them
> differently IronPython has chosen to be arbitrarily incompatible. So it's
> implementing a language that is rather like Python, but very definitely not
> Python. Last time I chatted to an IronPython developer (over a year ago,
> admittedly) I got the impression that they realised this was a mistake and
> were considering how to fix it. Perhaps they're just waiting for everyone
> to move to Python 3?
This is a problem with their model, but it's a problem with bazaar's as well.
Bazaar expects to be able to str-format together any of: paths from
the filesystem, messages from the OS, metadata from bazaar, the
contents of files, and user input, then write it out to the terminal.
There are a number of places that make the effort to do the right kind
of conversions, but lots more that don't, so I frequently get junk
output of one kind or another. The root cause is the same problem
IronPython has - using a single type that treats random binary data
and text interchangeably.
This doesn't necessarily mean bazaar has to start using the unicode
type everywhere, but it does need a clear differentiation between
internal bytes and any other text from the environment. Relying on all
inputs and the terminal already being UTF-8, or on everything being
ASCII, does not work where you have a UTF-16 filesystem, a CP1252 user
environment, and a CP850 terminal.
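As a rough sketch of what differentiating at the boundaries could look
like (written for Python 3, where bytes and str are already separate
types; the three encoding constants and the helper names are made up
for illustration, and real code would query the environment rather
than hard-code them):

```python
# Hypothetical encodings matching the example environment above.
FS_ENCODING = "utf-16-le"     # how the filesystem stores names
USER_ENCODING = "cp1252"      # how user input arrives
TERMINAL_ENCODING = "cp850"   # what the console expects


def path_from_fs(raw):
    """Decode a filesystem name into text at the input boundary."""
    return raw.decode(FS_ENCODING)


def text_from_user(raw):
    """Decode user input into text at the input boundary."""
    return raw.decode(USER_ENCODING)


def bytes_for_terminal(text):
    """Encode text for display; unrepresentable characters become '?'."""
    return text.encode(TERMINAL_ENCODING, "replace")


# Simulated inputs: the same character has a different byte form in each place.
name = path_from_fs("caf\xe9.txt".encode("utf-16-le"))
message = text_from_user(b"renamed by Andr\xe9")

# In the middle everything is text, so formatting is safe.
line = "%s: %s" % (name, message)
out = bytes_for_terminal(line)

# What the terminal would show if the CP1252 bytes were written unconverted:
junk = b"Andr\xe9".decode(TERMINAL_ENCODING)
```

Only the helpers at the edges know about encodings; the formatting in
the middle never sees raw bytes from the environment.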
This is all resolvable, but will mean some changes to abstractions.
I'm particularly averse to interfaces like
bzrlib.xml8._encode_and_escape as commented in the patch - the caller
of an API *has* to know the provenance of a string it supplies;
after-the-fact heuristics are at best inefficient.
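To illustrate the difference, here is a minimal Python 3 sketch. None
of these functions are taken from bzrlib; the names and the fallback
rule are hypothetical, chosen only to contrast a guessing interface
with ones where the caller states what it is passing:

```python
from xml.sax.saxutils import escape


def escape_guessing(value):
    """Heuristic style: accept bytes or text and guess what bytes mean."""
    if isinstance(value, bytes):
        try:
            value = value.decode("utf-8")
        except UnicodeDecodeError:
            value = value.decode("latin-1")  # a guess, possibly wrong
    return escape(value).encode("utf-8")


def escape_text(text):
    """Explicit style: the caller promises this is decoded text."""
    if not isinstance(text, str):
        raise TypeError("expected text, got %r" % type(text))
    return escape(text).encode("utf-8")


def escape_utf8_bytes(raw):
    """Explicit style: the caller promises these bytes are UTF-8."""
    if not isinstance(raw, bytes):
        raise TypeError("expected bytes, got %r" % type(raw))
    # Decoding validates the promise instead of silently guessing.
    return escape(raw.decode("utf-8")).encode("utf-8")


# Two CP1252 characters whose bytes happen to also be valid UTF-8.
raw = "\xc3\xa9".encode("cp1252")

# The heuristic accepts them as UTF-8 and yields a different string.
guessed = escape_guessing(raw)

# A caller that knows the provenance decodes first and gets it right.
correct = escape_text(raw.decode("cp1252"))
```

The guessing version does not fail here, it just returns the wrong
text, which is the kind of junk output that is hard to track down
later.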
>> This line at the bottom of bzrlib.builtins:
>> from bzrlib.foreign import cmd_dpush
>> pulls in a bunch of extra imports, and makes a difference of about a
>> tenth of a second and a megabyte of disk read to `bzr rocks` on my
>> machine. Or twenty four seconds for IronPython 2.0.0...
> Ouch. Not sure why it would be so slow, bzrlib.foreign is a fairly slim
> module, and doesn't import much that wouldn't already be imported, except
> perhaps bzrlib.branch.
Yes, it seems mostly to be a question of pulling in things like branch
that would otherwise only be imported lazily. It's probably not quite
as bad as it appears as a number of them would be pulled in for
typical operations anyway.