Bazaar on IronPython

Mon Jun 29 22:59:15 BST 2009

Thank you for looking over this. Filling in a couple of bits not
addressed in the other reply.

2009/6/29 Andrew Bennetts <andrew.bennetts at canonical.com>:
>
> I disagree.  Python 3, like Python 2, has a type designed to hold 8-bit byte
> strings, and mostly Bazaar prefers to work in byte strings rather than
> needlessly decoding then re-encoding them.  Obviously user input is
> typically text, not bytes, but much of the data Bazaar works with is bytes
> from disk or the network.  And while some of our data like revision IDs are
> defined as being serialised as UTF-8, we almost never display them so it's
> much more efficient to handle them as bytestrings (less memory consumption,
> and no computation wasted on decoding/encoding).  So I'd expect that Bazaar
> implemented on Python 3 would make heavy use of the “bytes” type, but Bazaar
> is implemented on Python 2, so that means “str”.
>
> IronPython is broken here, IMO.  Python 2 (and 1!) clearly defines “str” as
> 8-bit bytestrings, and always has.  By choosing to implement them
> differently IronPython has chosen to be arbitrarily incompatible.  So it's
> implementing a language that is rather like Python, but very definitely not
> Python.  Last time I chatted to an IronPython developer (over a year ago,
> admittedly) I got the impression that they realised this was a mistake and
> were considering how to fix it.  Perhaps they're just waiting for everyone
> to move to Python 3?

This is a problem with their model, but it's a problem with Bazaar's as well.

Bazaar expects to be able to str-format together any of: paths from
the filesystem, messages from the OS, metadata from Bazaar, the
contents of files, and user input, then write the result out to the
terminal. A number of places make the effort to do the right kind of
conversions, but many more don't, so I frequently get junk output of
one kind or another. The root cause is the same problem IronPython
has: using a single type that treats arbitrary binary data and text
interchangeably.

This doesn't necessarily mean Bazaar has to start using the unicode
type everywhere, but it does need a clear distinction between
internal bytes and any text from the environment. Assuming that all
inputs and the terminal are already UTF-8, or that everything is
ASCII, does not work where you have a UTF-16 filesystem, a CP1252
user environment, and a CP850 terminal.
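
To make the encoding mismatch concrete, here is a minimal sketch of
decoding at the boundary and encoding only on output. It is written
for Python 3 so that it runs as-is, and the helper names (to_text,
for_terminal) are made up for illustration; they are not bzrlib
functions.

```python
def to_text(raw, source_encoding):
    """Decode bytes from a known source into text, once, at the boundary."""
    return raw.decode(source_encoding)


def for_terminal(text, terminal_encoding):
    """Encode text for output, replacing anything the terminal can't show."""
    return text.encode(terminal_encoding, "replace")


# The same word as it would arrive from a CP1252 user environment.
user_input = b"caf\xe9"

text = to_text(user_input, "cp1252")
terminal_bytes = for_terminal(text, "cp850")

# Passing the CP1252 bytes straight to a CP850 terminal shows the
# wrong character, which is the kind of junk output described above.
misread = user_input.decode("cp850")

print(repr(text), repr(terminal_bytes), repr(misread))
```

The point is only that the byte for the accented letter differs between
the two code pages, so bytes have to be tagged with where they came
from before they can be displayed correctly.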

This is all resolvable, but will mean some changes to abstractions.
I'm particularly averse to interfaces like
bzrlib.xml8._encode_and_escape, as commented in the patch: the caller
of an API *has* to know the provenance of a string it supplies, and
after-the-fact heuristics are at best inefficient.
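
As a rough sketch of the alternative, the caller could pick an entry
point according to what it already knows about its data, instead of
one function guessing the type afterwards. The two function names
below are hypothetical and are not part of bzrlib; the code is Python 3
and uses the standard library's xml.sax.saxutils.escape for the
escaping step.

```python
from xml.sax.saxutils import escape


def escape_text(text):
    """For callers holding decoded text; returns escaped UTF-8 bytes."""
    if not isinstance(text, str):
        raise TypeError("escape_text() needs text, got %r" % type(text))
    return escape(text).encode("utf-8")


def escape_utf8(data):
    """For callers holding bytes they know to be UTF-8 already."""
    if not isinstance(data, bytes):
        raise TypeError("escape_utf8() needs bytes, got %r" % type(data))
    # Decoding validates the caller's claim; bad input fails loudly here.
    return escape(data.decode("utf-8")).encode("utf-8")
```

With this split, a mistake about provenance shows up as a TypeError or
UnicodeDecodeError at the call site rather than as junk in the output.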

>> This line at the bottom of bzrlib.builtins:
>>     from bzrlib.foreign import cmd_dpush
>> pulls in a bunch of extra imports, and makes a difference of about a
>> tenth of a second and a megabyte of disk read to `bzr rocks` on my
>> machine. Or twenty four seconds for IronPython 2.0.0...
>
> Ouch.  Not sure why it would be so slow, bzrlib.foreign is a fairly slim
> module, and doesn't import much that wouldn't already be imported, except
> perhaps bzrlib.branch.

Yes, it seems mostly to be a question of pulling in things like branch
that would otherwise only be imported lazily. It's probably not quite
as bad as it appears, as a number of them would be pulled in for
typical operations anyway.
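
For anyone unfamiliar with the idea, deferring an import until first
use can be sketched as below. This is a simplified stand-in written
for Python 3, not Bazaar's own lazy-import machinery, and LazyModule is
a made-up name; colorsys is used only as a small standard-library
module to demonstrate with.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not set in __init__, i.e. the
        # ones that belong to the real module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


colorsys = LazyModule("colorsys")      # nothing imported yet
loaded_before = colorsys._module is not None
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)  # the import happens here
loaded_after = colorsys._module is not None
```

A module-level "from x import y" defeats this, since it needs the real
module immediately, which matches the effect of the cmd_dpush line.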

Martin