Forking Bazaar to add Python 3.x support

Stephen J. Turnbull stephen at xemacs.org
Sat Mar 8 03:42:38 UTC 2014


Mark Grandi writes:

 > Well I didn't mean it that way, what I meant was, since windows users
 > rarely have python installed, and the bzr installer includes it's own
 > version of python, whether we use python 2 or 3 does not matter because
 > on windows we control the exact version of python used by bzr. If we
 > did switch, a version of python would have to be decided upon and that
 > would be used to program plugins against and included with installers,
 > etc.

Sure, and the counter is that because of the accumulation of plugins,
many of which seem to be more or less orphaned because both they and
Bazaar itself are mature programs, just porting the plugins is likely
to be a very large job.  It's not such a big deal within Python 2 or
within Python 3 -- both because the plug-in maintainer can often do
the fixes needed for x.y -> x.(y+1) easily, and because often a third
party can and will do them -- but it's likely to block adoption of
Bazaar/Python3 for many users for quite some time because 2 to 3
involves subtleties that might bite.  That would be unacceptable in a
VCS, so either the maintainer (who presumably knows the plugin code
intimately) needs to budget a fair amount of time, or a 3rd party
needs to do a lot of booking up.

 > However, has anyone read the recent 'drama' I would say on the hg
 > mailing list about python 3? They like bazaar use python 2 at the
 > moment and are having trouble porting to py3 because of the changes to
 > strings, where you can't treat strings as bytes, and they were
 > using .format() to construct certain byte strings, and are lamenting
 > that the new bytes type in py3 does not have a format method.

People whine about that a lot, but it's not actually that big a deal
to work around.  The blockers for "working around" seem to be two:
(1) The core Python developers are and have long been divided on the
    subject of adding %-formatting or .format to bytes.  It now seems
    like they're actually going to do the former (PEP 461 has been
    accepted in principle as far as I can tell).  When they do, the
    workaround becomes unnecessary, inefficient, and ugly, so the hg
    developers would have to redo the affected code again.
(2) Extra function calls are ugly and inefficient, and efficiency is
    very important to code implementing wire protocols.  Ugly code is
    also hard to understand and debug, while efficient implementations
    are often tricky algorithms.

 > I personally would be very confused on how format() would work with a
 > bytes type,

There won't be a .format(), most likely.  .format() is very powerful
and highly configurable for generating special features of human-
readable text automatically.  Wire protocols are simpler, have no
Unicode Standard considerations, etc.  So %-formatting was chosen for
PEP 461.  %-formatting becomes polymorphic: bytes % tuple -> bytes,
and str % tuple -> str.  Bytes %-formatting is a tiny bit limited
compared to str (%r is not supported, but you do get %a which serves
the same set of use-cases in a bytes-compatible way), but otherwise
very familiar.
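
For concreteness, here is a small sketch of what bytes %-formatting
looks like.  It assumes an interpreter that implements PEP 461 (Python
3.5 or later), and the header values are made up for illustration.
Note that in the implementation that shipped, feeding a str to %s in a
bytes format raises TypeError rather than encoding implicitly.

```python
# Assumes Python 3.5+ (PEP 461).  All values here are illustrative only.

# bytes % tuple -> bytes
header = b"Content-Type: %s\r\nContent-Length: %d\r\n" % (b"image/jpeg", 1234)

# str % tuple -> str, exactly as before
label = "Content-Type: %s, %d bytes" % ("image/jpeg", 1234)

# %a gives an ASCII-only repr() of the argument, as bytes
quoted = b"%a" % ("caf\u00e9",)

# A str argument to %s in a bytes format is rejected, not silently encoded.
try:
    b"%s" % ("text",)
    str_rejected = False
except TypeError:
    str_rejected = True
```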

 > but I feel like many python programs, treating strings as bytes as
 > wrong from the start, and they should rewrite the code to use the
 > bytes type in py3.

Well, that's not obvious.  HTML (especially as produced by, say,
Microsoft Outlook MUAs) is clearly a wire protocol.  It's not fit for
pig swill, let alone human consumption.  But it's just as obviously
text, really: all of the protocol is human-readable pseudo-English
encoded in ASCII.  So why not do all your composition in str (ie,
Unicode), and just .encode to UTF-8 or whatever when you squirt it out
on the wire?

The problem is that that's fine for HTML, but people also like to do
things like

"""
Content-Type: %s
Content-Length: %d

%s
""" % ('image/jpeg', len(jpeg_bytes), jpeg_bytes)

which (even with PEP 383, which handles malformed text encodings by
encoding the uninterpretable bytes as non-character Unicode surrogate
codes) is a semantic horror.  In the most popular approach (where the
text encoding is ASCII, so that all bytes 128-255 are encoded as
surrogates), this doubles the space requirement (even with PEP 393,
each str is fixed-width, and surrogates require 2 bytes each).  Still,
I would be very tempted by this approach (precisely because of PEP
383).  There's an alternative approach using ISO 8859-1 as the encoding
for bytes objects into str (which in recent Python means an 8-bit
representation of str due to PEP 393), but I wouldn't touch that with
a 10-foot pole -- it's too easy for binary blobs to leak into what
should be human-readable text, and create mayhem.
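
The difference between the two approaches can be sketched as follows.
The four-byte blob is just a stand-in for the start of a JPEG; the
codec and error-handler names are the standard ones.  With
surrogateescape the smuggled bytes round-trip and a strict encoder
refuses to let them out as text, while with Latin-1 they pass as
ordinary characters.

```python
# Stand-in for binary data; any bytes >= 0x80 would do.
blob = b"\xff\xd8\xff\xe0"

# PEP 383: each undecodable byte becomes a lone surrogate U+DC80..U+DCFF.
escaped = blob.decode("ascii", errors="surrogateescape")
roundtrip = escaped.encode("ascii", errors="surrogateescape")

# A strict text encoder refuses lone surrogates, so the blob cannot
# silently escape into human-readable output.
try:
    escaped.encode("utf-8")
    escaped_leaks = True
except UnicodeEncodeError:
    escaped_leaks = False

# Latin-1: every byte maps to a real character, so nothing stops the
# blob from being encoded as if it were text (producing mojibake).
latin = blob.decode("latin-1")
mojibake = latin.encode("utf-8")
```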

As VCSes do a huge amount of network traffic when branching, and large
amounts when doing other remote operations, aside from sending things
that are normally binary blobs, you also want to compress ordinary
text (programs, documentation, and VCS metadata) on the wire.  (It's
not obvious to me that doing that in the VCS makes sense, as most TLS
utilities will offer a compression feature -- just use TLS and turn it
on!  But whatever, and in any case, transmission of blobs is common.)

So you want to use bytes.  Now what?  Well, in current Python 3 you
can't do the kind of thing the above example does, because there's no
formatting for bytes, only concatenation (of bytes) and insertion and
replacement (in bytearrays).  (Of course the example is a strawman,
but there are many structures -- eg, IMAP literals or netstrings --
where backfilling metadata into such formats makes a lot of sense.)
So you'd need to do something like

b"\r\n".join((b"Content-Type: " + type,
              b"Content-Length: " + ("%d" % len(blob)).encode('ascii'),
              b"",
              blob))

where of course you could avoid the .join by building a list of
headers and payloads, then wire.write()'ing them directly.  What you
can't avoid is the mess in the Content-Length header; you can only
encapsulate it.  But encapsulation has its pitfalls, too.  You might
think, "OK,

    def ascify(o):
        return ("%s" % o).encode('ascii')

and I'm done", but you're not -- .encode will signal on many objects
(and as a Mailman developer I can assure you that you *will* run into
those objects in practical use -- non-ASCII in mail headers is a
bugaboo that took a decade to completely deal with, at least I *think*
we've put that issue behind us :-).  So you really need an ascify()
for each type you might use (ascify_integer, for example, is safe,
just return the %d-formatted number), plus a lot of exception handling
for generic codes like "%s" which are very likely to be fed arbitrary
user-provided data.
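
A rough sketch of the per-type idea, for Python 3 without bytes
%-formatting.  Only ascify() and ascify_integer() are named above;
ascify_text(), build_message(), and the backslashreplace policy are
hypothetical choices made for illustration, and a real application
would have to pick its own policy for non-ASCII input.

```python
def ascify_naive(o):
    # The one-liner from above: raises UnicodeEncodeError on non-ASCII.
    return ("%s" % o).encode("ascii")


def ascify_integer(n):
    # Safe: %d on an int yields only ASCII digits and an optional sign.
    return ("%d" % n).encode("ascii")


def ascify_text(s):
    # Hypothetical policy: escape non-ASCII instead of raising.
    try:
        return s.encode("ascii")
    except UnicodeEncodeError:
        return s.encode("ascii", errors="backslashreplace")


def build_message(content_type, blob):
    # Same shape as the join() example, with the mess encapsulated.
    return b"\r\n".join((b"Content-Type: " + ascify_text(content_type),
                         b"Content-Length: " + ascify_integer(len(blob)),
                         b"",
                         blob))


message = build_message("image/jpeg", b"\xff\xd8")
```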

Of course a %-formatter for bytes is subject to the "%s may raise
UnicodeError" issue, but the Mercurial developers and Bazaar
developers are pretty careful.  Where Python makes things easy for
them, they'll probably do a good job of trapping that kind of
exception.  But it's very easy to do a quick design on 'ascify' and
"just get the job done" when you've got this loud Greek chorus in the
back of your head going "Forking python-dev, what the hail were they
thinking?" (repeat ad infinitum).  Speaking for myself, of course, you
may be more disciplined. :-)  And it does add layers, which makes
what's actually happening less transparent (and possibly more buggy).

 > Anyway, anyone have any insights on of bazaar has the same sort of
 > issues?

It most definitely does have the same kind of issues, both on the wire
to the internet, and on the "wire" to the disk (it's often useful to
think of file formats as wire protocols).

I haven't thought about it carefully, but if I were going to design an
approach to this as part of a port of Bazaar to Python 3, I'd think as
follows:

(1) PEP 461 will be approved soon in pretty much its current form, and
    Python 3.5 will get bytes %-formatting.

(2) In fairly short order, PEP 461 will get a pure-Python
    implementation.  (If it doesn't, then it won't be too hard to
    write one good enough for prototyping in terms of str %-formatting
    or maybe .format() -- the PEP 461 API will not change before
    approval IMO, so targeting the current version should be safe.)
    Copy this to Bazaar, and use it to implement bytes
    %-formatting in prototypes of Bazaar/Python3.

(3) Since Python 3.5 should show up around late 2015, I would guess
    the Bazaar port to Python 3 at best will be neck and neck with the
    arrival of "real" bytes %-formatting.  Declare Python 3.5 to be
    the minimum Python version to be supported by Bazaar/Python3.
    Since PEP 461 will be an early arrival in trunk, you should be
    able to work with Python trunk (perhaps specified revisions would
    be a good idea) pretty soon -- step 2 might not be necessary,
    depending on how much Python churn the Bazaar workers can stand.
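
As a sketch of what the stopgap in step 2 might look like, here is a
prototype-quality helper written in terms of str %-formatting.  The
names bformat() and bformat_fallback() are hypothetical, it is nowhere
near a complete PEP 461 implementation (only bytes-like and numeric
arguments are handled, and it is looser than the real thing about
which codes accept which types), and it uses Latin-1 purely as an
internal byte-preserving transport.

```python
import sys


def bformat_fallback(fmt, args):
    """Approximate `fmt % args` for bytes using str %-formatting."""
    if not isinstance(args, tuple):
        args = (args,)
    converted = []
    for arg in args:
        if isinstance(arg, (bytes, bytearray)):
            # Latin-1 maps bytes 0-255 to code points 0-255 one to one.
            converted.append(bytes(arg).decode("latin-1"))
        elif isinstance(arg, (int, float)):
            # Numbers format to pure ASCII with %d, %x, %f and friends.
            converted.append(arg)
        else:
            # Refuse str and arbitrary objects: encode them explicitly.
            raise TypeError("unsupported argument type: %r" % type(arg))
    text = fmt.decode("latin-1") % tuple(converted)
    return text.encode("latin-1")


def bformat(fmt, args):
    # Use the real thing where it exists (assumed: Python 3.5+).
    if sys.version_info >= (3, 5):
        return fmt % args
    return bformat_fallback(fmt, args)


packet = bformat_fallback(b"Content-Length: %d\r\n\r\n%s", (3, b"\xff\xd8\xff"))
```

Once the minimum supported version has native bytes %-formatting, call
sites can drop the helper and use the % operator directly.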

I think the above strategy is pretty obvious; I don't mean to insult
anyone's intelligence by claiming it's highly insightful.  But it's
worth writing down, so there it is. :-)

Steve


