python-3.x?

Stephen J. Turnbull stephen at xemacs.org
Tue Jun 2 13:20:24 BST 2009


Robert Collins writes:

 > > Er, what are you putting in memory, that this matters?
 > 
 > Off the top of my head:
 >  - paths of various sorts - utf8 (what's in the dirstate),
 > "native" disk encoding (utf8 assumed for linux, unicode for windows).
 > unicode normalising file systems get into fun territory here.
 >    + for these the size isn't hugely critical, even a 100K path tree is
 > only going to be what, 20ish MB with 4-byte codepoints.
 >    + converting these to and from native encoding has proven to be a
 > significant bottleneck. Your theories about performance notwithstanding,
 > bzr status would be significantly slower if we didn't keep conversions
 > to a minimum (and on linux with a utf8 locale we avoid converting at
 > all).

Er, my only theories about performance are that it's not going to
improve by much, if at all, in Python 3, and that any performance
bottleneck is unlikely to be in the codecs per se.  If you need to
convert 100K paths, though, that's a pile of function calls, and those
aren't fast in Python.
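
A back-of-the-envelope sketch of that claim (made-up paths; the
absolute numbers will vary by machine and Python version, but the
per-call overhead is the point):

    import timeit

    # 100K short paths, roughly what a big working tree looks like (made-up data)
    paths = [('dir%03d/file%06d.txt' % (i % 500, i)).encode('utf-8')
             for i in range(100000)]

    def decode_all():
        # one codec call per path: for short strings the call overhead,
        # not the codec itself, is where the time goes
        return [p.decode('utf-8') for p in paths]

    print('10 passes over 100K paths: %.3fs' % timeit.timeit(decode_all, number=10))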

 >  - user file content, which ranges in size from 100-byte files to
 > hundreds of MB.

This normally shouldn't be converted except at the user's request
anyway (I'm thinking of a legacy-charset equivalent of EOL translation
here), and users who want that should expect to pay for it.

Diffs and merge presentations are an interesting problem.

It may not much matter, though, since I suspect that for large files
converting to Unicode would be a very small part of the time consumed
if you use a Python diff, and if you use an external diff I assume
you'll handle that by just handing off to the external program.  Even
if you decide to convert in the process of checking out a requested
revision, that's unlikely to be anywhere near as costly as converting
100,000 short strings.  (Anybody who requests a diff that includes
hunks from 100,000 files is going to regret it, anyway....)
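
The hand-off can stay in bytes the whole way; roughly like this
sketch, where the temp-file dance and 'diff -u' are illustrative only:

    import subprocess
    import tempfile

    def external_diff(old_bytes, new_bytes):
        # write the raw bytes out untouched and let the external tool
        # worry about (or ignore) the encoding
        with tempfile.NamedTemporaryFile() as old:
            with tempfile.NamedTemporaryFile() as new:
                old.write(old_bytes)
                old.flush()
                new.write(new_bytes)
                new.flush()
                subprocess.call(['diff', '-u', old.name, new.name])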

 >    + we provide colourisation and syntax highlighting in various tools
 > built on the core bzrlib; for these we can reasonably expect the higher
 > level libraries to start wanting 'text' rather than 'basestring' to work
 > on.

 > Sure, and I think you'll find we do that. We just avoid converting
 > if we don't have to. Internally, in the inventory, we're utf8 all
 > the way.

What do you do if you run into a file name that can't be converted to
UTF-8?  Tell the user to rename it?  Crash? <wink>
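
For what it's worth, the "tell the user" option is only a few lines; a
sketch, with a made-up error message:

    def check_inventory_name(disk_name):
        # disk_name is the raw bytes listdir() handed back; the inventory
        # wants utf-8, so anything undecodable has to be rejected somehow
        try:
            return disk_name.decode('utf-8')
        except UnicodeDecodeError:
            raise ValueError('%r is not valid UTF-8; please rename it'
                             % (disk_name,))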

 > HFS+ is pure evil from bzr's perspective, because 'open()' + 'listdir()'
 > doesn't round trip. John has all the details in his head, if you want
 > more info :).

No, don't need them.  We've got tests in XEmacs for something similar.

 > nondecodable sequences are interesting, but I'm not convinced by PEP383
 > in the context of a VCS.

PEP 383 doesn't make it possible to roundtrip outside of a very
limited context; if you use chr() anywhere (equivalently \u codes)
that could get appended to a filesystem path, you've already voided
the warranty for that process.  However, based on PEP 383 you *can*
build a system that will allow roundtripping in the context of a given
application (including across processes and even systems, as long as
they're all Unicode-compatible).
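
In Python 3 terms that limited roundtrip is PEP 383's 'surrogateescape'
error handler; a minimal sketch:

    raw = b'caf\xe9'                        # latin-1 bytes, not valid utf-8
    name = raw.decode('utf-8', 'surrogateescape')
    assert name == 'caf\udce9'              # the bad byte becomes a lone surrogate
    assert name.encode('utf-8', 'surrogateescape') == raw    # round-trips
    # but hand that string to a strict codec and the warranty is void:
    name.encode('utf-8')                    # raises UnicodeEncodeError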

That doesn't help with applications like make, of course, since the
Makefile probably has them encoded in VSCII or something like that.

 > So far, I've seen few things that would really make our life
 > dramatically easier. 'with' would be great. And a 70% or so decrease
 > in runtime would be nice too.

Not going to happen.  Currently people are happy that 70% of the
benchmarks show little or no increase in runtime.  And I'm pretty sure
you'll get 'with' in 2.7 or failing that 2.8.

But I'll tell you what would make your life dramatically easier: a
couple dozen volunteer hackers writing clear code because that's The
One Obvious Way To Do It.  Not going to happen, either, but we can
dream, can't we?


