bug: files with non-ascii chars?

John Arbash Meinel john at arbash-meinel.com
Wed Jan 10 20:49:00 GMT 2007


Ramon Diaz-Uriarte wrote:
> Dear All,
> 
> This isn't really serious, but it might seem worrisome. In a directory
> where I accidentally named a file
> 
> f3.pngç (the weird char is the "ç").

As near as I can tell, we have found a bug in python. Specifically if we do:

# touch foo, fooç
python
>>> open('foo', 'wb').close()
>>> open(u'foo\xe7', 'wb').close()

LANG=C python
>>> import os
>>> os.listdir(u'.')
[u'foo', 'foo\xc3\xa7']
         ^^- Notice that this is a plain string, not a unicode string.
And it is storing the utf-8 bytestream of the filename, not a unicode
string.

I would have expected that if we had done os.listdir('.'), since the
bytestring form always returns bytestrings. But os.listdir(u'.') is
supposed to return decoded (unicode) strings.
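
For comparison, listing the same directory with a plain-str argument
should give plain bytestrings for both entries (a sketch of the expected
output, not pasted from a real session):

>>> os.listdir('.')
['foo', 'foo\xc3\xa7']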

Now, I guess they are trying to be helpful: they catch the decoding
exception and return *something*. The problem is that during operations
like list.sort(), the plain string gets compared with a unicode string,
and Python implicitly upcasts the plain string to unicode with the ascii
codec, which is what is blowing up.
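
So just sorting what os.listdir(u'.') hands back should reproduce the
failure (a sketch, assuming the mixed directory from above):

names = os.listdir(u'.')   # the mixed list shown above
names.sort()               # comparing u'foo' with 'foo\xc3\xa7' forces an
                           # implicit ascii decode of the plain string, so
                           # this should raise UnicodeDecodeError on 2.4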

> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position
> 6: ordinal not in range(128)
> 
> bzr 0.13.0 on python 2.4.4.candidate.0 (linux2)
> arguments: ['/usr/bin/bzr', 'status']
> 
> 
> Best,
> 
> R.
> 

I'm not sure what we can do about it. The best I can come up with is to
somehow filter the result of 'os.listdir()' and remove anything that is
not Unicode, since we know it will blow up later. Unfortunately that
means we also cannot sort the strings into the final list, because
comparing them against a unicode string will also die.

To prove my point:
>>> u'\xb5' < '\xb5'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)

So what we *could* do is:

# decode plain-str entries with latin-1; unicode entries pass through
children = [(isinstance(x, str) and x.decode('latin-1')) or x
            for x in os.listdir(base)]

The reason to use 'latin-1' is that it has a valid mapping for every
byte, while utf-8 will fail for some byte combinations. If it weren't
for that, I would rather fall back to utf-8. We *could* do a 3-way
fallback, but that starts getting really ugly. Also, on Windows it would
probably be best to use the OEM codepage. But this is only meant to
happen when there is an error, and Windows always uses Unicode for
filenames (going through the 'mbcs' encoding); since we should be using
the Windows Unicode APIs there, this shouldn't really be possible on
Windows anyway.
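
If we did go that route, the fallback chain would look roughly like this
(just a sketch; fallback_decode is a made-up helper, and together with
the filesystem-encoding decode that os.listdir(u'.') already attempted
it forms the 3-way fallback mentioned above):

import os

def fallback_decode(name):
    """Decode a name that the filesystem encoding already rejected.

    Try utf-8 first; fall back to latin-1, which maps every byte and
    therefore can never fail.
    """
    if not isinstance(name, str):
        return name                 # already unicode, leave it alone
    try:
        return name.decode('utf-8')
    except UnicodeDecodeError:
        return name.decode('latin-1')

children = [fallback_decode(x) for x in os.listdir(base)]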


Most likely, that will at least let us properly report that a given file
is not versioned, etc., rather than blowing up.

But it is a pretty ugly hack.

We certainly want to support ignoring files that cannot be represented
in the filesystem encoding (since that does happen). We try pretty hard
not to blow up just because a given file is invalid (we skip sockets,
fifos, etc. for this reason).
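
The check for those is roughly of this shape (just a sketch with a
made-up helper name, not the code we actually have):

import os
import stat

def can_version(path):
    """Made-up helper: reject kinds of file we will never version."""
    mode = os.lstat(path).st_mode
    return not (stat.S_ISSOCK(mode) or stat.S_ISFIFO(mode))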

The other possibility would be to just filter them out of the list, and
work out how to bring them back in. So something like:

children = []
invalid = []

for path in os.listdir(base):
  if isinstance(path, str):
    invalid.append(path)
  else:
    children.append(path)


And then after the current loop, we just add

for path in invalid:
  fk = osutils.file_kind(path)
  yield path, '?', fk, None, fk_entries[fk]()

But that means that the output of WorkingTree.list_files() is no longer
in sorted order. (And it can't be, because there is no sorting between
plain strings and Unicode if the plain string can't be decoded into
Unicode).

Does anyone have a feeling of what might work for us here? My goal is to
somehow mark the file as "illegal" or "unversionable" and just keep going.

John
=:->


