my strategy on implementing line-endings (eol) support

Alexander Belchenko bialix at ukr.net
Thu Apr 3 06:25:44 BST 2008


Mark Hammond wrote:
>>> I don't see the distinction here either.  IIUC, you are going to need
>>> to treat encoded files as characters rather than as bytes - in which
>>> case the distinctions above aren't relevant.  Also, I don't see how the
>>> BOM marker shown in your utf-16 example is relevant.  Are you simply
>>> saying that detecting an appropriate encoding so EOL transformation can
>>> be reliably done is the problem, or is there something else I am
>>> missing here?
>>
>> My bad. It was wrong example. Here is the correct one:
>>
>> In [1]: u'\n'.encode('utf-16-le')
>> Out[1]: '\n\x00'
>>
>> In [2]: u'\r\n'.encode('utf-16-le')
>> Out[2]: '\r\x00\n\x00'
>>
>> In [3]: '\n\x00'.replace('\n', '\r\n')
>> Out[3]: '\r\n\x00'
>>
>> In [4]: '\r\n\x00'.decode('utf-16-le')
>> ---------------------------------------------------------------------------
>> <type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)
> 
> Yes, this is my point - your examples treating things as bytes rather than characters.

Yes, of course. But blindly decoding every bytestream to unicode would be a major slowdown for
operations like status, diff and commit.

>> My example shows that I can't blindly replace '\n' on '\r\n' for utf-16
>> files. So this files required special handling IMO.
> 
> Exactly.  In the more general case, you can't blindly replace '\n' with '\r\n' for any stream of bytes without knowing the encoding of those bytes.  You also can't assume that files are utf16 or anything else (unless you detect a BOM, but that is apparently rare "in the wild"). It might be possible to "normalize" the stored stream (and therefore you could make assumptions about that), but I don't see how you can make any assumptions about bytes checked in by the user. You might need another property for the encoding of the file, which defaults to "ascii", so you can decode your bytes, apply your transformation, then optionally re-encode them the way they came in - plus a sane strategy for dealing with decoding errors.  Obviously heuristics built around the many existing conventions for declaring a source file encoding could also be used, but you still need to handle the "unknown encoding" case.

I'm planning to let the user decide which encoding the content has and then set an appropriate
file property, called "encoding".
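
A minimal sketch of how such a property could drive the conversion, written in Python 3 for
illustration. The helper name convert_eol and its parameters are hypothetical and not part of
bzr; the "ascii" default and the decode, transform, re-encode order follow Mark's suggestion
above. Leaving undecodable content untouched is only one possible strategy for decoding errors.

```python
def convert_eol(data: bytes, encoding: str = "ascii", eol: str = "\r\n") -> bytes:
    """Convert LF and CRLF line endings in `data` to `eol`.

    `encoding` stands in for the value of the per-file "encoding" property
    (hypothetical here). The bytes are decoded to characters first, so
    multi-byte encodings such as utf-16-le are handled correctly.
    """
    try:
        text = data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown encoding: leave the content untouched
        # rather than risk corrupting it.
        return data
    # Normalize CRLF to LF first so existing CRLF pairs are not doubled.
    text = text.replace("\r\n", "\n").replace("\n", eol)
    return text.encode(encoding)


# The utf-16-le case from the session above now round-trips cleanly.
converted = convert_eol("a\nb".encode("utf-16-le"), encoding="utf-16-le")
print(converted.decode("utf-16-le") == "a\r\nb")
```

Note that the property would need to name an explicit byte order ("utf-16-le" or "utf-16-be"):
Python's plain "utf-16" codec strips the BOM on decode and writes one in the platform's native
byte order on encode, so a big-endian file would not come back byte-for-byte the way it came in.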



