my strategy on implementing line-endings (eol) support
Alexander Belchenko
bialix at ukr.net
Thu Apr 3 05:27:08 BST 2008
Mark Hammond writes:
> Hi Alexander,
>
>> Nicholas Allen writes:
>>> |
>>> | In my opinion, there are 4 types of files:
>>> |
>>> | 1) binary files
>>> | 2) text files with exact line-endings
>>> | 3) text files with native/LF/CRLF/CR line-endings
>
> Actually, I've never understood (3) - which is also apparently what subversion does. To my mind, a text file either has EOL left alone (ie, "exact") or has EOL style set to native (where line ends are transformed).
>
> Is there a use-case for saying a file *must* have (say) '\r' (or even '\n') markers? I understand that an editor may accidentally change them, but that is also true for files marked as "exact-EOL" (ie, those never transformed), and no less damaging.
I have the following use case: a developer works on a Python script on Windows and then, for
testing, simply copies it via sftp/ssh/samba/whatever to Linux. The executable bit is set, so he
runs the script straight from the command line, e.g.:
./myscript.py
and gets an error about an incorrect interpreter.
His script has a shebang at the start of the file, i.e.
#!/usr/bin/python
but the script won't start. Why?
Because the shebang line ends with a \r character. Yep.
I've stepped into this many times. Support for native EOL is a must-have, and most people need
only native, I believe. But once I have native EOL, implementing support for the others is a
trivial task.
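To illustrate, here is a minimal sketch of what native EOL support could look like (hypothetical
helper names, my own illustration, not bzr's actual API): text is stored internally with LF only,
normalized on the way in and converted to the platform convention on the way out:

import os

def to_internal(content):
    # Normalize CRLF and bare CR to LF before storing.
    return content.replace('\r\n', '\n').replace('\r', '\n')

def to_native(content):
    # Convert internally-stored LF to the platform convention on checkout.
    # Assumes content came from to_internal(), i.e. contains no CR.
    return content.replace('\n', os.linesep)

With such a scheme the working-tree copy on Windows ends lines with \r\n, but a checkout on
Linux gets plain \n, so the shebang line stays valid there.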
>
>>> | 4) unicode text files similar to 3.
>>> Isn't there just 2 types of files (binary and text)? 4 above is just a
>>> text file with encoding set to unicode. So I think file encoding needs
>>> to be another property (UTF8, ASCII, unicode etc).
>> From the EOL-conversion point of view it's not:
>>
>> In [1]: u'\n'.encode('utf-16-le')
>> Out[1]: '\n\x00'
>>
>> In [2]: u'\n'.encode('utf-16-be')
>> Out[2]: '\x00\n'
>>
>> In [3]: u'\n'.encode('utf-16')
>> Out[3]: '\xff\xfe\n\x00'
>
> I don't see the distinction here either. IIUC, you are going to need to treat encoded files as characters rather than as bytes - in which case the distinctions above aren't relevant. Also, I don't see how the BOM marker shown in your utf-16 example is relevant. Are you simply saying that detecting an appropriate encoding so EOL transformation can be reliably done is the problem, or is there something else I am missing here?
My bad. That was the wrong example. Here is the correct one:
In [1]: u'\n'.encode('utf-16-le')
Out[1]: '\n\x00'
In [2]: u'\r\n'.encode('utf-16-le')
Out[2]: '\r\x00\n\x00'
In [3]: '\n\x00'.replace('\n', '\r\n')
Out[3]: '\r\n\x00'
In [4]: '\r\n\x00'.decode('utf-16-le')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'> Traceback (most recent call last)
c:\work\<ipython console> in <module>()
C:\Python25\lib\encodings\utf_16_le.py in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_16_le_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
<type 'exceptions.UnicodeDecodeError'>: 'utf16' codec can't decode byte 0x00 in position 2:
truncated data
My example shows that I can't blindly replace '\n' with '\r\n' in UTF-16 files. So these files
require special handling, IMO.
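To make the point concrete, here is a minimal sketch (my own illustration, not a proposed
implementation) of the special handling I mean: decode to unicode first, convert EOLs on
characters, then re-encode:

def convert_eol(data, encoding, eol=u'\r\n'):
    # EOL conversion for encoded files must work on characters, not bytes.
    text = data.decode(encoding)          # bytes -> unicode characters
    text = text.replace(u'\r\n', u'\n')   # normalize any existing CRLF first
    return text.replace(u'\n', eol).encode(encoding)

In [5]: convert_eol('\n\x00', 'utf-16-le')
Out[5]: '\r\x00\n\x00'

In [6]: convert_eol('\n\x00', 'utf-16-le').decode('utf-16-le')
Out[6]: u'\r\n'

Unlike the byte-level replace above, the result is still valid UTF-16-LE.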