my strategy on implementing line-endings (eol) support
Mark Hammond
mhammond at skippinet.com.au
Wed Apr 2 23:23:29 BST 2008
Hi Alexander,
> Nicholas Allen writes:
> >
> > |
> > | In my view there are 4 types of files:
> > |
> > | 1) binary files
> > | 2) text files with exact line-endings
> > | 3) text files with native/LF/CRLF/CR line-endings
Actually, I've never understood (3) - which is also apparently what subversion does. To my mind, a text file either has EOL left alone (ie, "exact") or has EOL style set to native (where line ends are transformed).
Is there a use-case for saying a file *must* have (say) '\r' (or even '\n') markers? I understand that an editor may accidentally change them, but that is equally true for files marked as "exact-EOL" (ie, those never transformed), and no less damaging.
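To illustrate the two behaviours Mark does understand, here is a minimal sketch of an "exact" vs "native" checkout transform. This is a hypothetical helper for illustration only, not bzr's actual code; it assumes stored content is LF-normalized.

```python
import os


def checkout_text(content: bytes, eol_style: str) -> bytes:
    """Transform stored (LF-normalized) content for checkout.

    'exact'  -> bytes are written out untouched.
    'native' -> LF markers become the platform's line ending.

    Hypothetical sketch; not bzr's real implementation.
    """
    if eol_style == "exact":
        return content
    if eol_style == "native":
        return content.replace(b"\n", os.linesep.encode("ascii"))
    raise ValueError("unknown eol style: %r" % eol_style)
```

On Windows (where os.linesep is '\r\n') the "native" branch expands every LF; on Unix it is a no-op.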
> > | 4) unicode text files similar to 3.
> > Isn't there just 2 types of files (binary and text)? 4 above is just a
> > text file with encoding set to unicode. So I think file encoding needs
> > to be another property (UTF8, ASCII, unicode etc).
>
> From eol-conversion point of view it's not:
>
> In [1]: u'\n'.encode('utf-16-le')
> Out[1]: '\n\x00'
>
> In [2]: u'\n'.encode('utf-16-be')
> Out[2]: '\x00\n'
>
> In [3]: u'\n'.encode('utf-16')
> Out[3]: '\xff\xfe\n\x00'
I don't see the distinction here either. IIUC, you are going to need to treat encoded files as characters rather than as bytes - in which case the distinctions above aren't relevant. Also, I don't see how the BOM marker shown in your utf-16 example is relevant. Are you simply saying that detecting an appropriate encoding so EOL transformation can be reliably done is the problem, or is there something else I am missing here?
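The "treat encoded files as characters rather than as bytes" approach can be sketched as: decode, transform EOLs on the unicode text, re-encode. The multi-byte patterns shown above ('\n\x00', '\x00\n', BOM-prefixed) all collapse to a plain '\n' after decoding, and a BOM-producing codec such as 'utf-16' restores the BOM on encode. The helper name and the LF-normalization step are my own assumptions for the sketch:

```python
def convert_eol(data: bytes, encoding: str, new_eol: str = "\r\n") -> bytes:
    """Convert line endings by working on characters, not bytes.

    Hypothetical sketch: decoding first makes the EOL transform
    independent of the byte-level encoding details.
    """
    text = data.decode(encoding)  # a BOM is consumed by e.g. the 'utf-16' codec
    # Normalize any existing EOL style to LF, then apply the target style.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", new_eol).encode(encoding)
```

Note that encoding back with 'utf-16' re-emits the BOM, so the round trip preserves it without the EOL code ever seeing it.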
Thanks,
Mark
More information about the bazaar mailing list