my strategy on implementing line-endings (eol) support
Mark Hammond
mhammond at skippinet.com.au
Wed Apr 2 23:23:29 BST 2008
Hi Alexander,
> Nicholas Allen writes:
> >
> > |
> > | In my view there are 4 types of files:
> > |
> > | 1) binary files
> > | 2) text files with exact line-endings
> > | 3) text files with native/LF/CRLF/CR line-endings
Actually, I've never understood (3) - which is also apparently what subversion does. To my mind, a text file either has EOL left alone (ie, "exact") or has EOL style set to native (where line ends are transformed).
Is there a use-case for saying a file *must* have (say) '\r' (or even '\n') markers? I understand that an editor may accidentally change them, but that is equally true for files marked as "exact-EOL" (ie, those never transformed), and no less damaging.
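To illustrate the two behaviours Mark does understand, here is a minimal sketch of an "exact" vs "native" checkout transform. This is a hypothetical helper for illustration only, not bzr's actual code; it assumes stored content is LF-normalized.

```python
import os


def checkout_text(content: bytes, eol_style: str) -> bytes:
    """Transform stored (LF-normalized) content for checkout.

    'exact'  -> bytes are written out untouched.
    'native' -> LF markers become the platform's line ending.

    Hypothetical sketch; not bzr's real implementation.
    """
    if eol_style == "exact":
        return content
    if eol_style == "native":
        return content.replace(b"\n", os.linesep.encode("ascii"))
    raise ValueError("unknown eol style: %r" % eol_style)
```

On Windows (where os.linesep is '\r\n') the "native" branch expands every LF; on Unix it is a no-op.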
> > | 4) unicode text files similar to 3.
> > Isn't there just 2 types of files (binary and text)? 4 above is just a
> > text file with encoding set to unicode. So I think file encoding needs
> > to be another property (UTF8, ASCII, unicode etc).
>
> From eol-conversion point of view it's not:
>
> In [1]: u'\n'.encode('utf-16-le')
> Out[1]: '\n\x00'
>
> In [2]: u'\n'.encode('utf-16-be')
> Out[2]: '\x00\n'
>
> In [3]: u'\n'.encode('utf-16')
> Out[3]: '\xff\xfe\n\x00'
I don't see the distinction here either. IIUC, you are going to need to treat encoded files as characters rather than as bytes - in which case the distinctions above aren't relevant. Also, I don't see how the BOM marker shown in your utf-16 example is relevant. Are you simply saying that detecting an appropriate encoding so EOL transformation can be reliably done is the problem, or is there something else I am missing here?
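The "treat encoded files as characters rather than as bytes" approach can be sketched as: decode, transform EOLs on the unicode text, re-encode. The multi-byte patterns shown above ('\n\x00', '\x00\n', BOM-prefixed) all collapse to a plain '\n' after decoding, and a BOM-producing codec such as 'utf-16' restores the BOM on encode. The helper name and the LF-normalization step are my own assumptions for the sketch:

```python
def convert_eol(data: bytes, encoding: str, new_eol: str = "\r\n") -> bytes:
    """Convert line endings by working on characters, not bytes.

    Hypothetical sketch: decoding first makes the EOL transform
    independent of the byte-level encoding details.
    """
    text = data.decode(encoding)  # a BOM is consumed by e.g. the 'utf-16' codec
    # Normalize any existing EOL style to LF, then apply the target style.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", new_eol).encode(encoding)
```

Note that encoding back with 'utf-16' re-emits the BOM, so the round trip preserves it without the EOL code ever seeing it.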
Thanks,
Mark
More information about the bazaar mailing list