my strategy on implementing line-endings (eol) support
John Arbash Meinel
john at arbash-meinel.com
Thu Apr 3 10:37:57 BST 2008
Mark Hammond wrote:
| Hi Alexander,
|> Nicholas Allen wrote:
|>> | In my opinion there are 4 types of files:
|>> | 1) binary files
|>> | 2) text files with exact line-endings
|>> | 3) text files with native/LF/CRLF/CR line-endings
| Actually, I've never understood (3) - which is also apparently what Subversion
| does. To my mind, a text file either has EOL left alone (i.e., "exact") or has
| EOL style set to native (where line ends are transformed).
| Is there a use-case for saying a file *must* have (say) '\r' (or even '\n')
| markers? I understand that an editor may accidentally change them, but that is
| also true for files marked as "exact-EOL" (i.e., those never transformed), and no
Visual Studio 5 or so required that its workspace control files use \r\n. With
any other form they wouldn't load. So even when checking out on Linux it can be
useful to enforce the line ending, because then if the files were copied to
another system, they would still have the right endings. (The other files didn't
matter because the compiler didn't care about line endings.)
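
To make the three behaviours concrete (left alone, native, or one fixed ending
enforced everywhere), here is a minimal Python 3 sketch. The style names and the
`apply_eol` helper are hypothetical, made up for illustration; this is not an
existing bzr or svn API, and it assumes an ASCII-compatible encoding such as
ASCII or UTF-8.

```python
import os

# Hypothetical style names for illustration; not a real bzr or svn API.
FIXED_EOL = {'LF': b'\n', 'CRLF': b'\r\n', 'CR': b'\r'}


def apply_eol(content: bytes, style: str) -> bytes:
    """Return file content as it would be written to the working tree."""
    if style == 'exact':
        # Never transformed: bytes go out exactly as stored.
        return content
    if style == 'native':
        # Whatever the checkout platform uses.
        target = os.linesep.encode('ascii')
    else:
        # One fixed ending, enforced on every platform.
        target = FIXED_EOL[style]
    # Normalize to LF first so files with mixed endings convert cleanly.
    normalized = content.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
    return normalized.replace(b'\n', target)
```

With something like this, a workspace file marked 'CRLF' would come out with
\r\n even on a Linux checkout, while a file marked 'exact' is never touched.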
That is the only case *I* know of. But I assume there could be other tools
(possibly on multiple platforms) that Bork unless you have exactly the expected
|>> | 4) unicode text files similar to 3.
|>> Aren't there just 2 types of files (binary and text)? 4 above is just a
|>> text file with encoding set to unicode. So I consider file encoding
|>> to be another property (UTF8, ASCII, unicode etc).
|> From an eol-conversion point of view it's not:
|> In : u'\n'.encode('utf-16-le')
|> Out: '\n\x00'
|> In : u'\n'.encode('utf-16-be')
|> Out: '\x00\n'
|> In : u'\n'.encode('utf-16')
|> Out: '\xff\xfe\n\x00'
| I don't see the distinction here either. IIUC, you are going to need to treat
| encoded files as characters rather than as bytes - in which case the
| distinctions above aren't relevant. Also, I don't see how the BOM marker shown
| in your utf-16 example is relevant. Are you simply saying that detecting an
| appropriate encoding so EOL transformation can be reliably done is the problem,
| or is there something else I am missing here?
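
The characters-versus-bytes point can be sketched as follows. This is Python 3
(the session quoted above is Python 2), `convert_eol` is a hypothetical helper,
and it assumes the encoding is already known, which is exactly the detection
problem raised above.

```python
def convert_eol(data: bytes, encoding: str, newline: str) -> bytes:
    """Decode, convert line endings on characters, then re-encode."""
    text = data.decode(encoding)
    # Collapse CRLF and lone CR to LF, then expand to the requested ending.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    if newline != '\n':
        text = text.replace('\n', newline)
    return text.encode(encoding)


utf16_le = 'a\nb'.encode('utf-16-le')

# Byte-level replacement inserts a single b'\r' byte in front of the
# two-byte newline, so the 16-bit code units after it are misaligned.
naive = utf16_le.replace(b'\n', b'\r\n')

# Character-level conversion keeps every code unit two bytes wide.
converted = convert_eol(utf16_le, 'utf-16-le', '\r\n')
```

Here `naive` ends up with an odd number of bytes and is no longer valid
UTF-16, while `converted` decodes back to 'a\r\nb'. For plain 'utf-16' the BOM
is consumed on decode and written again on encode, so it does not need special
handling at the character level.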
There is at least one step further, which is that Unicode specifies a new
control character for end-of-line. I don't remember what it is, just that \r
and \n aren't the only end-of-line characters anymore.
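
For reference, the character meant here is most likely U+2028 LINE SEPARATOR
(that identification is an assumption, not something stated above); Unicode
also defines U+2029 PARAGRAPH SEPARATOR and U+0085 NEXT LINE. A small Python 3
sketch of how this shows up in practice:

```python
import re

# Line boundaries Unicode defines beyond '\r' and '\n'.
NEL = '\u0085'  # NEXT LINE
LS = '\u2028'   # LINE SEPARATOR
PS = '\u2029'   # PARAGRAPH SEPARATOR

sample = 'one\ntwo\r\nthree' + LS + 'four' + PS + 'five' + NEL + 'six'

# str.splitlines() treats all of these as line boundaries.
unicode_lines = sample.splitlines()

# Splitting on CR/LF only leaves the Unicode-only separators inside a line.
ascii_lines = re.split(r'\r\n|\r|\n', sample)
```

`unicode_lines` has six entries, while `ascii_lines` has only three, so any
eol conversion that only looks for \r and \n will not see those extra line
boundaries.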