my strategy on implementing line-endings (eol) support

John Arbash Meinel john at arbash-meinel.com
Thu Apr 3 10:37:57 BST 2008


Mark Hammond wrote:
| Hi Alexander,
|
|>> Nicholas Allen wrote:
|>> |
|>> | In my conviction there are 4 types of files:
|>> |
|>> | 1) binary files
|>> | 2) text files with exact line-endings
|>> | 3) text files with native/LF/CRLF/CR line-endings
|
| Actually, I've never understood (3) - which is also apparently what subversion
| does.  To my mind, a text file either has EOL left alone (ie, "exact") or has
| EOL style set to native (where line ends are transformed).
|
| Is there a use-case for saying a file *must* have (say) '\r' (or even '\n')
| markers?  I understand that an editor may accidentally change them, but that is
| also true for files marked as "exact-EOL" (ie, those never transformed), and no
| less damaging.

Visual Studio 5 or so required that its workspace control files use \r\n; with
any other form they wouldn't load. So even when checking out on Linux it can be
useful to enforce the line ending, because then if the files were copied to
another system they would still have the right endings. (The other files didn't
matter because the compiler didn't care about line endings.)

That is the only case *I* know of, but I assume there could be other tools
(possibly on multiple platforms) that bork unless you have exactly the expected
line endings.
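
For what it's worth, here is a minimal sketch of what enforcing a fixed EOL on
output could look like. This is only an illustration, not anything in bzrlib;
the helper name is made up, and it assumes plain byte-oriented text:

    # Hypothetical helper: collapse whatever line endings the stored text
    # has down to '\n', then expand to the ending we want to enforce.
    def convert_eol(text, eol='\r\n'):
        normalized = text.replace('\r\n', '\n').replace('\r', '\n')
        return normalized.replace('\n', eol)

    # e.g. writing out a Visual Studio workspace file on a Linux checkout:
    #   open(path, 'wb').write(convert_eol(stored_content, eol='\r\n'))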

|
|>> | 4) unicode text files similar to 3.
|>> Aren't there just 2 types of files (binary and text)? 4 above is just a
|>> text file with encoding set to unicode. So I think file encoding needs
|>> to be another property (UTF8, ASCII, unicode etc).
|>  From eol-conversion point of view it's not:
|>
|> In [1]: u'\n'.encode('utf-16-le')
|> Out[1]: '\n\x00'
|>
|> In [2]: u'\n'.encode('utf-16-be')
|> Out[2]: '\x00\n'
|>
|> In [3]: u'\n'.encode('utf-16')
|> Out[3]: '\xff\xfe\n\x00'
|
| I don't see the distinction here either.  IIUC, you are going to need to treat
| encoded files as characters rather than as bytes - in which case the
| distinctions above aren't relevant.  Also, I don't see how the BOM marker shown
| in your utf-16 example is relevant.  Are you simply saying that detecting an
| appropriate encoding so EOL transformation can be reliably done is the problem,
| or is there something else I am missing here?
|
| Thanks,
|
| Mark

There is at least one step further, which is that Unicode specifies new control
characters for end-of-line. I don't remember exactly which they are, just that
\r and \n aren't the only end-of-line characters anymore.
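
To make the characters-vs-bytes point concrete, here is a rough sketch of why
the conversion has to happen after decoding. The function name is made up for
illustration, not an existing bzr API. (Python's unicode splitlines() treats
U+0085 NEL, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR as line
breaks, so presumably those are the characters I was thinking of.)

    # Hypothetical: convert line endings in an encoded file by decoding to
    # characters first, so a utf-16 '\n\x00' never gets mangled byte-wise.
    def convert_encoded_eol(data, encoding, eol=u'\n'):
        text = data.decode(encoding)
        # splitlines() breaks on \r, \n and \r\n, and also on the extra
        # Unicode line separators (U+0085, U+2028, U+2029).
        lines = text.splitlines()
        # A real implementation would also have to preserve a trailing
        # newline; splitlines() drops it.
        return eol.join(lines).encode(encoding)

    In [4]: convert_encoded_eol('\xff\xfea\x00\n\x00b\x00', 'utf-16', u'\r\n')
    Out[4]: '\xff\xfea\x00\r\x00\n\x00b\x00'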

John
=:->



