line endings

Thu Jan 31 16:06:06 GMT 2008

Alexander Belchenko wrote:
> If you're using 1-byte encodings (including utf-8)
> the problem with line-endings pretty simple.
> It's always  \r\n or \n (CRLF or LF).
> 
> But for 2-bytes unicode encodings like UTF-16
> (and I think it's true for 4-bytes UTF-32 as well)
> line-endings becomes more complex, i.e. for UTF16-LE
> 
> \r\0\n\0 and \n\0 (CRLF or LF).
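
To make the quoted byte sequences concrete, here is a small sketch
using Python's built-in codecs.  The helper name eol_bytes is made up
for illustration; only str.encode and the standard codec names are real.

```python
def eol_bytes(encoding):
    """Return the (CRLF, LF) byte sequences for the given codec name."""
    return "\r\n".encode(encoding), "\n".encode(encoding)


if __name__ == "__main__":
    # The -le/-be codec names are used so that no BOM is prepended.
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
        crlf, lf = eol_bytes(name)
        print(name, "CRLF:", crlf, "LF:", lf)
```

For utf-16-le this gives the \r\0\n\0 and \n\0 forms quoted above; the
BE form simply swaps each byte pair.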

If the eol conversion issue is handled by explicitly
enumerating the files that need it, then there is no
problem.  (Technically, anyway.  I would not like this
approach if it were the only option, because the
common case (a text file) would require extra work, and
lists like this are hard to keep in sync with the project.)

If it is handled by enumerating the (binary) files that
don't, then this is pretty easy to detect, yes?
(But this has the same usability problem as above,
although perhaps to a lesser degree.  The encoding of
text files wouldn't be known, but common cases like
UTF-16, etc. can be fairly reliably detected, yes?)

If a heuristic is used to make the identification,
then again it is pretty easy to detect.  Perhaps:
   file has '\0's?
   no: is text, do eol conversion
   yes:
     are all occurrences of the form above or its BE form?
     no: binary, no eol conversion
     yes: text, do eol conversion
Probably the occurrence of a BOM should also be taken
into account.  And of course there needs to be a list or
some other way to specify overrides.
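
A rough Python sketch of one possible reading of this heuristic
follows.  The function names are made up for illustration, and the
NUL test is my interpretation of "the form above": without a BOM, all
NUL bytes must sit at odd offsets (LE) or all at even offsets (BE).
That is deliberately conservative.  UTF-16 text containing a character
such as U+0100 would be classed as binary, and UTF-32 is not handled
at all, which is where an override list would come in.

```python
import codecs


def utf16_byte_order(data):
    """Return 'utf-16-le' or 'utf-16-be' if data looks like UTF-16 text, else None."""
    if len(data) % 2:
        return None  # UTF-16 is made of 2-byte code units
    if data.startswith(codecs.BOM_UTF16_LE):
        name = "utf-16-le"
    elif data.startswith(codecs.BOM_UTF16_BE):
        name = "utf-16-be"
    else:
        # No BOM: guess from where the NUL bytes fall (assumed rule, see above).
        offsets = {i % 2 for i, byte in enumerate(data) if byte == 0}
        if offsets == {1}:
            name = "utf-16-le"
        elif offsets == {0}:
            name = "utf-16-be"
        else:
            return None
    try:
        text = data.decode(name)
    except UnicodeDecodeError:
        return None  # e.g. unpaired surrogates
    # A real U+0000 character is taken as a sign of binary data.
    return None if "\0" in text else name


def looks_like_text(data):
    """Guess whether eol conversion should be applied to data (bytes)."""
    if b"\0" not in data:
        return True  # no NULs: assume a 1-byte encoding such as UTF-8
    return utf16_byte_order(data) is not None
```

For example, looks_like_text("hi\r\n".encode("utf-16-le")) is true,
while a PNG header such as b"\x89PNG\x00\x00\x01\x00" is rejected
because its NULs fall at both odd and even offsets.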

But this is all prior art, isn't it?  (TortoiseCVS, svn,
mercurial, others?)  Have there been significant problems?