my strategy on implementing line-endings (eol) support

Thu Apr 3 06:06:12 BST 2008

> > Actually, I've never understood (3) - which is also apparently what
> > subversion does.  To my mind, a text file either has EOL left alone
> > (ie, "exact") or has EOL style set to native (where line ends are
> > transformed).
> >
> > Is there a use-case for saying a file *must* have (say) '\r' (or even
> > '\n') markers?  I understand that an editor may accidentally change them,
> > but that is also true for files marked as "exact-EOL" (ie, those never
> > transformed), and no less damaging.
> 
> I have the following use case: a developer works on a Python script on
> Windows and then, for testing, simply copies it via
> sftp/ssh/samba/whatever to Linux. The executable bit is set.
> He runs the script from the command line, e.g.:
> ./myscript.py
> and gets an error about an incorrect interpreter.
> His script has a shebang at the start of the file, i.e.
> #!/usr/bin/python
> but the script won't start. Why?
> Because the shebang line ends with a \r character. Yep.
> 
> I have run into this many times.

Yes, me too.  I fully understand the motivation and I think everyone is in violent agreement about how important native support is, and how evil mixing eols is.

The way I see it, if you check out on Windows and copy the distro to *nix (or vice-versa), then yes, the shell scripts do indeed still work - but the rest of the distribution is still somewhat "broken" - the only real difference is that the tools used for the rest of the distribution *generally* don't care about line endings, while shell scripts are picky.  So while I agree the scale of the problem is worse (ie, in practice, the shell scripts fail to work whereas other files usually do) I still fail to see the fundamental difference between the situations.

But please don't let me waste any more bandwidth on this issue.

> Support for native eol is a must-have,
> and most people need only native, I believe.
> But once I have native eol, implementing support for the others
> is a trivial task.

Yeah, I see now that as a normalized copy of the file is stored in the repository, these are all needed.

> > I don't see the distinction here either.  IIUC, you are going to need
> > to treat encoded files as characters rather than as bytes - in which
> > case the distinctions above aren't relevant.  Also, I don't see how the
> > BOM marker shown in your utf-16 example is relevant.  Are you simply
> > saying that detecting an appropriate encoding so EOL transformation can
> > be reliably done is the problem, or is there something else I am
> > missing here?
> 
> My bad. It was wrong example. Here is the correct one:
> 
> In [1]: u'\n'.encode('utf-16-le')
> Out[1]: '\n\x00'
> 
> In [2]: u'\r\n'.encode('utf-16-le')
> Out[2]: '\r\x00\n\x00'
> 
> In [3]: '\n\x00'.replace('\n', '\r\n')
> Out[3]: '\r\n\x00'
> 
> In [4]: '\r\n\x00'.decode('utf-16-le')
> ---------------------------------------------------------------------------
> <type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

Yes, this is my point - your examples treat things as bytes rather than characters.
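
To make the distinction concrete, here is a minimal sketch of the two approaches side by side. It is written for Python 3 rather than the Python 2 session quoted above, and the function names crlf_bytewise and crlf_charwise are made up for illustration only, not anything that exists in the tool itself.

```python
def crlf_bytewise(data: bytes) -> bytes:
    """Naive approach: replace the LF byte wherever it appears."""
    return data.replace(b"\n", b"\r\n")


def crlf_charwise(data: bytes, encoding: str) -> bytes:
    """Decode to characters, transform, then re-encode the same way."""
    text = data.decode(encoding)
    return text.replace("\n", "\r\n").encode(encoding)


sample = "hi\n".encode("utf-16-le")

# The byte-level version inserts a lone 0x0D byte, so the result no
# longer consists of whole two-byte UTF-16 code units.
broken = crlf_bytewise(sample)

# The character-level version inserts a complete '\r' code unit.
good = crlf_charwise(sample, "utf-16-le")
```

Note that the byte-level result does not always fail to decode. With an even number of line ends the length stays even and the data decodes "successfully" into the wrong characters, which is arguably worse than an exception.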

> My example shows that I can't blindly replace '\n' with '\r\n' for utf-16
> files. So these files require special handling IMO.

Exactly.  In the more general case, you can't blindly replace '\n' with '\r\n' for any stream of bytes without knowing the encoding of those bytes.  You also can't assume that files are utf16 or anything else (unless you detect a BOM, but that is apparently rare "in the wild"). It might be possible to "normalize" the stored stream (and therefore you could make assumptions about that), but I don't see how you can make any assumptions about bytes checked in by the user. You might need another property for the encoding of the file, which defaults to "ascii", so you can decode your bytes, apply your transformation, then optionally re-encode them the way they came in - plus a sane strategy for dealing with decoding errors.  Obviously heuristics built around the many existing conventions for declaring a source file encoding could also be used, but you still need to handle the "unknown encoding" case.
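
Something along these lines is what I have in mind. It is only a sketch under stated assumptions: the function name convert_eol, its "encoding" argument (standing in for the per-file property, defaulting to "ascii") and the choice to leave undecodable bytes untouched are all my own inventions for illustration, not an existing API. The BOM check is a simple heuristic and would be fooled by some unusual inputs.

```python
import codecs

# Checked in this order because the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def convert_eol(data: bytes, eol: str, encoding: str = "ascii") -> bytes:
    """Rewrite every line ending in data to eol, working on characters.

    A BOM, when present, overrides the declared encoding and is kept.
    Bytes that cannot be decoded are returned unchanged, i.e. the file
    is treated as "exact" (an assumed policy for the unknown case).
    """
    bom = b""
    for mark, bom_encoding in _BOMS:
        if data.startswith(mark):
            bom, encoding = mark, bom_encoding
            break
    try:
        text = data[len(bom):].decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return data  # unknown or wrong encoding: leave the bytes alone
    # Normalize to '\n' first so mixed line ends come out consistent.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return bom + normalized.replace("\n", eol).encode(encoding)
```

For example, convert_eol(b"a\nb\r\n", "\r\n") gives b"a\r\nb\r\n", while a Latin-1 file checked in without any encoding property fails the ASCII decode and comes back byte-for-byte identical.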

Cheers,

Mark