line endings

Thu Jan 31 21:46:35 GMT 2008

Stuart McGraw <smcg4191 at frii.com> wrote:
> Alexander Belchenko wrote:
> > If you're using 1-byte encodings (including utf-8)
> > the problem with line-endings is pretty simple.
> > It's always \r\n or \n (CRLF or LF).
> > 
> > But for 2-byte Unicode encodings like UTF-16
> > (and I think it's true for 4-byte UTF-32 as well)
> > line-endings become more complex, e.g. for UTF-16LE:
> > 
> > \r\0\n\0 and \n\0 (CRLF or LF).
> 
> If the eol conversion issue is handled by explicitly
> enumerating the files that need it, then there is no
> problem.  (Technically anyway.  I would not like this
> approach if it were the only option, because the
> common case (text files) requires extra work and it is
> hard to keep lists like this in sync with the project.)

I would _strongly_ recommend having metadata support in the server for
stuff like this. There's no way people would keep separate lists of
which files are of which types.

> If it is handled by enumerating the (binary) files that
> don't, then this is pretty easy to detect, yes?
> (But this has the same usability problem as above,
> although perhaps to a lesser degree.  Encoding of
> text files wouldn't be known but common cases like
> utf16, etc can be fairly reliably detected, yes?)
> 
> If a heuristic is used [...]

The server should support some kind of file type metadata, e.g. a MIME
Content-Type as per RFC 2045. That would make it possible to write
clients in which this property is set on an as-needed basis (opt-in).
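
As a rough illustration of why such a property helps with the UTF-16
case quoted above, here is a minimal Python sketch. It assumes the
property holds a full Content-Type value including a charset parameter;
the function names parse_content_type and convert_eol are hypothetical
and not part of any existing client or server.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type value into (mime type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def convert_eol(data, content_type, eol="\n"):
    """Normalise CRLF/LF line endings for text/* content with a known charset.

    Anything else (binary types, or text without a charset) is returned
    unchanged, so no bytes are ever rewritten on a guess.
    """
    mime, charset = parse_content_type(content_type)
    if not mime.startswith("text/") or charset is None:
        return data
    # Decode first so that \r and \n are matched as characters,
    # not as single bytes inside a 2-byte or 4-byte code unit.
    text = data.decode(charset)
    text = text.replace("\r\n", "\n").replace("\n", eol)
    return text.encode(charset)
```

With "text/plain; charset=utf-16-le" the conversion works on decoded
characters, so \r\0\n\0 is handled the same way as \r\n is in a 1-byte
encoding, and a file marked "application/octet-stream" is left alone.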

Now, whether the client sets this property by default to
"octet-stream", uses libmagic (like the unix "file" utility), or applies
some other "works-95%-of-the-time" heuristic is a very minor detail
IMO, as long as it can be set manually whenever this default value is
wrong.
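
A minimal sketch of that "guess by default, override by hand" policy,
in Python. The function name effective_content_type is hypothetical,
and the standard library's extension-based mimetypes.guess_type stands
in here for libmagic or whatever heuristic a real client would use.

```python
import mimetypes

# Safe fallback: treat the file as opaque bytes.
DEFAULT_TYPE = "application/octet-stream"


def effective_content_type(filename, explicit=None):
    """Return the manually set property if there is one, else a guess.

    `explicit` is the value a user has set by hand; it always wins.
    The guess is only a default and may be wrong.
    """
    if explicit is not None:
        return explicit
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or DEFAULT_TYPE
```

The only part that matters is the first branch: whatever the heuristic
returns, a manually set value takes precedence.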

- Marcus