my strategy on implementing line-endings (eol) support

Mark Hammond mhammond at skippinet.com.au
Thu Apr 3 22:58:44 BST 2008


> > But we *can* be sure this property is correct as part of performing
> > EOL translations, right?  If we detect it is not in the encoding we
> > think it is (and therefore risk creating a file which is invalid in
> > whatever encoding *is* correct), then we can simply refuse to perform
> > any EOL translations at all?
> 
> But how can we detect that the encoding is not correct? Only by
> probing it, right?

Only by probing it - and we only probe *while* doing EOL processing.

> So we should also probe encoding correctness at commit time, to be
> sure no garbage gets into the history. Or should we not narrow the
> commit gates?

No - as I mentioned, we should only check the encoding when we *use* the encoding and would produce garbage if it is wrong.

> According to my example with a utf-16 bytestream, we never break
> unicode file content when we read it from disk, only when we create a
> new file on disk. So, every time we want to write new file content to
> disk we need to check the encoding.

No - only when we *change* the contents of the file on disk using text-based rules (eg, mechanical EOL translation).

> We write new content for files on disk in branch, checkout, pull, local
> push, update and merge operations. Also we produce similar content with
> the cat command.

But please understand that none of these operations attempt to use the encoding.  I hope I have made it clear that I only think this should be done when we are *already* doing operations which treat the file as containing characters, and would write garbage if it is wrong.

Again: only when it is *already* necessary to interpret bytes as characters should we check the bytes are what we think they are.  This would happen at no other time.
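
To make that concrete, here is a minimal sketch in plain Python of the
rule I mean - the function name and signature are invented for
illustration, this is not actual bzrlib code:

    def convert_eol(content, encoding, eol=b"\r\n"):
        # Probe the declared encoding only because we are about to
        # rewrite the bytes; if the probe fails, refuse the whole
        # translation rather than risk writing garbage.
        try:
            content.decode(encoding)
        except UnicodeDecodeError:
            raise ValueError("refusing EOL translation: content is not "
                             "valid %s" % encoding)
        # The blind replace below itself assumes an ascii-compatible
        # encoding - see the UTF-16 discussion further down.
        return content.replace(b"\n", eol)

Checkout, cat, merge and friends would never get near anything like
this unless EOL translation had actually been requested for the file.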

> Yes, I understand this analogy with Python.
> 
> > However, it is a matter of philosophy - I always take the approach
> > that it's easier to make a correct program fast than it is to make a
> > fast program correct :)  I think I've made my point, so I'm happy to
> > let things rest there...
> 
> Sorry, but I still don't understand your proposal.
> 
> If we use 'ascii' as the default encoding, then what?
> Should any file that has the 8th bit set in its content be refused
> eol conversion?

Yes - it is clearly not ascii.  I'd have no problem with the user having the ability to set their default - but "no value" for the encoding property should mean "ascii".
 
> 'ascii' is a valid encoding for doing a blind s.replace('\n', '\r\n'). Yes?

But that isn't relevant - if the file is not ascii, but you have treated it as ascii, you have corrupted the file.  Only if you are sure it truly does contain ascii can you be sure your output is correct.
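
As a trivial illustration (again with invented names, not bzr code),
the "are we sure it truly is ascii" check is nothing more than a decode
attempt:

    def is_really_ascii(content):
        # Every byte must be below 0x80; decoding as ascii verifies
        # exactly that.
        try:
            content.decode("ascii")
            return True
        except UnicodeDecodeError:
            return False

Only when that returns True is the blind s.replace('\n', '\r\n')
guaranteed not to corrupt the file.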

> But unicode (utf-16 encoded) files could potentially contain no bytes
> with the upper bit set.
> Yes, I know about the BOM, but imagine someone produces utf-16 with
> some script and doesn't put a BOM in. At least for the Python
> interpreter the absence of a BOM is not an error.

If we are assuming UCS2 (ie, UTF16 without surrogates) and know the byte order, then I see no reason you couldn't also optimize this encoding with a blind s.replace("\n\0", "\r\0\n\0").  It also seems possible to me that utf8 could perform fairly well without truly decoding.

However, I believe UTF16 *can* have high bits set.
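
To make the UCS2 idea concrete (a sketch only, assuming little-endian
byte order - big-endian would put the zero bytes first):

    def convert_eol_ucs2_le(content):
        # In little-endian UCS2 every code unit is two bytes, so LF on
        # disk is b"\n\x00" and CRLF is b"\r\x00\n\x00".
        return content.replace(b"\n\x00", b"\r\x00\n\x00")

In principle a byte-level replace like this can match a pair of bytes
that straddles two code units, so a real implementation would probably
still want to verify the content decodes first.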

> So, should the default be: if the encoding is not set then refuse to
> do eol conversion at all?
> Is that correct?

That is my position - in the same way that Python will refuse to perform an encoding or decoding operation if the data is wrong.

Imagine if Python took the position that foo.decode('utf16'), if "foo" was not valid utf16, would silently return a garbage string without complaint.  I think it is a *feature* that you get an error (ie, Python refuses to perform the operation) in that case.

I think it would be a *feature* of bzr to fail whenever it encountered data which is invalid in the encoding bzr thinks it is - it really is as simple as that.
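
That behaviour is easy to demonstrate with plain Python (nothing
bzr-specific here):

    data = b"\xff\xfe\x00"        # not a complete utf-16 byte stream
    try:
        data.decode("utf-16")
    except UnicodeDecodeError as e:
        print("Python refuses:", e)   # the error is the feature

bzr failing in the same situation would simply be that same feature one
level up.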

> In my opinion we should check for the \0 byte as well, because it is
> most likely that a utf-16 encoded file will have at least one \0
> inside. Is that a correct assumption?

No - if you look at all the encodings supported by Python, I don't think you can make *any* assumptions about \0 characters, or any others.  If you reduce the set of encodings you intend to support, you can start to make such assumptions.
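
As a concrete counter-example (plain Python), text made up entirely of
Cyrillic letters encodes to utf-16-le without a single \0 byte, so even
that heuristic fails:

    text = u"\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет", Cyrillic only
    data = text.encode("utf-16-le")
    print(b"\x00" in data)   # False - perfectly valid utf-16, no NUL bytes

Add a space or a newline to that text and the \0 bytes reappear, which
is exactly why no single byte value can be relied on.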

Mark 



