my strategy on implementing line-endings (eol) support
Alexander Belchenko
bialix at ukr.net
Thu Apr 3 11:12:30 BST 2008
Mark Hammond wrote:
> I wrote:
>
>> That sounds good. So getting back to the initial point:
>>
>>> 3) text files with native/LF/CRLF/CR line-endings
>>> 4) unicode text files similar to 3.
>> I'm suggesting that it would be less error prone to treat a missing
>> encoding as meaning "ascii", and if you do, 3 and 4 become effectively
>> identical.
>
> To be clear - there are certainly optimizations which could be applied to the implementation to operate at the byte level - but these are also available for certain non-ascii encodings, and in all cases, we should still ensure that the bytes we are processing are valid in the encoding we think we are dealing with. But these optimizations should be completely invisible to the user...
I'm not sure I understand your point completely.
Even if we force the user to set the encoding property correctly, we still can't be sure that the property actually is
correct. I called that "paranoid" mode: if we really want to be paranoid, we have to verify the encoding property
ourselves, i.e. check every file by decoding its bytestream to unicode. That means spending a large amount of time just
to verify that the user's settings are correct.
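For concreteness, a minimal sketch of such a paranoid check in Python (check_encoding is a hypothetical helper of mine,
not real bzr API); decoding the whole bytestream is exactly the O(N) cost I mean below:

    def check_encoding(data, declared_encoding):
        """Return True if the raw bytes are valid in the declared encoding.

        Decoding touches every byte, so this is O(N) per file.
        """
        try:
            data.decode(declared_encoding)
            return True
        except UnicodeDecodeError:
            return False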
As a Cyrillic-minded man I can't agree with you that the default encoding should be 'ascii'. It is the safe variant,
true, but it does not help to speed things up: any paranoid handling of encodings slows us down by O(N) per file, and I
don't think anybody wants that.
So we should either trust the user's settings *or* check explicitly for NUL bytes in the file content, to prevent
incorrect eol conversion of possibly-unicode files. That's exactly how the hg win32text extension works. This also
slows down writing files to disk, of course, but it prevents the worst scenario you described, with broken unicode files.
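Roughly, that guard looks like the following sketch (my approximation of the idea, not the actual win32text code;
the helper names are hypothetical):

    def looks_binary(data):
        # NUL bytes suggest binary, UTF-16 or UTF-32 content
        return b'\0' in data

    def convert_eol(data, eol=b'\r\n'):
        # leave suspicious files untouched rather than corrupt them
        if looks_binary(data):
            return data
        # normalize everything to LF first, then expand to the requested eol
        return data.replace(b'\r\n', b'\n').replace(b'\n', eol)

The scan for NUL bytes is itself O(N), which is why writing gets slower, but it only reads the content instead of
decoding it, so it is much cheaper than the full paranoid check.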