my strategy on implementing line-endings (eol) support
Alexander Belchenko
bialix at ukr.net
Thu Apr 3 11:12:30 BST 2008
Mark Hammond wrote:
> I wrote:
>
>> That sounds good. So getting back to the initial point:
>>
>>> 3) text files with native/LF/CRLF/CR line-endings
>>> 4) unicode text files similar to 3.
>> I'm suggesting that it would be less error prone to treat a missing
>> encoding as meaning "ascii", and if you do, 3 and 4 become effectively
>> identical.
>
> To be clear - there are certainly optimizations which could be applied to the implementation to operate at the byte level - but these are also available for certain non-ascii encodings, and in all cases, we should still ensure that the bytes we are processing are valid in the encoding we think we are dealing with. But these optimizations should be completely invisible to the user...
I'm not sure I understand your point completely.
Even if we force the user to set the encoding property correctly, we still can't be sure that the property actually is
correct. I called that "paranoid" mode: if we really want to be paranoid, we have to verify the encoding property
ourselves, i.e. check every file by decoding its bytestream to unicode. That means spending a large amount of time just
to verify that the user's settings are correct.
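For concreteness, a minimal sketch of such a paranoid check in Python (check_encoding is a hypothetical helper of mine,
not real bzr API); decoding the whole bytestream is exactly the O(N) cost I mean below:

    def check_encoding(data, declared_encoding):
        """Return True if the raw bytes are valid in the declared encoding.

        Decoding touches every byte, so this is O(N) per file.
        """
        try:
            data.decode(declared_encoding)
            return True
        except UnicodeDecodeError:
            return False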
As a Cyrillic-minded man I can't agree with you that the default encoding should be 'ascii'. It is the safe variant,
true, but it does not help to speed things up: any paranoid handling of encodings slows us down by O(N) per file, and I
don't think anybody wants that.
So we should either trust the user's settings *or* check explicitly for NUL bytes in the file content, to prevent
incorrect eol conversion of possibly-unicode files. That's exactly how the hg win32text extension works. This also
slows down writing files to disk, of course, but it prevents the worst scenario you described, with broken unicode files.
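Roughly, that guard looks like the following sketch (my approximation of the idea, not the actual win32text code;
the helper names are hypothetical):

    def looks_binary(data):
        # NUL bytes suggest binary, UTF-16 or UTF-32 content
        return b'\0' in data

    def convert_eol(data, eol=b'\r\n'):
        # leave suspicious files untouched rather than corrupt them
        if looks_binary(data):
            return data
        # normalize everything to LF first, then expand to the requested eol
        return data.replace(b'\r\n', b'\n').replace(b'\n', eol)

The scan for NUL bytes is itself O(N), which is why writing gets slower, but it only reads the content instead of
decoding it, so it is much cheaper than the full paranoid check.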