my strategy on implementing line-endings (eol) support

Alexander Belchenko bialix at ukr.net
Thu Apr 3 13:53:44 BST 2008


Mark Hammond writes:
>> I'm not sure I understand your point completely.
>>
>> Even if we force the user to set the encoding property correctly, we still
>> can't be sure that this property is correct.
> 
> But we *can* be sure this property is correct as part of performing EOL translations, right?  If we detect it is not in the encoding we think it is (and therefore risk creating a file which is invalid in whatever encoding *is* correct), then we can simply refuse to perform any EOL translations at all?

But how can we detect that the encoding is not correct? Only by probing it, right?
So we should also probe encoding correctness at commit time, to be sure
no garbage gets into the history. Or should we not narrow the commit gates?


>> I called it "paranoid" mode. So if we really want to be paranoid, we
>> should check that the encoding property is actually
>> right, i.e. we should check every file by decoding its bytestream to
>> unicode. It means we would spend a large amount of
>> time just to check that the user's settings are correct.
> 
> I think it is only necessary to check the setting is correct when we *assume* it is correct (ie, when we will do the wrong thing if it is incorrect).  I agree that most of the time we don't care if it is correct, as we don't attempt to interpret the file in that encoding.

As my utf-16 bytestream example shows, we never break unicode file content
when we read it from disk, only when we create a new file on disk. So every time
we want to write new file content to disk, we need to check the encoding.
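To make this concrete, a minimal demonstration (Python 3 syntax, purely
illustrative) of how a blind byte-level replacement corrupts a utf-16 stream:

    data = 'a\nb'.encode('utf-16-le')         # b'a\x00\n\x00b\x00'
    mangled = data.replace(b'\n', b'\r\n')    # 7 bytes -- odd length, stream broken
    try:
        mangled.decode('utf-16-le')
    except UnicodeDecodeError:
        print('utf-16 content corrupted by blind EOL conversion')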

We write new file content to disk in the branch, checkout, pull, local push,
update and merge operations. We also produce similar content with the cat command.
I don't count diff here, for obvious reasons.
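A minimal sketch of what such a check at write time could look like, assuming
a declared per-file encoding (all names here are hypothetical):

    def convert_eol_for_write(data, encoding, eol='\r\n'):
        """Convert LF to the target EOL, refusing if `data` is not
        valid in the declared encoding (assumes LF-only input)."""
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            return data                     # refuse: don't risk writing garbage
        return text.replace('\n', eol).encode(encoding)

Doing the replacement on the decoded unicode text and re-encoding avoids the
utf-16 corruption shown above.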

>> As a Cyrillic-minded man I can't agree with you that the default encoding
>> should be 'ascii'. It's the safe variant, it's
>> true, but it does not help to speed things up.
> 
> It's not about speeding things up, it is about resisting the temptation to guess.  To my mind, it's very similar to Python choosing to use "ascii" as the default encoding with 'error' as the default handling - it is somewhat frustrating at times, but the least error prone decision.

Yes, I understand this analogy with Python.

> However, it is a matter of philosophy - I always take the approach that it's easier to make a correct program fast than it is to make a fast program correct :)  I think I've made my point, so I'm happy to let things rest there...

Sorry, but I still don't understand your proposal.

If we use 'ascii' as the default encoding, then what?
Should any file that has the 8th bit set in its content be refused eol conversion?
'ascii' is a valid encoding for a blind s.replace('\n', '\r\n'), yes?
But unicode (utf-16 encoded) files could potentially contain no bytes with the upper bit set.
Yes, I know about the BOM, but imagine someone produces utf-16 with some script and doesn't
put BOM characters in. At least for the Python interpreter, the absence of a BOM is not an error.
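For example, plain-ASCII text encoded as utf-16 passes any 8th-bit check
(a quick illustration, not bzr code):

    data = 'hello\n'.encode('utf-16-le')       # no BOM
    assert all(byte < 0x80 for byte in data)   # every byte looks "ascii-safe"

So an 8th-bit test alone cannot stop a blind byte-level replace from mangling
such a file.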

So should the default be: if the encoding is not set, then refuse to do eol conversion at all?
Is that correct?

In my opinion we should check for the \0 byte as well, because it is most likely that a utf-16
encoded file will have at least one \0 inside. Is that a correct assumption?
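A minimal sketch of that heuristic (the name is illustrative, not an actual
bzrlib function):

    def probably_utf16_or_binary(data):
        """Any NUL byte suggests utf-16 (or binary) content,
        so EOL conversion should be skipped."""
        return b'\0' in data

This would also catch the BOM-less script-generated case above, since every
ASCII character encoded in utf-16 carries a NUL byte.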


