my strategy on implementing line-endings (eol) support

Mark Hammond mhammond at skippinet.com.au
Thu Apr 3 07:20:58 BST 2008


> Yes, of course. But decoding all bytestreams to unicode blindly will be
> major slowdown for operations like status and diff and commit.

It should not need to be done for all bytestreams, or for all operations, but I see no alternative but to say it *must* be done, when performing eol-conversion, for all files with an encoding property.

But what happens to the user who checks in a UTF-16 file, sets eol-style to native, but neglects to tell bzr about the encoding?  I doubt we want to write an invalid UTF-16 file for them - but if we do, and they attempt to correct their error by setting the encoding to utf-16 (even though the content is now invalid UTF-16), what happens?  It sounds messy, and IMO we should have failed early.
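To make the failure mode concrete, here is a small sketch (not bzr code) of why byte-level eol conversion corrupts a UTF-16 file when the encoding is ignored:

```python
# A one-line UTF-16-LE file: every character occupies two bytes.
data = "hi\n".encode("utf-16-le")      # b'h\x00i\x00\n\x00'

# Naive byte-level LF -> CRLF conversion, ignoring the encoding,
# inserts a lone byte and breaks the two-byte alignment:
naive = data.replace(b"\n", b"\r\n")   # 7 bytes - an odd length
try:
    naive.decode("utf-16-le")
except UnicodeDecodeError:
    print("naive conversion produced invalid UTF-16")

# Correct: decode first, convert on text, then re-encode.
text = data.decode("utf-16-le")
good = text.replace("\n", "\r\n").encode("utf-16-le")
assert good.decode("utf-16-le") == "hi\r\n"
```

This is exactly the "invalid UTF-16 file" scenario above: once the odd byte is written, no later encoding property can repair it.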

It seems safer to me to *enforce* that any byte string with an unknown encoding holds only ASCII bytes when performing eol processing.  This will only happen when the user has explicitly asked for the file to be treated as text, and we are already processing the entire body anyway, so the additional hit seems marginal.
 
> I'm planning to let user decide what encoding has content and then set
> appropriate file property, called "encoding".

That sounds good.  So getting back to the initial point:

> 3) text files with native/LF/CRLF/CR line-endings
> 4) unicode text files similar to 3.

I'm suggesting that it would be less error prone to treat a missing encoding as meaning "ascii"; if you do, cases 3 and 4 become effectively identical.
 
Cheers,

Mark




More information about the bazaar mailing list