my strategy on implementing line-endings (eol) support

Mark Hammond mhammond at skippinet.com.au
Fri Apr 4 01:16:45 BST 2008


I'm just trying to ensure we are on the same page here.  I hope I don't sound pedantic or argumentative...

> Alexander Belchenko wrote:
> > According to my example with a utf-16 bytestream, we never break
> > unicode file content when we read it from disk, only when we create
> > a new file on disk. So, every time we want to write new file content
> > to disk we need to check the encoding.
> 
> Err, this statement is not correct. Reading a utf-16 file in 'rU' mode
> will convert every \r byte to \n, which is incorrect.

Only because the Python 2.x file model works with bytes - there is no way to tell the IO system what the encoding is.  This is one of the warts fixed in Python 3.0, which handles "universal newline" mode just fine for Unicode:

Python 3.0a4+ (py3k, Apr  4 2008, 10:56:16) [MSC v.1500 32 bit (Intel)] on win32
>>> open("c:\\temp\\delme.txt", "wb").write("hello\r\nworld".encode("utf16"))
26
>>> repr(open("c:\\temp\\delme.txt", encoding="utf16", newline=None).read())
"'hello\\nworld'"

Note how the '\r\n' from the input stream was converted to '\n'.  Although this is a slightly different transform from the one you want to apply, the concepts are identical.
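To spell out the write side (just a sketch using a throwaway path, nothing bzr-specific), the same newline machinery works in reverse - you choose the encoding and the EOL you want when you write the text back out:

# read as text: decode utf-16, translate \r\n (and bare \r) to \n
with open("c:\\temp\\delme.txt", encoding="utf16", newline=None) as f:
    text = f.read()                    # 'hello\nworld'

# write as text: encode to utf-16, expanding each \n to \r\n on disk
with open("c:\\temp\\delme.txt", "w", encoding="utf16", newline="\r\n") as f:
    f.write(text)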

> So I should admit I was wrong
> about treating unicode files as text files with arbitrary line-endings.

Actually, I think the problem was about "treating Unicode files as *byte streams* with arbitrary line-endings" - treating them as text is no problem - but to treat them as text you must know their encoding!
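To make that concrete (a minimal sketch, not bzr code): a byte-level replace never even sees a '\r\n' pair in utf-16 data, because a NUL byte sits between the two characters - you have to decode first, and for that you must know the encoding:

data = "hello\r\nworld".encode("utf16")

# as a byte stream: the pair b"\r\n" never occurs (the bytes are
# b"...\r\x00\n\x00..."), so this replace silently does nothing
untouched = data.replace(b"\r\n", b"\n")

# as text: decode with the right codec, convert, then re-encode
converted = data.decode("utf16").replace("\r\n", "\n").encode("utf16")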

> I don't see any efficient way to handle eol in unicode files without
> hurting performance, so it's better to follow the hg model and disable
> eol-conversion for them, even if the user sets the 'eol' property to
> some value other than 'exact'.

This isn't my conclusion.  I would suggest there is no efficient way to perform EOL conversion on an ASCII file many megabytes in size.  The solution in that case is not to enable EOL conversion for such files, and I think the same applies here.  There is a threshold beyond which you can't do it effectively for ASCII files; there is a similar threshold for Unicode files, it's just a little lower.  Conversely, I would say that the "average" source file, in any encoding supported by Python, could probably take the hit of the encode and decode without significantly degrading performance.
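As a sketch of the kind of policy I mean (the names and numbers below are made up for illustration, not proposed defaults):

# illustrative thresholds only - not proposed bzr defaults
MAX_EOL_CONVERT_BYTES = 8 * 1024 * 1024      # ASCII/byte content
MAX_EOL_CONVERT_UNICODE = 4 * 1024 * 1024    # a little lower for Unicode

def should_convert_eol(size_in_bytes, is_unicode):
    # skip EOL conversion entirely for files over the relevant threshold
    limit = MAX_EOL_CONVERT_UNICODE if is_unicode else MAX_EOL_CONVERT_BYTES
    return size_in_bytes <= limit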

It seems to me that the EOL conversions are always O(N) - and I think we can remain very close to O(N) for most encodings, especially if we have to process the file as a stream anyway (i.e., s.replace() can't really be the implementation unless we slurp the whole file into memory, which doesn't seem like an ideal implementation...)
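For instance (purely a sketch of the streaming shape I mean - the function name and signature are invented here), the incremental codecs let you decode, convert and re-encode a chunk at a time without ever holding the whole file:

import codecs

def convert_eol_stream(src, dst, encoding, eol="\n", chunk_size=64 * 1024):
    # sketch only: reads bytes from src, writes converted bytes to dst
    decoder = codecs.getincrementaldecoder(encoding)()
    encoder = codecs.getincrementalencoder(encoding)()
    pending_cr = False                 # a chunk may split a '\r\n' pair
    while True:
        chunk = src.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        if pending_cr:
            text = "\r" + text
            pending_cr = False
        if chunk and text.endswith("\r"):
            text = text[:-1]           # hold it until we see the next chunk
            pending_cr = True
        # normalise \r\n and bare \r to \n, then expand to the target eol
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        if eol != "\n":
            text = text.replace("\n", eol)
        dst.write(encoder.encode(text, final=not chunk))
        if not chunk:
            break

Something along those lines stays O(N) and keeps memory bounded by the chunk size, whatever the encoding is.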
 
Cheers,

Mark



