[PATCH] KnitIndex optimizations
Dmitry Vasiliev
lists at hlabs.spb.ru
Tue Nov 21 14:02:42 GMT 2006
John Arbash Meinel wrote:
> It shouldn't be too hard to create a knit index with several thousand
> entries, and then you can benchmark the time it takes to read it a bunch
> of times. (Say 1000-10000 entries, and open the file 100 times). We
> would like the benchmark to take ~1s, since that makes it reproducible
> without spending too much time on a single benchmark.
Ugh, I've just tried to benchmark the code, but then I realized there was a
bug in the original code - version ids aren't decoded from UTF-8. So now I
need to write tests, fix the code, and I expect no speedup with the fixed
code... :-(
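To make the bug concrete, here is roughly the kind of fix I have in mind.
This is only a minimal sketch: load_index_entries and the field layout are
placeholders for illustration, not the real _load_data() code; the point is
just that the version id read from disk is UTF-8 encoded and has to be
decoded before it is used as a key.

def load_index_entries(fp):
    # 'fp' is expected to be opened in binary mode.
    cache = {}
    for line in fp:
        fields = line.split()
        if not fields:
            continue
        # Hypothetical layout: the first field is the version id,
        # stored UTF-8 encoded on disk.
        version_id = fields[0].decode('utf-8')
        cache[version_id] = fields[1:]
    return cache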
[skip]
> Actually 'RevisionNotPresent' raises based on file_id, not filename, so
> the error raised should be the same.
>
> Also, this isn't a user-facing error, it is an error that indicates we
> have some sort of corruption or internal error, so it doesn't have to
> look pretty to the user.
Ok
>> - Maybe _load_data() should use the file's iteration protocol instead of
>> the readlines() method for parsing the index file? A file iterator uses a
>> read-ahead buffer, so it should be faster than file.readline() and should
>> consume less memory (for big files), while being a little bit slower than
>> file.readlines(). For the _load_data() method, profiling shows about a 10%
>> speedup for 'for line in fp' versus 15% for 'for line in fp.readlines()'.
>
> Remember that the file handle given here isn't always a plain file().
> Sometimes it is the return value from urllib.open(), and sometimes it is
> an sftp file.
Yes, I know.
> http probably always does a full request, and can stream the data back,
> so I would guess 'for line in fp' would be fine.
>
> sftp might be bad in that situation, since when you use 'fp.readline()'
> it actually does only an 8K read, which means a round trip for every 8k.
> Now, I may be missing an optimization, which seems to be that doing
> SFTPTransport.get() requests a prefetch of the entire file. So it may be
> that it is already spooling all of the data. If that is the case, then
> for line in fp should be fine.
>
> For the smart transport, we also just spool all of the data before
> returning, so it shouldn't be a problem there either. This is one place
> we should start looking into being able to only read *some* of the data
> before yielding it. But I digress.
Ok, so we can conclude that 'for line in fp' will be fine in most cases?
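For reference, the two variants I compared look roughly like this (a
simplified sketch, not the actual _load_data() body):

def parse_with_readlines(fp):
    # Materialises the whole file as a list of lines before looping.
    entries = []
    for line in fp.readlines():
        entries.append(line.split())
    return entries

def parse_with_iteration(fp):
    # Uses the file object's read-ahead buffer; lower peak memory, and it
    # works the same whether 'fp' is a plain file, an HTTP response or an
    # sftp file object that has already been spooled, as discussed above.
    entries = []
    for line in fp:
        entries.append(line.split())
    return entries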
> v- You should at least do the pb.update() inside the try/finally in case
> an exception is raised, the pb will still be finished correctly.
So the general rule looks like the following snippet?
pb = create_progress_bar()
try:
    pb.update(...)
finally:
    pb.finished()
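Or, applied to a loop - create_progress_bar() is just the placeholder from
the snippet above, and the pb.update() arguments are illustrative:

def process_versions(versions, create_progress_bar):
    pb = create_progress_bar()
    try:
        for count, version_id in enumerate(versions):
            # Every update happens inside the try, so finished()
            # still runs if an update (or the real work) raises.
            pb.update('reading index', count, len(versions))
            # ... real work here ...
    finally:
        pb.finished()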
> Ultimately, I have a patch which removes this, though. Because we
> actually spend a lot of time with this, and it doesn't really help the
> progress bar in the general sense.
Should some high-level UI code handle all progress bar operations instead?
> v- I don't believe we follow this structure in 'if' blocks. It is
> unnatural to most people, and python doesn't support:
>
> if value[0] = '.':
>
> It actually raises a SyntaxError. So the workaround isn't actually needed.
Actually I haven't changed this line, so it must have been someone before me
with a strong C background. :-) I'll update it in the new version of the patch.
>> + if '.' == value[0]:
>
> and here
In this line I've changed the string's index and made it consistent with the
line above (see the quoted diff and the sketch after it below). BTW, I think
the _parse_parents() method can be safely removed; I don't see any point in
keeping outdated, untested and unmaintained methods.
>>     result = []
>>     for value in compressed_parents:
>> -        if value[-1] == '.':
>> +        if '.' == value[0]:
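For context, here is roughly how I read the decoding of compressed parents.
This is only an illustrative sketch with a hypothetical helper name; the
assumption that a leading '.' marks a literal version id, with anything else
being a numeric index into the list of already-seen versions, is mine and not
quoted from the real code:

def decode_parents(compressed_parents, history):
    result = []
    for value in compressed_parents:
        if '.' == value[0]:
            # Literal version id, stored with a leading '.'
            result.append(value[1:])
        else:
            # Numeric reference into the already-seen versions.
            result.append(history[int(value)])
    return result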
[skip]
> So I'm generally positive about the changes (+0.8 or so). If you can
> write a benchmark that shows these are actual improvements, rather than
> just --lsprof improvements I would be happier.
But the patch has unexpectedly run into another sort of problem... :-)
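For what it's worth, the timing harness I've been playing with looks roughly
like this. The index line format below is a made-up stand-in rather than the
real knit index format, so it only measures the shape of the parsing loop,
not the real parser:

import time
from io import BytesIO

def make_fake_index(num_entries):
    # Synthetic stand-in lines, one per entry.
    lines = []
    for i in range(num_entries):
        lines.append(('version-%d fulltext 0 100 :\n' % i).encode('utf-8'))
    return b''.join(lines)

def benchmark(num_entries=5000, repeats=100):
    data = make_fake_index(num_entries)
    start = time.time()
    for _ in range(repeats):
        fp = BytesIO(data)
        for line in fp:
            line.split()
    return time.time() - start

if __name__ == '__main__':
    print('%.3f seconds' % benchmark())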
--
Dmitry Vasiliev (dima at hlabs.spb.ru)
http://hlabs.spb.ru