[PATCH] KnitIndex optimizations
Dmitry Vasiliev
lists at hlabs.spb.ru
Tue Nov 21 14:02:42 GMT 2006
John Arbash Meinel wrote:
> It shouldn't be too hard to create a knit index with several thousand
> entries, and then you can benchmark the time it takes to read it a bunch
> of times. (Say 1000-10000 entries, and open the file 100 times). We
> would like the benchmark to take ~1s, since that makes it reproducible
> without spending too much time on a single benchmark.
Ugh, I've just tried to benchmark the code, but then I realized there was a
bug in the original code - version ids aren't decoded from UTF-8. So now I
need to write tests, fix the code, and I expect no speedup with the fixed
code... :-(
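To make the bug concrete, here is roughly the kind of fix I have in mind.
This is only a minimal sketch: load_index_entries and the field layout are
placeholders for illustration, not the real _load_data() code; the point is
just that the version id read from disk is UTF-8 encoded and has to be
decoded before it is used as a key.

def load_index_entries(fp):
    # 'fp' is expected to be opened in binary mode.
    cache = {}
    for line in fp:
        fields = line.split()
        if not fields:
            continue
        # Hypothetical layout: the first field is the version id,
        # stored UTF-8 encoded on disk.
        version_id = fields[0].decode('utf-8')
        cache[version_id] = fields[1:]
    return cache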
[skip]
> Actually 'RevisionNotPresent' raises based on file_id, not filename, so
> the error raised should be the same.
>
> Also, this isn't a user-facing error, it is an error that indicates we
> have some sort of corruption or internal error, so it doesn't have to
> look pretty to the user.
Ok
>> - Maybe _load_data() should use the file's iteration protocol instead of
>> the readlines() method for parsing the index file? A file iterator uses a
>> read-ahead buffer, so it should be faster than file.readline() and should
>> consume less memory (for big files), while being a little bit slower than
>> file.readlines(). For the _load_data() method, profiling shows about a 10%
>> speedup for 'for line in fp' versus 15% for 'for line in fp.readlines()'.
>
> Remember that the file handle given here isn't always a plain file().
> Sometimes it is the return value from urllib.open(), and sometimes it is
> an sftp file.
Yes, I know.
> http probably always does a full request, and can stream the data back,
> so I would guess 'for line in fp' would be fine.
>
> sftp might be bad in that situation, since when you use 'fp.readline()'
> it actually does only an 8K read, which means a round trip for every 8k.
> Now, I may be missing an optimization, which seems to be that doing
> SFTPTransport.get() requests a prefetch of the entire file. So it may be
> that it is already spooling all of the data. If that is the case, then
> for line in fp should be fine.
>
> For the smart transport, we also just spool all of the data before
> returning, so it shouldn't be a problem there either. This is one place
> we should start looking into being able to only read *some* of the data
> before yielding it. But I digress.
Ok, so we can conclude that 'for line in fp' will be fine in most cases?
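For reference, the two variants I compared look roughly like this (a
simplified sketch, not the actual _load_data() body):

def parse_with_readlines(fp):
    # Materialises the whole file as a list of lines before looping.
    entries = []
    for line in fp.readlines():
        entries.append(line.split())
    return entries

def parse_with_iteration(fp):
    # Uses the file object's read-ahead buffer; lower peak memory, and it
    # works the same whether 'fp' is a plain file, an HTTP response or an
    # sftp file object that has already been spooled, as discussed above.
    entries = []
    for line in fp:
        entries.append(line.split())
    return entries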
> v- You should at least do the pb.update() inside the try/finally in case
> an exception is raised, the pb will still be finished correctly.
So the general rule looks like the following snippet?
pb = create_progress_bar()
try:
    pb.update(...)
finally:
    pb.finished()
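Or, applied to a loop - create_progress_bar() is just the placeholder from
the snippet above, and the pb.update() arguments are illustrative:

def process_versions(versions, create_progress_bar):
    pb = create_progress_bar()
    try:
        for count, version_id in enumerate(versions):
            # Every update happens inside the try, so finished()
            # still runs if an update (or the real work) raises.
            pb.update('reading index', count, len(versions))
            # ... real work here ...
    finally:
        pb.finished()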
> Ultimately, I have a patch which removes this, though. Because we
> actually spend a lot of time with this, and it doesn't really help the
> progress bar in the general sense.
Should some high-level UI code handle all progress bar operations instead?
> v- I don't believe we follow this structure in 'if' blocks. It is
> unnatural to most people, and python doesn't support:
>
> if value[0] = '.':
>
> It actually raises a SyntaxError. So the workaround isn't actually needed.
Actually I haven't changed this line, so it must have been someone before me
with a strong C background. :-) I'll update it in the new version of the patch.
>> + if '.' == value[0]:
>
> and here
In this line I've changed the string's index and made it consistent with the
line above (see the quoted diff and the sketch after it below). BTW, I think
the _parse_parents() method can be safely removed; I don't see any point in
keeping outdated, untested and unmaintained methods.
>>     result = []
>>     for value in compressed_parents:
>> -        if value[-1] == '.':
>> +        if '.' == value[0]:
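For context, here is roughly how I read the decoding of compressed parents.
This is only an illustrative sketch with a hypothetical helper name; the
assumption that a leading '.' marks a literal version id, with anything else
being a numeric index into the list of already-seen versions, is mine and not
quoted from the real code:

def decode_parents(compressed_parents, history):
    result = []
    for value in compressed_parents:
        if '.' == value[0]:
            # Literal version id, stored with a leading '.'
            result.append(value[1:])
        else:
            # Numeric reference into the already-seen versions.
            result.append(history[int(value)])
    return result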
[skip]
> So I'm generally positive about the changes (+0.8 or so). If you can
> write a benchmark that shows these are actual improvements, rather than
> just --lsprof improvements I would be happier.
But the patch has unexpectedly run into another sort of problem... :-)
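For what it's worth, the timing harness I've been playing with looks roughly
like this. The index line format below is a made-up stand-in rather than the
real knit index format, so it only measures the shape of the parsing loop,
not the real parser:

import time
from io import BytesIO

def make_fake_index(num_entries):
    # Synthetic stand-in lines, one per entry.
    lines = []
    for i in range(num_entries):
        lines.append(('version-%d fulltext 0 100 :\n' % i).encode('utf-8'))
    return b''.join(lines)

def benchmark(num_entries=5000, repeats=100):
    data = make_fake_index(num_entries)
    start = time.time()
    for _ in range(repeats):
        fp = BytesIO(data)
        for line in fp:
            line.split()
    return time.time() - start

if __name__ == '__main__':
    print('%.3f seconds' % benchmark())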
--
Dmitry Vasiliev (dima at hlabs.spb.ru)
http://hlabs.spb.ru