[RFC] optimizing bzr-grep

Tue Mar 16 23:59:15 GMT 2010

On 17 March 2010 01:27, Parth Malwankar <parth.malwankar at gmail.com> wrote:
> Hello,
>
> I am working on optimizing bzr-grep searching specific revs[1].
> I managed to get the time down from ~33s to ~23s for specific
> rev search (e.g. -r last:1, not revision range). To get this down
> further, my experiments show that the majority of the time is now
> spent in:
>    file_text = tree.get_file_text(id)
>
> So, if grep takes ~23s, merely commenting out the above line
> brings the time down to ~1.5s.
>
> [emacs-bzr]% time bzr grep -r last:10 ffo > /dev/null
> bzr grep -r last:10 ffo > /dev/null  19.19s user 3.77s system 99% cpu 23.054 total
> [emacs-bzr]% time bzr grep -r last:10 ffo > /dev/null
> bzr grep -r last:10 ffo > /dev/null  1.07s user 0.20s system 89% cpu 1.421 total
>
> Is there anything I can do to speed up getting the full text of
> a revision?

Well, if by commenting this line out you're grepping a 0-byte string
it wouldn't be surprising if it's fast :-)

You should make sure you're holding a read lock on the whole
repository for the whole time, so that things can be cached. -Drelock
may help.
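
A minimal sketch of what I mean by holding the lock, not tested against
bzr-grep itself. The function name grep_tree and its arguments are made
up for illustration; lock_read() and unlock() are the usual bzrlib tree
methods, and get_file_text() is the call from your mail. The point is
only that the lock is taken once, outside the loop.

```python
import re


def grep_tree(tree, file_ids, pattern):
    """Return the ids of files whose text matches pattern.

    Hypothetical helper: `tree` is anything with lock_read(), unlock()
    and get_file_text(file_id); `pattern` is a bytes regex.
    """
    regex = re.compile(pattern)
    matches = []
    tree.lock_read()  # take the read lock once, outside the loop
    try:
        for file_id in file_ids:
            text = tree.get_file_text(file_id)
            if regex.search(text):
                matches.append(file_id)
    finally:
        tree.unlock()  # always release, even if a read fails
    return matches
```

If -Drelock still reports repeated locking with something like this in
place, the lock is probably being dropped and retaken further up the
call chain.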

Opening the repository through a log+file://.... URL logs each transport
operation, which may reveal inefficient IO.

Using iter_file_bytes may be faster, or even better iter_files_bytes
will let the repository choose a more efficient order.  This will also
let you check for binaries inline with grepping.
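
A rough sketch of the inline binary check, assuming the file content
arrives as an iterator of byte chunks. probe_chunks is a made-up name,
and treating a NUL byte in the first 1024 bytes as "binary" is my
assumption about the heuristic, not necessarily what bzr-grep does.

```python
from itertools import chain


def probe_chunks(chunks, probe_size=1024):
    """Inspect at most probe_size leading bytes of a chunk iterator.

    Returns (is_binary, all_chunks), where all_chunks replays the
    chunks already consumed followed by the ones not yet read.
    """
    chunks = iter(chunks)
    seen = []
    seen_len = 0
    for chunk in chunks:
        seen.append(chunk)
        seen_len += len(chunk)
        if seen_len >= probe_size:
            break  # enough bytes to decide; leave the rest unread
    prefix = b"".join(seen)[:probe_size]
    is_binary = b"\0" in prefix  # assumed heuristic: NUL byte means binary
    return is_binary, chain(seen, chunks)
```

If is_binary is true you simply drop the iterator; otherwise you grep
all_chunks. This only saves anything when the text really is delivered
in several chunks. If a file comes back as one large chunk, the whole
thing has been read by the time you see it.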

It may be faster to grep the whole thing as a string before splitting
it into lines.
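
Something along these lines, as an untested sketch with a made-up
function name. The whole text is searched once, and only files that
match at all are split into lines. It assumes the text and the pattern
are both bytes.

```python
import re


def grep_text(text, pattern):
    """Return (line_number, line) pairs for lines matching pattern.

    MULTILINE makes ^ and $ behave the same in the whole-text search
    as they do when each line is searched on its own.
    """
    regex = re.compile(pattern, re.MULTILINE)
    if regex.search(text) is None:
        return []  # fast path: no match anywhere, so never split
    lines = text.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a trailing newline
    results = []
    for lineno, line in enumerate(lines, 1):
        if regex.search(line):
            results.append((lineno, line))
    return results
```

Most files will not contain the pattern, so most of the time only the
first search runs. Patterns using \A, \Z or lookarounds that peek at
neighbouring lines may behave differently in the whole-text search, so
treat this as a starting point only.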

Use --lsprof to profile the command and see where the time actually goes.

Compare the time to grep a revision to the time to export it; export
also has to fetch every file text, so it gives you a baseline.

hth

>
> Another optimization comes to mind. bzr-grep checks the
> first 1024 bytes of the file text before rejecting it as binary
> or accepting it as text and continuing further. However, in order
> to do the above check I still read the whole file using get_file_text.
> I see this as an issue especially for large binary files.
>
> Is there an API that would allow me to get just a 1 KB chunk,
> e.g. chunk = tree.get_file_text(id, size=1024)?
> This way, if the tree has many binary files they won't be read
> fully into memory before being rejected, saving space
> and time.
>
> I would appreciate any suggestions or comments.
>
> Regards,
> Parth
>
> [1] https://bugs.launchpad.net/bzr-grep/+bug/539429
>
>

-- 
Martin <http://launchpad.net/~mbp/>