[RFC] optimizing bzr-grep

John Arbash Meinel john at arbash-meinel.com
Tue Mar 16 20:59:07 GMT 2010


Parth Malwankar wrote:
> Hello,
> 
> I am working on optimizing bzr-grep searching specific revs[1].
> I managed to get the time down from ~33s to ~23s for specific
> rev search (e.g. -r last:1, not a revision range). To get this down
> further, my experiments show that the majority of the time is now
> spent in:
>     file_text = tree.get_file_text(id)
> 
> So, if grep takes ~23s, merely commenting out the above line
> brings the time down to ~1.5s.
> 

Note that if you don't extract any file texts, it is going to be a lot
faster to grep through them as well, considering you then don't have any
text to check...

> Another optimization comes to mind. bzr-grep checks the
> first 1024 bytes of the file text before rejecting it as binary
> or accepting it as text and continuing further. However, in order
> to do the above check I still read the whole file using get_file_text.
> I see this as an issue especially for large binary files.

No, not at this point. The storage engine always extracts the full
content anyway, so it doesn't really matter.
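
For reference, that binary/text sniff is usually just a check for a NUL
byte near the start of the content. A minimal sketch, with the 1024-byte
window from your description and a helper name of my own invention:

  def _looks_binary(file_text, window=1024):
      # Treat the content as binary if a NUL byte shows up in the
      # first 'window' bytes; otherwise treat it as text and keep going.
      return '\x00' in file_text[:window]

Either way, the whole text has already been pulled out of storage by the
time a check like this runs, so skipping it doesn't save the expensive
part.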

Anyway, if extracting texts really is a slow point, one option is to
switch to using:

  tree.iter_files_bytes()

This was designed as a way to favor extraction speed. Specifically, it
doesn't guarantee a return order; rather, it tries to return the requested
content in whatever order it considers fastest to extract. (Well, the
order is technically undefined, but it is left that way so that we can
optimize it as we want.)

The default implementation just iterates over get_file_text(), but
RevisionTree.iter_files_bytes() knows better.
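
To sketch how grep could use it (names like files_to_search and pattern
are placeholders here, not the actual bzr-grep code): request everything
in one batch and take the texts back in whatever order the tree prefers:

  # Batch up everything we want to search. The second item of each
  # pair (here the path) is an opaque identifier that comes back with
  # the bytes, so matches can still be reported by path.
  wanted = [(file_id, path) for path, file_id in files_to_search]
  for path, chunks in tree.iter_files_bytes(wanted):
      file_text = ''.join(chunks)
      for offset, line in enumerate(file_text.splitlines()):
          if pattern.search(line):
              print '%s:%d:%s' % (path, offset + 1, line)

Since everything is requested up front, the tree is free to group the
extractions however it finds cheapest.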

John
=:->