[MERGE] More updates for fileids_altered_by....

Mon Dec 18 17:58:56 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The attached patch has one more update for knit parsing. It should
help any time we extract data from a knit, and I was quite surprised
by how much it helped.

Basically, it refactors _KnitData._parse_record and _parse_record_header
to share a helper function. Instead of _parse_record calling
_parse_record_header to read one line and then calling readlines() on
the rest of the gzip chunk, it now just calls readlines() once and
passes the first line to _check_header().
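
To illustrate the shape of that change, here is a minimal sketch, not
the patch itself. The function names, the header layout
(b"version <id> <line-count> <digest>") and the make_record() helper are
assumptions made up for the example; only the idea of one readlines()
call followed by a header check comes from the description above.

```python
import gzip
import io


def check_header(version_id, header_line):
    """Validate a header line and return its fields.

    Assumed (hypothetical) layout: b"version <id> <line-count> <digest>\n".
    """
    fields = header_line.split()
    if len(fields) != 4 or fields[0] != b"version":
        raise ValueError("unexpected header: %r" % (header_line,))
    if fields[1] != version_id:
        raise ValueError("wanted %r, found %r" % (version_id, fields[1]))
    return fields


def parse_record(version_id, raw_bytes):
    """Decompress one record with a single readlines() call."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes), mode="rb") as gz:
        lines = gz.readlines()
    if not lines:
        raise ValueError("empty record")
    # The header is simply the first decoded line; no separate readline().
    header = check_header(version_id, lines[0])
    return header, lines[1:]


def make_record(version_id, body_lines):
    """Build a compressed record for demonstration (hypothetical helper)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(b"version %s %d sha1-placeholder\n"
                 % (version_id, len(body_lines)))
        gz.writelines(body_lines)
    return buf.getvalue()
```

Calling parse_record(b"rev-1", make_record(b"rev-1", [b"a\n", b"b\n"]))
returns the split header fields and the two body lines.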

In my timing tests it cuts the time to run
'fileids_altered_by_revision_ids' from 6.4s to 5.9s (approx 500ms
improvement, ~8% of the original time).

I had done some profiling, and basically the GzipFile object doesn't
handle readline() terribly well (it reads in small chunks, has to do a
lot of searching and parsing, etc.).

I sort of wish we didn't have to create a whole GzipFile and StringIO()
every time we want to completely decode a gzip buffer, but at the
moment, it seems to be the only way.
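
For comparison, one possible alternative is sketched below. It is not
part of the patch and is untested against real knit data: the zlib
module can decode a gzip-framed buffer directly when wbits is set to
16 + zlib.MAX_WBITS, which avoids building the two wrapper objects. The
sample payload and function names are invented for the example, and
whether this is actually faster for knit records would need measuring.

```python
import gzip
import io
import zlib


def decode_with_gzipfile(raw_bytes):
    """Decode a gzip buffer via GzipFile wrapped around an in-memory file."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes), mode="rb") as gz:
        return gz.read()


def decode_with_zlib(raw_bytes):
    """Decode the same buffer with zlib alone.

    wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    """
    return zlib.decompress(raw_bytes, 16 + zlib.MAX_WBITS)


# Made-up payload standing in for one compressed record.
sample = gzip.compress(b"version rev-1 1 digest\nhello\nend rev-1\n")
lines = decode_with_zlib(sample).splitlines(True)
```

Both functions return the same bytes for a single-member gzip buffer;
splitlines(True) then gives the per-line list that readlines() would.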

Just to give people a heads up about how my modifications have been
affecting things, this is what my benchmarking reports. All of this is
the time to run 'fileids_altered_by_revision_ids' on the complete bzr
history in my repository (which has a few plugins and such mixed in).

16.364s - 0.13
 7.346s - with only the "iter_lines_added uses set()" patch (2176)
 7.152s - with regex updates (2194)
 6.434s - with no annotations patch (2196?)
 5.783s - removing extra .readline() call

So the big save was using set(), but I've managed to knock off another
1.6s since then (~21% of the 7.3s that remained).

And the best part about this last fix is that it should speed up every
other place where we extract data out of knits.

Current profiling shows that fileids_altered takes a total of 19s. 5.5s
of that is spent directly in fileids_altered. The bulk of that is
because we are looping over around 170,000 lines and checking whether
each one is in the desired set. We spend about 5s there, 0.6s finding
the right entry in the result dict, another 0.6s adding the revision to
the specific file_id set, and 1.8s looking up the decoded form of both
the file id and revision id in a dictionary.
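
The loop described above has roughly the following shape. This is a
simplified sketch under assumptions, not the bzrlib code: the line
layout, the regex, and the helper names are hypothetical, and
html.unescape stands in for whatever decoding the real code does.

```python
import html
import re

# Hypothetical line layout: an XML-style line carrying file_id and
# revision attributes, in that order.
_ATTRS = re.compile(
    r'file_id="(?P<file_id>[^"]+)".*? revision="(?P<revision_id>[^"]+)"')


def _cached_unescape(cache, raw):
    """Decode an escaped id once, then reuse it via a dictionary lookup."""
    try:
        return cache[raw]
    except KeyError:
        value = cache[raw] = html.unescape(raw)
        return value


def fileids_altered(lines, selected_revision_ids):
    """Map each file id to the selected revisions that touched it."""
    selected = set(selected_revision_ids)  # O(1) membership checks
    cache = {}
    result = {}
    for line in lines:
        match = _ATTRS.search(line)
        if match is None:
            continue
        revision_id = _cached_unescape(cache, match.group("revision_id"))
        if revision_id not in selected:
            continue
        file_id = _cached_unescape(cache, match.group("file_id"))
        # Find (or create) the entry for this file id, then add the revision.
        result.setdefault(file_id, set()).add(revision_id)
    return result
```

Each of the costs listed above corresponds to one step in the loop body:
the membership test, the result dict lookup, the set add, and the two
cached decodes.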

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFhtbfJdeBCYSNAAMRAqRTAJ9T11dbqXq3sIlfZA/s8nrTpBeyvQCfeAm7
m6e6o0cB2LvTGj9qRcc1AKI=
=buHs
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: knit_data_parse_record.patch
Type: text/x-patch
Size: 2931 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20061218/07cb00c1/attachment.bin