[RFC] Extract unannotated texts from Knits
John Arbash Meinel
john at arbash-meinel.com
Wed Jul 25 20:29:07 BST 2007
The attached patch is layered on top of my earlier one for changing the
layering for extracting data from Knits.
Basically, that patch made it easier for the _KnitData and _Knit*Factory
code to extract only what we are going to use (the texts, not including the
annotations). The attached patch makes use of that to move all of the
extraction code into Pyrex, and to skip over the annotations there.
The big win here is that we remove all of the 'string.split()' calls,
especially when we are just throwing away part of the data.
A small win is using zlib directly, instead of going through
tuned_gzip.GzipFile (I think about 200ms of the 5s was saved on gzip
extraction).
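(To illustrate the zlib-vs-GzipFile point: at the Python level the idea is
roughly the following. The actual patch does the equivalent from Pyrex, so
the function and calls here are only illustrative.)

    import zlib

    def extract_record_lines(raw_gzip_data):
        # wbits = 16 + MAX_WBITS tells zlib to handle the gzip header and
        # trailer itself, so we never construct a GzipFile object at all.
        text = zlib.decompress(raw_gzip_data, 16 + zlib.MAX_WBITS)
        # Knit records are newline-terminated lines; keep the endings.
        return text.splitlines(True)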
To give the approximate benefits:

1) Using _get_text_map() instead of _get_content_maps(): about 10%, 5s => 4.5s
2) Using direct calls to zlib's inflate() instead of gzip.GzipFile: about 5%,
   4.5s => 4.25s
3) Switching to an extractor that avoids creating PyStrings for the
   annotations at all: about 20%, 4.25s => 3.5s (a sketch of the idea follows
   below)
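(A rough Python-level sketch of point 3, assuming the annotated-knit line
layout of "origin-revision-id <space> text"; the Pyrex version does the
equivalent at the C level so that no PyString is ever built for the
annotation column:)

    def unannotated_lines(annotated_lines):
        # Each stored line looks like "<origin> <text>\n".  Slicing from just
        # past the first space gives the text without materializing the
        # origin as its own string, which line.split(' ', 1) would do.
        return [line[line.index(' ') + 1:] for line in annotated_lines]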
So in total, with this patch, I can do:

  bzr checkout bzr.dev test

in 3.5s rather than 5s, roughly 1.4x faster. The bulk of this win is in
Knit.get_lines(), which improves from 3.75s under lsprof to 1.3s with my
patch (about 2.9x faster).
This code is only an RFC because I need to add a bunch more test cases to
the direct tests (all the error conditions, no lines, too many lines, etc.).
Also, I wrote the code so that I could have a fixed-size decompression buffer
and call zlib's inflate() multiple times, but that part has not been
implemented yet. So there are a couple of edge cases it doesn't handle (a
delta block > 1MB uncompressed, and, harder to deal with, a single
delta/fulltext *line* > 1MB).
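(In pure Python the fixed-buffer idea would look something like this; the
real thing would call inflate() from Pyrex with a preallocated C buffer, so
zlib.decompressobj() and its max_length argument are just the closest
stand-in:)

    import zlib

    def decompress_in_chunks(raw_gzip_data, chunk_size=1024 * 1024):
        # Produce at most chunk_size bytes of output per call, instead of
        # inflating a whole record into one potentially huge string.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
        data = raw_gzip_data
        while data:
            chunk = d.decompress(data, chunk_size)
            if chunk:
                yield chunk
            if d.unconsumed_tail == data:
                break       # no forward progress; don't spin on bad input
            data = d.unconsumed_tail
        tail = d.flush()
        if tail:
            yield tail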
I'm a little unhappy that 'cp -ar' is in the 0.3s range, which means we are
about 10x slower than a plain 'cp -ar'. Profiling puts file.close() at about
1s (which is the time spent actually writing out the data), which is extra
strange since cp -ar manages the whole copy in 300ms. I at least wonder
whether cp is able to use memmapped files rather than direct access.
Under lsprof it takes 5s, which breaks down into:

 1023ms  file.close()  (should be the flush-to-disk time)
  380ms  loading 650 .kndx files
  280ms  extracting 650 fulltexts
  250ms  extracting 3341 deltas
  500ms  get_revision_graph()
           136ms calling get_parents() 12.5k times
           250ms calling get_graph() 1 time
         I don't quite understand why we call get_parents directly for each
         revision, and then *also* call get_graph. Note that get_graph *also*
         calls get_parents 15k times. In total, we spend 385ms in 31.2k
         get_parents() calls.
  112ms  calling file.writelines() 651 times
  112ms  calling lockable_files.unlock() 137 times
  148ms  calling Transport.readv() 650 times, which calls _seek_and_read
         4641 times for a total of 1211 actual seek+reads on disk.
         What I don't understand is why 4641 == 3341 + 650 + 650; for some
         reason it seems like we have an extra 650 reads in there. (A sketch
         of the kind of offset coalescing that turns 4641 requests into 1211
         reads follows this breakdown.)
  316ms  in 7 calls to mutable_tree.tree_write_locked()
         At the very least, we seem to be physically locking the tree
         multiple times. I believe this is because we lock it during
         creation, and then again when we start updating the tree. I've said
         before that initialization should probably return a locked tree,
         but I'm not sure whether that would propagate around in an obvious
         way.
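(The coalescing mentioned above is conceptually just merging sorted
(offset, length) pairs that touch or nearly touch. The name and details here
are made up for illustration; this is not the actual _seek_and_read code:)

    def coalesce_offsets(offsets, fudge=0):
        # offsets: iterable of (offset, length) requests.
        # Returns merged (offset, length) ranges; the caller issues one
        # seek+read per merged range and slices the original requests back
        # out of the result.
        merged = []
        for offset, length in sorted(offsets):
            if merged and offset <= merged[-1][0] + merged[-1][1] + fudge:
                prev_offset, prev_length = merged[-1]
                end = max(prev_offset + prev_length, offset + length)
                merged[-1] = (prev_offset, end - prev_offset)
            else:
                merged.append((offset, length))
        return merged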
Another top-down view shows:

 3880ms  transform._build_tree
   3241ms  new_by_entry
     1888ms  get_file_lines (648 calls)
        570ms  _get_weave()
          520ms  KVF.__init__, of which 380ms is reading .kndx
          110ms  bzrlib.store.versioned.filename ????
                 escaping the strings seems to be rather expensive:
                 30+ms for 650 calls, and we do it 2 times
       1299ms  get_lines
           94ms  working out what chunks to read
          118ms  keeping track of delta chains, etc.
          258ms  reading the raw data chunks from disk
          279ms  parsing 650 fulltexts
          260ms  parsing 3.3k deltas
           82ms  applying 3.3k deltas to fulltexts
          143ms  verifying the sha1 sum of 650 fulltexts
     1313ms  new_file (648 calls)
             dominated by 1s for file.close()
             112ms for writelines
              84ms for open
              36ms for _limbo_name
    350ms  apply()
      185ms  _apply_insertions()
      102ms  tree_write_locked() ??? 100ms for 1 call to lock the tree?
       40ms  updating the progress bar
     ~124ms  checking to see if files exist before we put a new one there
             (file_kind, abspath, + misc handling)
     70ms  resolving conflicts (when there aren't any), mostly spent in
           'find_conflicts' (_parent_loops, _overwrite_conflicts,
           _dupe_entries)
Looking at this, I think:

We need to figure out why tree_lock_write is so expensive. It costs somewhere
between 100-200ms each time we have to physically lock the tree, which seems
really slow considering all it does is create a directory, put a file there,
rename it, and read it back. Not to mention that we are doing it multiple
times; we probably spend at least 500ms just locking and unlocking all the
different objects. In that amount of time we can read and parse all of the
fulltexts.
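(For context, the physical lock is basically the create-and-rename dance
described above. A simplified sketch of that style of locking, with made-up
names, and not the exact LockDir implementation:)

    import errno
    import os

    def take_physical_lock(lock_dir, info_bytes):
        # Build the lock contents under a unique name, then atomically
        # rename into place; whoever wins the rename holds the lock.
        pending = '%s.pending-%d' % (lock_dir, os.getpid())
        os.mkdir(pending)
        with open(os.path.join(pending, 'info'), 'wb') as f:
            f.write(info_bytes)
        try:
            os.rename(pending, lock_dir)
        except OSError as e:
            if e.errno in (errno.EEXIST, errno.ENOTEMPTY):
                raise RuntimeError('lock already held by someone else')
            raise

    def release_physical_lock(lock_dir):
        os.remove(os.path.join(lock_dir, 'info'))
        os.rmdir(lock_dir)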
Spending 143ms verifying sha1 sums seems reasonable. It is something we could
easily take out, but it isn't a huge portion of the overall time.
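(The verification itself is just hashing the assembled lines and comparing
against the recorded sha1; equivalent in spirit to osutils.sha_strings():)

    import hashlib

    def sha1_of_lines(lines):
        s = hashlib.sha1()
        for line in lines:
            # lines are byte strings including their '\n' terminators
            s.update(line)
        return s.hexdigest()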
.kndx files are slow to read. Even with the Pyrex extension, they still cost
about 400ms. Fixing that probably requires a format bump, and it may be
something that Robert's work does really well with. We'd have to be a little
careful about pessimising certain cases when we put all entries in a single
index, but locality of reference and a faster-to-parse format may vastly
compensate for that.
I'm considering pushing another chunk down, so that instead of reading the
blocks into RAM, extracting them into lines, and then merging those lines
together to get texts, we have a single Pyrex function that works all of that
out: you give it the index information and ask it to extract the texts you
want. I don't think applying deltas is really a time sink, though having to
read everything into lines, rather than working on byte streams, might be a
problem.
To try to put this into something comparable, assume the time to read the
data out of the .knit files is fixed at 200ms (that is just the seek+read
calls we have to do, plus a small amount of bookkeeping):

 1299ms  get_lines
    94ms  working out what chunks to read
   118ms  keeping track of delta chains, etc.
   258ms  reading the raw data chunks from disk
   279ms  parsing 650 fulltexts
   260ms  parsing 3.3k deltas
    82ms  applying 3.3k deltas to fulltexts
   143ms  verifying the sha1 sum of 650 fulltexts
About 94+118+258+143ms (~613ms) of that is general overhead that wouldn't
really go away, so roughly 685ms goes to extracting all the fulltexts and
deltas from the gzip hunks into lists and applying them to get the final
output texts. We have 650 files and 3.3k deltas, for an average of about 5
deltas per file, and about 1.05ms per file to parse and apply those 5 deltas.
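(Spelling that arithmetic out:)

    overhead = 94 + 118 + 258 + 143   # planning + chain tracking + I/O + sha1 = 613ms
    extracting = 1299 - overhead      # ~686ms parsing fulltexts/deltas and applying them
    deltas_per_file = 3341 / 650.0    # ~5.1 deltas behind each of the 650 texts
    ms_per_file = extracting / 650.0  # ~1.05ms to parse and apply one file's chain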
xdelta can apply 16 deltas to builtins.py in about 61ms, while Mercurial's
bdiff can do it in about 15ms, or 9ms if you do it all at once, including the
time to extract the texts from zlib compression. If I'm counting correctly,
that makes the current Pyrex code about as fast at extracting deltas from
gzip hunks and applying them as any of the other possibilities.
The only way we could win with another format is if we can get rid of more
of the 'overhead', namely figuring out what chunks we need to read and what
order they need to be applied in. For bzr.dev code, at least, our
_apply_delta is super fast (munging moderately sized lists is speedy).
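(For anyone who hasn't looked at it, _apply_delta boils down to list
splicing. A simplified sketch, assuming delta hunks of the form
(start, end, new_lines) expressed against the original line numbering:)

    def apply_delta(lines, delta):
        # 'offset' tracks how much earlier hunks have grown or shrunk the
        # list we are actually splicing into.
        lines = list(lines)
        offset = 0
        for start, end, new_lines in delta:
            lines[start + offset:end + offset] = new_lines
            offset += len(new_lines) - (end - start)
        return lines

Per text this is just a handful of slice assignments, which lines up with the
82ms spent applying 3.3k deltas above.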
John
=:->
Attachment: pyrex_knit_extract.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20070725/0df786b5/attachment-0001.diff