[RFC] Extract unannotated texts from Knits
John Arbash Meinel
john at arbash-meinel.com
Wed Jul 25 20:29:07 BST 2007
The attached patch is layered on top of my earlier one for changing the
layering for extracting data from Knits.
Basically, that patch made it easier for the _KnitData and _Knit*Factory
code to extract only what we are going to use (the texts, not including the
annotations). The attached patch makes use of that to move all of the
extraction code into Pyrex, and to skip over the annotations there.
The big win here is that we remove all of the 'string.split()' calls,
especially when we are just throwing away part of the data.
A small win is using zlib directly, instead of going through
tuned_gzip.GzipFile (I think about 200ms of the 5s was saved on gzip
extraction).
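(To illustrate the zlib-vs-GzipFile point: at the Python level the idea is
roughly the following. The actual patch does the equivalent from Pyrex, so
the function and calls here are only illustrative.)

    import zlib

    def extract_record_lines(raw_gzip_data):
        # wbits = 16 + MAX_WBITS tells zlib to handle the gzip header and
        # trailer itself, so we never construct a GzipFile object at all.
        text = zlib.decompress(raw_gzip_data, 16 + zlib.MAX_WBITS)
        # Knit records are newline-terminated lines; keep the endings.
        return text.splitlines(True)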
To give the approximate benefits:

1) Using _get_text_map() instead of _get_content_maps(): about 10%, 5s => 4.5s
2) Using direct calls to zlib's inflate() instead of gzip.GzipFile: about 5%,
   4.5s => 4.25s
3) Switching to an extractor that avoids creating PyStrings for the
   annotations at all: about 20%, 4.25s => 3.5s (a sketch of the idea follows
   below)
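(A rough Python-level sketch of point 3, assuming the annotated-knit line
layout of "origin-revision-id <space> text"; the Pyrex version does the
equivalent at the C level so that no PyString is ever built for the
annotation column:)

    def unannotated_lines(annotated_lines):
        # Each stored line looks like "<origin> <text>\n".  Slicing from just
        # past the first space gives the text without materializing the
        # origin as its own string, which line.split(' ', 1) would do.
        return [line[line.index(' ') + 1:] for line in annotated_lines]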
So in total, with this patch, I can do:

  bzr checkout bzr.dev test

in 3.5s rather than 5s, roughly 1.4x faster. The bulk of this win is in
Knit.get_lines(), which improves from 3.75s under lsprof to 1.3s with my
patch (about 2.9x faster).
This code is only an RFC because I need to add a bunch more test cases to
the direct tests (all the error conditions, no lines, too many lines, etc.).
Also, I wrote the code so that I could have a fixed-size decompression buffer
and call zlib's inflate() multiple times, but that part has not been
implemented yet. So there are a couple of edge cases it doesn't handle (a
delta block > 1MB uncompressed, and, harder to deal with, a single
delta/fulltext *line* > 1MB).
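(In pure Python the fixed-buffer idea would look something like this; the
real thing would call inflate() from Pyrex with a preallocated C buffer, so
zlib.decompressobj() and its max_length argument are just the closest
stand-in:)

    import zlib

    def decompress_in_chunks(raw_gzip_data, chunk_size=1024 * 1024):
        # Produce at most chunk_size bytes of output per call, instead of
        # inflating a whole record into one potentially huge string.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
        data = raw_gzip_data
        while data:
            chunk = d.decompress(data, chunk_size)
            if chunk:
                yield chunk
            if d.unconsumed_tail == data:
                break       # no forward progress; don't spin on bad input
            data = d.unconsumed_tail
        tail = d.flush()
        if tail:
            yield tail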
I'm a little unhappy that 'cp -ar' is in the 0.3s range, which means we are
about 10x slower than a plain 'cp -ar'. Profiling puts file.close() at about
1s (which is the time spent actually writing out the data), which is extra
strange since cp -ar manages the whole copy in 300ms. I at least wonder
whether cp is able to use memmapped files rather than direct access.
Under lsprof it takes 5s, which breaks down into:

 1023ms  file.close()  (should be the flush-to-disk time)
  380ms  loading 650 .kndx files
  280ms  extracting 650 fulltexts
  250ms  extracting 3341 deltas
  500ms  get_revision_graph()
           136ms calling get_parents() 12.5k times
           250ms calling get_graph() 1 time
         I don't quite understand why we call get_parents directly for each
         revision, and then *also* call get_graph. Note that get_graph *also*
         calls get_parents 15k times. In total, we spend 385ms in 31.2k
         get_parents() calls.
  112ms  calling file.writelines() 651 times
  112ms  calling lockable_files.unlock() 137 times
  148ms  calling Transport.readv() 650 times, which calls _seek_and_read
         4641 times for a total of 1211 actual seek+reads on disk.
         What I don't understand is why 4641 == 3341 + 650 + 650; for some
         reason it seems like we have an extra 650 reads in there. (A sketch
         of the kind of offset coalescing that turns 4641 requests into 1211
         reads follows this breakdown.)
  316ms  in 7 calls to mutable_tree.tree_write_locked()
         At the very least, we seem to be physically locking the tree
         multiple times. I believe this is because we lock it during
         creation, and then again when we start updating the tree. I've said
         before that initialization should probably return a locked tree,
         but I'm not sure whether that would propagate around in an obvious
         way.
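(The coalescing mentioned above is conceptually just merging sorted
(offset, length) pairs that touch or nearly touch. The name and details here
are made up for illustration; this is not the actual _seek_and_read code:)

    def coalesce_offsets(offsets, fudge=0):
        # offsets: iterable of (offset, length) requests.
        # Returns merged (offset, length) ranges; the caller issues one
        # seek+read per merged range and slices the original requests back
        # out of the result.
        merged = []
        for offset, length in sorted(offsets):
            if merged and offset <= merged[-1][0] + merged[-1][1] + fudge:
                prev_offset, prev_length = merged[-1]
                end = max(prev_offset + prev_length, offset + length)
                merged[-1] = (prev_offset, end - prev_offset)
            else:
                merged.append((offset, length))
        return merged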
Another top-down view shows:

 3880ms  transform._build_tree
   3241ms  new_by_entry
     1888ms  get_file_lines (648 calls)
        570ms  _get_weave()
          520ms  KVF.__init__, of which 380ms is reading .kndx
          110ms  bzrlib.store.versioned.filename ????
                 escaping the strings seems to be rather expensive:
                 30+ms for 650 calls, and we do it 2 times
       1299ms  get_lines
           94ms  working out what chunks to read
          118ms  keeping track of delta chains, etc.
          258ms  reading the raw data chunks from disk
          279ms  parsing 650 fulltexts
          260ms  parsing 3.3k deltas
           82ms  applying 3.3k deltas to fulltexts
          143ms  verifying the sha1 sum of 650 fulltexts
     1313ms  new_file (648 calls)
             dominated by 1s for file.close()
             112ms for writelines
              84ms for open
              36ms for _limbo_name
    350ms  apply()
      185ms  _apply_insertions()
      102ms  tree_write_locked() ??? 100ms for 1 call to lock the tree?
       40ms  updating the progress bar
     ~124ms  checking to see if files exist before we put a new one there
             (file_kind, abspath, + misc handling)
     70ms  resolving conflicts (when there aren't any), mostly spent in
           'find_conflicts' (_parent_loops, _overwrite_conflicts,
           _dupe_entries)
Looking at this, I think:

We need to figure out why tree_lock_write is so expensive. It costs somewhere
between 100-200ms each time we have to physically lock the tree, which seems
really slow considering all it does is create a directory, put a file there,
rename it, and read it back. Not to mention that we are doing it multiple
times; we probably spend at least 500ms just locking and unlocking all the
different objects. In that amount of time we can read and parse all of the
fulltexts.
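(For context, the physical lock is basically the create-and-rename dance
described above. A simplified sketch of that style of locking, with made-up
names, and not the exact LockDir implementation:)

    import errno
    import os

    def take_physical_lock(lock_dir, info_bytes):
        # Build the lock contents under a unique name, then atomically
        # rename into place; whoever wins the rename holds the lock.
        pending = '%s.pending-%d' % (lock_dir, os.getpid())
        os.mkdir(pending)
        with open(os.path.join(pending, 'info'), 'wb') as f:
            f.write(info_bytes)
        try:
            os.rename(pending, lock_dir)
        except OSError as e:
            if e.errno in (errno.EEXIST, errno.ENOTEMPTY):
                raise RuntimeError('lock already held by someone else')
            raise

    def release_physical_lock(lock_dir):
        os.remove(os.path.join(lock_dir, 'info'))
        os.rmdir(lock_dir)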
Spending 143ms verifying sha1 sums seems reasonable. It is something we could
easily take out, but it isn't a huge portion of the overall time.
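(The verification itself is just hashing the assembled lines and comparing
against the recorded sha1; equivalent in spirit to osutils.sha_strings():)

    import hashlib

    def sha1_of_lines(lines):
        s = hashlib.sha1()
        for line in lines:
            # lines are byte strings including their '\n' terminators
            s.update(line)
        return s.hexdigest()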
.kndx files are slow to read. Even with the Pyrex extension, they still cost
about 400ms. Fixing that probably requires a format bump, and it may be
something that Robert's work does really well with. We'd have to be a little
careful about pessimising certain cases when we put all entries in a single
index, but locality of reference and a faster-to-parse format may vastly
compensate for that.
I'm considering pushing another chunk down, so that instead of reading the
blocks into RAM, extracting them into lines, and then merging those lines
together to get texts, we have a single Pyrex function that works all of that
out: you give it the index information and ask it to extract the texts you
want. I don't think applying deltas is really a time sink, though having to
read everything into lines, rather than working on byte streams, might be a
problem.
To try to put this into something comparable, assume the time to read the
data out of the .knit files is fixed at 200ms (that is just the seek+read
calls we have to do, plus a small amount of bookkeeping):

 1299ms  get_lines
    94ms  working out what chunks to read
   118ms  keeping track of delta chains, etc.
   258ms  reading the raw data chunks from disk
   279ms  parsing 650 fulltexts
   260ms  parsing 3.3k deltas
    82ms  applying 3.3k deltas to fulltexts
   143ms  verifying the sha1 sum of 650 fulltexts
About 94+118+258+143ms (~613ms) of that is general overhead that wouldn't
really go away, so roughly 685ms goes to extracting all the fulltexts and
deltas from the gzip hunks into lists and applying them to get the final
output texts. We have 650 files and 3.3k deltas, for an average of about 5
deltas per file, and about 1.05ms per file to parse and apply those 5 deltas.
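(Spelling that arithmetic out:)

    overhead = 94 + 118 + 258 + 143   # planning + chain tracking + I/O + sha1 = 613ms
    extracting = 1299 - overhead      # ~686ms parsing fulltexts/deltas and applying them
    deltas_per_file = 3341 / 650.0    # ~5.1 deltas behind each of the 650 texts
    ms_per_file = extracting / 650.0  # ~1.05ms to parse and apply one file's chain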
xdelta can apply 16 deltas to builtins.py in about 61ms, while Mercurial's
bdiff can do it in about 15ms, or 9ms if you do it all at once, including the
time to extract the texts from zlib compression. If I'm counting correctly,
that makes the current Pyrex code about as fast at extracting deltas from
gzip hunks and applying them as any of the other possibilities.
The only way we could win with another format is if we can get rid of more
of the 'overhead', namely figuring out what chunks we need to read and what
order they need to be applied in. For bzr.dev code, at least, our
_apply_delta is super fast (munging moderately sized lists is speedy).
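(For anyone who hasn't looked at it, _apply_delta boils down to list
splicing. A simplified sketch, assuming delta hunks of the form
(start, end, new_lines) expressed against the original line numbering:)

    def apply_delta(lines, delta):
        # 'offset' tracks how much earlier hunks have grown or shrunk the
        # list we are actually splicing into.
        lines = list(lines)
        offset = 0
        for start, end, new_lines in delta:
            lines[start + offset:end + offset] = new_lines
            offset += len(new_lines) - (end - start)
        return lines

Per text this is just a handful of slice assignments, which lines up with the
82ms spent applying 3.3k deltas above.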
John
=:->
Attachment: pyrex_knit_extract.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20070725/0df786b5/attachment-0001.diff