brisbane-core+ric-commit results

Thu Mar 26 21:54:42 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Ian Clatworthy wrote:
>> Comparing the full python 3.0 repo using latest bzr.dev
>> (1.9 format) against latest brisbane-core (gc-chk255-big format)
>> using my work-in-progress usertest branch, we're looking
>> pretty good now on the whole. John is working on the
>> branch-outside-a-shared-repo case and I believe still has
>> some unlanded stuff that ought to help ls.
> 
>> See attached for the summary results.
> 
>> Ian C.
> 
>> PS: my brisbane-core has Robert's pending commit patch applied
>> (with the necessary subtree-related one line tweak).
> 
> Branch:unshared	67.9	360.6 bzr-btree branch $work_basename
>                                repo/${work_basename}
> 

I just committed 2 patches that speed up "bzr branch bzr-gcc255" from
3m54s to 2m11s. (Not quite 2x faster.) The primary change was to get rid
of small_set.difference_update(large_set), and instead go with small_set
= set(small).difference(large_set). [Note that 'bzr branch bzr.dev' is
1m16s including building a working tree]

It also changed the "bzr branch launchpad --lsprof" time from 44min down
to 26min (2790s => 1577s).

Python doesn't optimize difference_update very well, which we've run
into in the past. small.difference(large) is O(small), while
small.difference_update(large) is O(large).
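
As a rough illustration of the two spellings (a sketch, not the actual
bzrlib change): the function names and data sizes below are made up for
the example, and whether a timing gap shows up depends on the Python
version, since newer CPython releases may special-case a much larger
argument to difference_update.

```python
import timeit


def remove_with_difference_update(small, large_set):
    """Old spelling: copy `small`, then remove in place.

    Per the message above, this walks every element of `large_set`.
    """
    result = set(small)
    result.difference_update(large_set)
    return result


def remove_with_difference(small, large_set):
    """New spelling: build a fresh set by walking only `small`."""
    return set(small).difference(large_set)


def main():
    # Hypothetical sizes: 20 keys checked against a million known keys.
    large_set = set(range(1000000))
    small = list(range(999990, 1000010))
    for func in (remove_with_difference_update, remove_with_difference):
        seconds = timeit.timeit(lambda: func(small, large_set), number=5)
        print("%s: %.4fs for 5 runs" % (func.__name__, seconds))


if __name__ == "__main__":
    main()
```

Both functions return the same set; only the amount of work done to get
there differs.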

The second fix is to use something cheaper than _bytes_to_entry() which
has to decode('utf8') various strings.

According to lsprof, we can save about 35s more if we stop calling
iter_interesting_nodes 2 times.

I also think that once we are up into the Launchpad/Python size
repositories, we are thrashing the BTree leaf node cache. The problem is
that we have ~300k nodes in a single btree, and all keys are evenly
spread out.

According to lsprof, we parse 96k btree pages during 'bzr branch
launchpad'. And if I look at "head -n5" I can see that we have:

 3190 .cix
  460 .iix
  462 .rix
  279 .six
 1931 .tix
 ---------
 6322

So we are parsing each btree page approximately 15 times. My guess is
that it is actually all in the .cix, and we are parsing those 30 times
each, and the others we aren't being too violent with. (.tix might
overflow, but it shouldn't matter because we get locality of references
there, as we group the fetch for each file together.)
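
The arithmetic behind those estimates, spelled out (Python 3 division;
"96k" is taken as 96000, which is a rounding assumption):

```python
# Page counts per index type, as listed above.
pages_per_index = {
    ".cix": 3190,
    ".iix": 460,
    ".rix": 462,
    ".six": 279,
    ".tix": 1931,
}
parsed_pages = 96000  # "96k" from lsprof, assumed to be a rounded figure

total_pages = sum(pages_per_index.values())
# Average parses per page if the load were spread over every index.
parses_per_page = parsed_pages / total_pages
# Parses per page if every parse landed in the .cix index.
parses_per_cix_page = parsed_pages / pages_per_index[".cix"]

print(total_pages)                    # 6322
print(round(parses_per_page, 1))      # 15.2
print(round(parses_per_cix_page, 1))  # 30.1
```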

Note that we get 346870/3190 = 108 chk entries per btree page, and that
our btree page cache size is 1000 nodes. So if you figure there is a
1-in-3 chance (1000/3190) of having a given page in cache, that works
out to 1000/3190 * 108 entries = 34.

There may be other things going on, as branching 'bzr.dev' parses 2415
leaf nodes, but only has 1800 nodes. None of the individual btrees has
more than 1000 nodes, so none of them should actually overflow the cache.

This is the sort of thing that we would want to address with a new index
format. In the short term, we could just increase the btree leaf node
cache size during certain operations.
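
To get a feel for how much the cache size matters, here is a toy
simulation. It is only a sketch under stated assumptions: lookups are
uniformly random over the pages (a stand-in for "all keys are evenly
spread out"), eviction is plain LRU, and count_page_parses is a
hypothetical helper, not anything in bzrlib. The 3190 pages and the
1000-node cache come from the numbers above; the 4000 figure is the
"4k" setting from the timing table further down.

```python
import random
from collections import OrderedDict


def count_page_parses(num_pages, cache_size, num_lookups, seed=0):
    """Count cache misses (page parses) for uniformly random lookups.

    Models an LRU cache of parsed pages: a miss means the page has to
    be parsed again, and the least recently used page is evicted once
    the cache is over capacity.
    """
    rng = random.Random(seed)
    cache = OrderedDict()
    parses = 0
    for _ in range(num_lookups):
        page = rng.randrange(num_pages)
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
        else:
            parses += 1
            cache[page] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return parses


if __name__ == "__main__":
    lookups = 100000  # arbitrary number of lookups for the simulation
    for size in (1000, 4000):
        parses = count_page_parses(3190, size, lookups)
        print("cache=%d: %d parses for %d lookups" % (size, parses, lookups))
```

With a cache at least as large as the index, each page is parsed at most
once; with the smaller cache, most lookups in this model miss.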

Note that out of all this stuff that we are doing to copy a launchpad
tree, we only spend 1.4s in 328 calls to '_get_block', which is actually
reading the compressed content from the .pack file, and 6.1s to
zlib.decompress() the content. Yet we spend 19s in zlib.decompress()
to analyze the btree pages. Of the (now) 1577s spent to branch
launchpad, we spend 790s dealing with the index.

So for branching Launchpad:

44m	Old brisbane-core
26m	new brisbane-core
17m	set btree leaf cache to 4k

However, part of this is also because as we are generating the new
content, we are querying it to ensure that the keys are not duplicated.
Also, we are spilling the new CHK nodes to disk, and then combining them.

I tried setting "random_id = True" but I added a check, and for some
very strange reason, we are sending the same key multiple times. It
seems to be one of the last steps that we do, so I'm not sure if there
is a bug in the chk streams (near the end), or if it is a bug with
signatures, or something else.
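
For illustration only, a check of that kind could look something like
the sketch below. find_duplicate_keys and the sample keys are
hypothetical; this is not the check that was actually added.

```python
from collections import Counter


def find_duplicate_keys(keys):
    """Return {key: count} for every key seen more than once."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up keys standing in for whatever the stream sends.
    stream = [("key-a",), ("key-b",), ("key-a",), ("key-c",)]
    for key, n in sorted(find_duplicate_keys(stream).items()):
        print("key %r was sent %d times" % (key, n))
```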

Regardless, there is still an extra 190s (3 min) spent because we call
"iter_interesting_nodes()" 2 times, instead of once.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknL+aIACgkQJdeBCYSNAANQPQCgyiavvKroqFadkQYYrtKHG4DD
vZkAn2YtqEmF+kW8E6DwqEszmUlCYm2/
=OJZa
-----END PGP SIGNATURE-----