B+Tree indices: ongoing progress

John Arbash Meinel john at arbash-meinel.com
Wed Jul 2 04:33:19 BST 2008


Robert Collins wrote:
| On Tue, 2008-07-01 at 21:59 -0500, John Arbash Meinel wrote:
|> So... If you trust that we will always get at least 2:1 compression
|> (it seems reasonable given the amount of entropy in our revision_ids),
|> then I can cut the time down quite a bit, and retain *most* of the
|> saving.
|>
|> (9b) gives 4,612,096 bytes versus 6,172,672 without repacking, which
|> is 1:1.34 instead of the maximum 1:1.4.
|>
|> Note that this is also one of those places where you would probably
|> like "bzr pack" to set max_repacks = 100, and have it go to town.
|>
|> Interestingly, if I set compression_level = 9 I get exactly the same
|> compression.
|
| re safety:
| As long as in the event of an overrun it *at worst* asserts during
| writing, I'm _ok_. To be happy I'd like it to always work no matter what
| (short of 4K long values :P). I expect that bzr-git revids will be very
| high entropy and thus unlikely to get 2:1 compression.

As in, it raises an AssertionError?

And IIRC we actually store hex(git_sha1), which gives us at least 2:1
headroom because of the expansion to hex digits (len(hex(x)) == len(x)*2).
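
Roughly, in plain Python (just an illustration of that slack, not code
from bzr-git):

import binascii
import os

raw_sha1 = os.urandom(20)              # stand-in for a binary git sha1
hex_sha1 = binascii.hexlify(raw_sha1)  # what actually gets stored
assert len(hex_sha1) == len(raw_sha1) * 2   # 40 == 2 * 20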

It is trivial to error out in .finish() if the compressed byte count is > chunk_size.

And further, the caller *could* reset itself, because we have been
tracking the 'bytes_in'.
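
Something like this is what I have in mind (a sketch only; the real
ChunkWriter code is not this, and names like bytes_in are just how I
have been describing it):

import zlib

class ChunkWriter(object):
    """Sketch: compress rows into a single fixed-size chunk."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.bytes_in = []             # raw rows, kept so the caller can retry
        self._compressor = zlib.compressobj()
        self._compressed = []

    def write(self, bytes):
        self.bytes_in.append(bytes)
        self._compressed.append(self._compressor.compress(bytes))

    def finish(self):
        self._compressed.append(self._compressor.flush())
        result = ''.join(self._compressed)
        if len(result) > self.chunk_size:
            # Refuse to overrun silently; the caller still holds bytes_in
            # and can split the rows across two chunks instead.
            raise AssertionError('chunk overflow: %d > %d'
                                 % (len(result), self.chunk_size))
        return result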

If you are okay with plain 'punt for now', then I'm very happy to bring
it in.

|
| re: compression level - that's because of the window size - we're dealing
| in a very small stream anyway. We can drop the compression but we can't
| raise it using zlib.
|
| re: performance - I found while tuning commit to be hg-speed that we
| really need to eliminate fat all the way through the process. At the end
| of that process I was seeking 0.1 second wins throughout the code base.
| (On a 20K file tree - or 1/4 the size of this data set. So any expansion
| here is something we will need to counteract to prevent commit()
| becoming slower).
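
Right, and you can see the window-size point directly with zlib on a
small stream (a quick sketch, with made-up revision ids):

import zlib

sample = ('someone@example.com-20080701215959-abcdefghijklmnop\n' * 60
          ).encode('ascii')
sizes = [len(zlib.compress(sample, level)) for level in (1, 6, 9)]
# For data well under zlib's 32K window, levels 6 and 9 generally come
# out the same size; only dropping the level makes a difference.
print(sizes)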

Well, is this a commit that triggers a repack, or a plain commit?
With only 1 node being added to a fresh pack file, we are unlikely to
hit any of the node packing code.
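
(That is, packing should only kick in once a leaf chunk actually fills.
A hypothetical sketch, with made-up names rather than the real builder:)

CHUNK_SIZE = 4096

class LeafWriterSketch(object):
    """Hypothetical: count how often the packing path is taken."""

    def __init__(self):
        self.rows = []
        self.bytes_used = 0
        self.repacks = 0

    def add(self, row):
        if self.bytes_used + len(row) > CHUNK_SIZE:
            self.repacks += 1          # only a full chunk pays the packing cost
            self.bytes_used = 0
        self.rows.append(row)
        self.bytes_used += len(row)

writer = LeafWriterSketch()
writer.add('one-new-revision-id for a plain commit\n')
assert writer.repacks == 0             # a single new node never repacks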

|
| And some good news:
| Baseline overhead: Read 92994 keys in 7.075
| GraphIndex: Wrote 10147567 bytes in 10.949
| GraphIndex: iter_all_entries in 9.671
| GraphIndex: iter_random_one in 11.915
| GraphIndex: get_components_positions(2000) in 20.535
| GraphIndex: search_from_revid in 6.516
| GraphIndex: miss torture in 343.138
| GraphIndex: -------Done---------
| Baseline overhead: Read 92994 keys in 7.072
| BTreeIndex: Wrote 4104336 bytes in 20.299
| BTreeIndex: iter_all_entries in 1.850
| BTreeIndex: iter_random_one in 115.365
| BTreeIndex: get_components_positions(2000) in 33.332
| BTreeIndex: search_from_revid in 4.581
| BTreeIndex: miss torture in 41.803
| BTreeIndex: -------Done---------
|
| Note the search_from_revid and miss torture results.
|
| -Rob


The miss torture result (343s down to 42s) is very nice to see, and
graph traversal is a decent way of showing some of that.

I'll see if I can't fix up your iter_random_one(), though :).

John
=:->


