VF.insert_record_stream() with duplicates in stream

John Arbash Meinel john at arbash-meinel.com
Fri Mar 27 14:45:46 GMT 2009


I've been looking at the performance of "bzr branch" outside of a shared
repository, for GC format repositories. We would like to get away from
having a very specific code path and switch to using the more generic
"insert_record_stream(get_record_stream())", so I'm trying to optimize
that code path.
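
For reference, the generic path is roughly this (glossing over ordering
and delta-closure details, and assuming 'keys' is the set of keys we
want to fetch):

  source = source_repo.texts        # any VersionedFiles will do
  target = target_repo.texts
  stream = source.get_record_stream(keys, 'unordered',
                                    include_delta_closure=True)
  target.insert_record_stream(stream)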

I found that I can go from 26m down to 14m if I set random_id=True (when
branching all of Launchpad). This changes things so that we don't check
whether a given key already exists in the target index.
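
Conceptually, the check that random_id=True skips is just this (a toy
sketch with a plain dict standing in for the index; the real code lives
down in the pack/btree index layer):

  def insert_records(index, records, random_id=False):
      # Illustrative only, not the actual bzrlib code.
      for key, value in records:
          if not random_id and key in index:
              # This is the existence check we get to skip.  Cheap for
              # an in-memory dict; not cheap once the index has spilled
              # to disk.
              continue
          index[key] = value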

This seemed like a "safe" change to me, because it is a little bit silly
for "fetch" to send you the same object twice. However, after making
that change, I ran into a test for this exact case:

ERROR: test_insert_record_stream_existing_keys
(bzrlib.tests.test_versionedfile.TestVersionedFiles)

Which does 2 inserts into the same VF, where the records in the second
insert are redundant with the ones in the first.
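
The scenario it exercises is basically (not the literal test body):

  stream = source.get_record_stream(keys, 'unordered', False)
  target.insert_record_stream(stream)
  # ...and then again, with all of those keys already present:
  stream = source.get_record_stream(keys, 'unordered', False)
  target.insert_record_stream(stream)
  # the second insert is expected to quietly skip the duplicates,
  # not error out or double-insert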

So I'm at a bit of a loss for how to recover here. Normally, checking
whether a key already exists in the target VF isn't very expensive (it
comes down to a dict lookup).

However, once a btree has spilled to disk, the check requires a btree
search and paging the contents back in from disk. And when we combine
spilled indices, the final target index overflows the cache size,
causing us to thrash.
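
To put a number on the thrashing: with a strict LRU node cache that is
smaller than the number of pages, a repeated sweep over the index gets
essentially no cache hits at all. A toy simulation (not bzrlib's actual
node cache) shows the cliff:

  def hit_rate(page_requests, cache_size):
      # Toy LRU: most recently used page at the end of the list.
      cache = []
      hits = 0
      for page in page_requests:
          if page in cache:
              hits += 1
              cache.remove(page)
          elif len(cache) >= cache_size:
              cache.pop(0)              # evict least recently used page
          cache.append(page)
      return float(hits) / len(page_requests)

  sweep = list(range(3200)) * 3         # 3200 pages, swept three times
  print(hit_rate(sweep, 4000))          # ~0.67: only the first pass misses
  print(hit_rate(sweep, 2000))          # 0.0: every page evicted before reuse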

I was able to get most of the performance back by setting the btree
cache size to 4000 (since we only have 3200 btree pages). Though at that
point you are keeping in memory all the bits that you just flushed to
disk, and we would be far better off just setting "spill_at=400k".
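
If we went that route, it is just a matter of passing a bigger spill_at
when the target index builder is created. Something like this (assuming
BTreeBuilder still takes spill_at as a constructor argument, with
illustrative values for the other parameters):

  from bzrlib import btree_index

  # Today each spilled index ends up holding 100k keys; 400k would keep
  # a Launchpad-sized fetch entirely in memory, so nothing ever spills.
  builder = btree_index.BTreeBuilder(reference_lists=1, key_elements=2,
                                     spill_at=400000)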

The other option is to not combine spilled indexes at all, since each
index holds 100k keys, which fits within an individual btree cache. But
yet again, we are caching *all* of the keys in memory, just doing it in
an even less optimal way.

I suppose one option would be to promote "random_id=True" up to the
'insert_record_stream()' level, and have code that knows it is giving a
unique stream declare it as such.
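
Something along these lines, with the default preserving today's
behaviour (just a sketch of the signature, not a worked-out patch):

  def insert_record_stream(self, stream, random_id=False):
      """Insert records from stream into this VersionedFiles.

      :param random_id: If True, the caller guarantees that no key in
          the stream is already present in the target (or repeated
          within the stream), so the existing-key check can be skipped.
      """

  # and the fetch code, which builds the key list itself, would call:
  target_vf.insert_record_stream(stream, random_id=True)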

I wouldn't fret so much about this if branching weren't 2x faster with
the check disabled.

John
=:->


