New toy for bzr developers: B+Tree index sketch

John Arbash Meinel john at arbash-meinel.com
Tue Jul 1 20:07:16 BST 2008


Robert Collins wrote:
| On Tue, 2008-07-01 at 13:17 +1000, Robert Collins wrote:
|> I have been tinkering with a B+Tree approach to the disk indices. I
|> prefer this over a hash based index because of the ability to do
|> in-order traversal when desired.
|>
|> In short, I think we can save substantial size with the currently
|> developed B+Tree and it allows addressing some of the memory problems
|> Andrew has spent some time debugging.
|
| I have turned this into a repository implementation - currently only
| handling plain repositories - but it's easy enough to extend to rich
| roots and subtrees if there is interest in looking at this.
|
| lp:~lifeless/+junk/bzr-index2
|
| Known defects:
|  - iter_entries_prefix is not yet implemented (but it's not needed in
| bzr.dev either)
|  - multi-key searches do not use the index as well as they could to
| reduce IO. In particular we just bisect for each key.
|
| I'd love to see Ian's usertest run against --btree-plain, and any
| other benchmarks that are cheap in human time to perform. For
| instance, bzr's repository size has been compared against git's;
| seeing what this does to help with such comparisons might be nice, on
| a bzr-svn converted repository (though this needs a rich-root format -
| see above). I'll toss one of those in tomorrow, probably.
|
| As a reminder:
|
| http://bazaar-vcs.org/RobertCollins#head-b2dc9ae46f1860c46f79682db455c3512152d148
| has my notes on improving the index layer. This btree index -should-
| address:
|  size (roughly half as big as our current indices)
|  performance (theoretically much cheaper find-key logic)
|  memory utilisation (fixed size lru cache)
|
| To compare the two indices (using launchpad as a handy, sizable code
| base):
|  - btree : 7.8M
|  - graphindex : 22M
|
| A linear scan across all the keys in the indices:
|  - btree : 3.554 seconds
|  - graphindex : 13.013 seconds
|
| Random access - grab all the keys, permute, then access each key
| individually once:
|  - btree : 297 seconds (100 page cache)
|  - btree : 95 seconds (1000 page cache)
|  - graphindex : 25 seconds
|
| Not tested yet:
|  - high latency (NFS or Remote or http) links

Do you have simple scripts to make it easy to reproduce your results?

|
| I think the btree random access is so much slower because it is
| totally unoptimised. One thing is that it is always bisecting to find
| the page of a key. If we had a plain dict of keys to probe first, we
| could avoid
| that extra work. We could further eliminate the leaf nodes from the page
| cache (so getting more hits on locator nodes), but that would then
| require an IO on every key miss; so I'd prefer to not do that. Another
| possibility would be to put the keys from a leaf node into a plain dict,
| and have a callback on leaf node removal from the _node_cache to remove
| those keys from the dict.
|
| Clearly the raw access to all keys is cheaper, so it should be possible
| to have random access be roughly the same (modulo cache thrashing
| situations - and we should try to avoid those from higher up in the
| code).
|
| -Rob
|

I've been following the commit messages (though I *really* wish I could
have landed my earlier proposal so I could read them in "threaded"
mode....).

It certainly looks interesting.

I'm curious why you settled on 4k pages. I suppose when I was tuning
the Dirstate bisecting, 4k pages were the sweet spot. It seems like
larger pages would get you a bigger win on compression, though.

For example, zlib uses a 32k compression window. (bzip2 uses 900k
blocks, which is certainly >> 4k.)

I did a quick test where I grabbed the list of revision ids and added
them one at a time to a compressor until it output something. I had to
add 1323 revision ids, with a total length of 67,390 bytes, before the
compressor spat out 21,104 compressed bytes.
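
A sketch of that test (the names are mine, and I'm assuming rev_ids is
the list of revision ids as byte strings):

  import zlib

  def feed_until_output(rev_ids):
      # Feed revision ids into a streaming compressor one at a time and
      # stop as soon as it emits its first chunk of compressed bytes.
      c = zlib.compressobj()
      bytes_in = 0
      for count, rev_id in enumerate(rev_ids):
          bytes_in += len(rev_id)
          out = c.compress(rev_id)
          if out:
              return count + 1, bytes_in, len(out)
      return len(rev_ids), bytes_in, 0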

If I change it to use Z_SYNC_FLUSH after every revision_id, I get
67,390 => 35,872 bytes - roughly 2:1 compression instead of 3:1. If I
grab the whole set of revision ids from bzr and compress them all with
zlib in one shot, I get 3.1:1 compression; with bz2, 3.5:1.
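
The sync-flush variant is the same loop with a flush per id (again a
sketch, not the patch code):

  import zlib

  def sync_flush_total(rev_ids):
      # Flushing after every id keeps each one recoverable at a known
      # boundary, but costs ratio: ~2:1 here instead of ~3:1.
      c = zlib.compressobj()
      total = 0
      for rev_id in rev_ids:
          total += len(c.compress(rev_id))
          total += len(c.flush(zlib.Z_SYNC_FLUSH))
      total += len(c.flush())  # default Z_FINISH terminates the stream
      return total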

I wonder if we want to just buffer a certain amount of input data and
then do a full (fresh) compression each time. Doing:

  c = zlib.compress(''.join(rev_ids[:200]))

Takes about .4msec per loop. For the whole set:

  c = zlib.compress(''.join(rev_ids))

Takes 82.7 msec (for 23,030 revision ids). (If we were compressing them
200 at a time, it should be only ~46 msec, so the bulk of the time
seems to be overhead from compressing long strings.)
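
For the record, timings like these come from a timeit harness, roughly
(hypothetical; it assumes rev_ids is importable from __main__):

  import timeit

  # With number=1000, the total in seconds reads as msec per loop.
  t = timeit.timeit("zlib.compress(b''.join(rev_ids[:200]))",
                    setup="import zlib; from __main__ import rev_ids",
                    number=1000)
  print("%.3f msec per loop" % t)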

Your notes seem to say it should compress approximately every 40 rows
(to get 4k pages). So if we delayed until 30 rows and then compressed
on every added row until we hit the maximum, we might expect to
compress on, say, 20 rows out of every 50. Rounding the per-compression
overhead up to 1ms, that is 20ms for every 50 rows, or an average of
0.4ms per row.

Looking at bzr.dev, we have ~84k rows across all indices, so it would
cost an extra ~33s to 'optimize' the compression. 'time bzr branch
bzr.dev' takes a total of 1m3.4s locally, so adding 33s would be a
large overhead.
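
The delayed-start strategy I'm estimating there would look roughly like
this (a sketch with made-up names; chunk_size is the 4k page budget,
and the first 'delay' rows are accepted unchecked):

  import zlib

  def fill_page(rows, chunk_size=4096, delay=30):
      # Accept the first 'delay' rows blindly, then do a full fresh
      # compression on each added row until the page would overflow.
      accepted = []
      for row in rows:
          if len(accepted) >= delay:
              candidate = zlib.compress(b''.join(accepted + [row]))
              if len(candidate) > chunk_size:
                  break  # this row starts the next page
          accepted.append(row)
      return accepted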

In my quick and dirty attempt, the total branch times, depending on
when I start checking compression, were:

1m08s  original code
1m29s  bytes_seen > chunk_size
1m25s  bytes_seen > 1.8 * chunk_size

However, there is a considerable saving on disk space:

  du -ksh .bzr/repository/indices

gives 4.7MB with Robert's version and 3.3MB with mine (about 1.4:1, or
a ~30% saving).

I think the big win is that I never try to "guess" whether there is
going to be wasted space; I just assume the write is going to fit.

Actually, I thought about it some more: I can still use
"compressobj.flush(Z_SYNC_FLUSH)" as a first-pass check, and fall back
to repacking if that fails. Doing so, the time was:

1m16s  compressobj, but try extra pack passes

The final indices are only 3.3MB in size. YAY!
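
In outline, the first-pass-then-fallback idea is something like the
following (an illustrative sketch, not the actual chunk_writer code):

  import zlib

  class ChunkWriterSketch(object):

      def __init__(self, chunk_size=4096):
          self.chunk_size = chunk_size
          self.rows = []
          self.comp = zlib.compressobj()
          self.bytes_out = 0

      def write(self, row):
          # First pass: cheap streaming check via Z_SYNC_FLUSH. This is
          # pessimistic, because sync flushes waste space.
          self.bytes_out += len(self.comp.compress(row))
          self.bytes_out += len(self.comp.flush(zlib.Z_SYNC_FLUSH))
          if self.bytes_out <= self.chunk_size:
              self.rows.append(row)
              return True
          # Fallback: recompress everything in one shot to see whether
          # the row really fits before declaring the page full. (Once
          # the cheap check starts failing, later writes keep paying
          # this full-recompression cost - the "extra pack passes".)
          packed = zlib.compress(b''.join(self.rows + [row]))
          if len(packed) <= self.chunk_size:
              self.rows.append(row)
              return True
          return False  # page full; the caller starts a new chunk

      def finish(self):
          # Final bytes for the page: one clean compression of the
          # accepted rows, with no sync-flush waste.
          return zlib.compress(b''.join(self.rows))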

Attached is a patch. Unfortunately, it breaks quite a few of your
higher-level tests, because now that I'm packing more efficiently, you
don't actually spill over where you are expecting. I didn't bother to
fix those tests, but I *did* fix the tests for the chunk_writer.

With the final version I have posted here, the time to do "bzr branch
bzr.dev" into a btree-plain repository is

1m08.4s  original
1m14.3s  extra_packed

So it is adding around 6s of extra time to do the extra packing checks.
But at 70% of the original disk space, that ends up being a pretty big
win, IMO. That is a considerable increase in information density,
meaning less data to download, fewer round trips, etc.

If we want, we can add back in one of the "while size_in < chunk_size"
checks. I took it out because I thought the code was clearer without
it, but it does shave 1s off the total time (1m13s). While that is
probably within the error margin, it is 1s out of the 6s overhead added
by this repacking. I guess I need a better test which *just* times the
pack-indices step, to eliminate the extra noise.

John
=:->

PS> It would all be so much simpler if we could just snapshot the state
of the compressor: do the flush, restore the state, etc. zlib itself
supports this with deflateCopy(), though the documentation warns:

... Note that deflateCopy duplicates the internal compression state
which can be quite large, so this strategy is slow and can consume lots
of memory.
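
For what it's worth, Python exposes deflateCopy() as the .copy() method
on compress objects, so a snapshot-probe-discard helper would look like
this (a sketch; the same speed/memory caveat applies):

  import zlib

  def probe_size(comp, bytes_already_out):
      # Snapshot the live compressor (deflateCopy underneath), finish
      # the copy, and throw it away; the original stream is untouched.
      snapshot = comp.copy()
      return bytes_already_out + len(snapshot.flush())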


I also tried switching to 8192-byte pages. With my code, that drops the
size to 3.2MB - about a 3% saving. Certainly not nearly as much as just
trying to pack more into the existing 4096 bytes.

Attachment: pack_harder.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20080701/5ba6c1f5/attachment-0001.diff

