[RFC] New generic graph storage format

John Arbash Meinel john at arbash-meinel.com
Fri Sep 7 23:32:22 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



...

>>> Within a single block, the sorted keys/hash pairs is stored as
>>> fixed-size rows to make searching for a key which we know is
>>> present fast, and not require an additional indexing step. This is
>>> sensible when there are many keys with very similar lengths, which
>>> is the case for bzr. For the Python implementation, I will likely
>>> put the keys/hashes into a dictionary, but for an efficient C
>>> version we can directly binary search the bytes because they are
>>> sorted.
>> We profiled doing bisection in python and while I don't know how many
>> records one needs, it is faster to avoid parsing all the data we do not
>> need up front. Just a note :). Profiling as always is the canonical
>> answer.
> 
> Was bisection in Python reasonably fast?

Short answer, yes. There are some factors, but even python's bisect.bisect_left
is actually written in C (as of python2.4 look at the bottom of bisect.py where
it imports the pre-compiled ones).

It depends what you are comparing it against. There is code inside
bzrlib/dirstate.py which does pure-python bisection of an on-disk format. It
can bisect for multiple keys at a time, and bisects by reading in pages,
parsing the first and last record in the page, and then deciding what to do
with the current list of keys. (Are they before the first, after the last, or
somewhere in the page).

It handles variable sized records, as long as they use clear separators. (We
use '\n' as the end of a record and '\0' as a field separator.)

A common case is to have a single parent (so that you have 2x"columns" of data).

To read all the entries in the file looked like:
test_read_kernel_0_parent         OK   107ms/  430ms
test_read_kernel_1_parent         OK   218ms/  736ms
test_read_kernel_2_parent         OK   342ms/ 1059ms
test_read_kernel_3_parent         OK   431ms/ 1193ms

To bisect for a subset looked like:
test_bisect_kernel_1_parent       OK     7ms/  549ms
	(find 7 records, fairly spread out in the file)

test_bisect_kernel_1_parent_dense OK     3ms/  505ms
	(find 5 records, all close to eachother)
	


So if you compare the time to read the whole file (218ms), to the time to
extract a few of the records (7ms), I would say that python bisection is still
pretty fast. (31:1)

It depends a lot of the data source (dirstate was designed explicitly to
leverage what python is good at, namely splitting strings into lists, and then
splicing lists into other lists).

These specific tests are also very bit-rotted versus the current dirstate design.

Now, is bisection in python faster than in C? Well we have benchmarks for
bisecting through the dirstate in memory. This builds an in-memory tree with 10
top level directories, each with 10 subdirs, with 10 subdirs, with 10 subdirs,
and 1 file at the bottom level. And then it bisects for every record.

test_bisect_dirblock_c                       OK       49ms/    3879ms
test_bisect_dirblock_cached_py               OK      872ms/    4781ms
test_bisect_dirblock_py                      OK      826ms/    4786ms

So a C implementation can be approx 17x faster than the python one. (Which is
why we have a pyrex implementation of it.) The biggest win, though, is because
of how keys need to be compared. The way dirstate is sorted, you can't just
compare 2 entries directly, you have to do some splitting, etc. So the pyrex
version has a nice compare that can work directly on the strings, without
building any intermediates.

So anyway, I just wanted to say that depending on what you are doing, algorithm
is extremely important (31x improvement), but for some things so can writing a
low-level function (17x improvement).

But generally, a low-level function ends up as a fixed improvement, while the
algorithm changes the way things scale. (So it may be only a 2x improvement
when you have 10 records, but it is a 100x improvement when you have 10,000.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG4dF1JdeBCYSNAAMRAh15AKC4XXihYyz2mMMMezUWGgz2ElHsuwCgnBhZ
n2HAOOc5gVZ8m3n28QYC5lQ=
=FGeL
-----END PGP SIGNATURE-----



More information about the bazaar mailing list