Caching 'dotted_revnos' and merge_sort
John Arbash Meinel
john at arbash-meinel.com
Sat Apr 3 00:27:49 BST 2010
I've been toying around with some ideas about how to make history
management a bit better in bzr. Specifically focusing on dotted_revnos
and some other aspects that are currently 'whole ancestry' operations.
I've started experimenting with it by defining some SQL schemas that we
could use, and prototyping the ideas in an sqlite database. I think most
of this could be extended into bzr-specific btree indices if we want to
avoid using sqlite for it.
I've gotten far enough to determine that some of the tricks I had
thought about really do seem to work. The plugin is currently available
here:
lp:~jameinel/+junk/bzr-history-db
The primary observation that motivated the work was:
The merge_sort / dotted_revno graph is stable based on tip_revision.
All children of a given revision share that part of the revision_id =>
dotted_revno map.
If we record the 'new' dotted_revnos for each revision (vs its
left-hand parent), then we should be able to 'efficiently'[a] cache
this graph between many branches.
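To make the "shared between branches" property concrete, here is a minimal sqlite3 sketch of the idea: dotted revnos are keyed by the mainline revision that introduced them, so any branch whose mainline includes that revision can reuse the same rows. The table and column names here are illustrative guesses, not the plugin's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (db_id INTEGER PRIMARY KEY, revision_id TEXT UNIQUE);
-- One row per (mainline revision, merged revision): the dotted revnos
-- that become 'new' when merge_commit lands on a mainline.
CREATE TABLE dotted_revno (
    merge_commit INTEGER REFERENCES revision(db_id),
    revision     INTEGER REFERENCES revision(db_id),
    dotted_revno TEXT,
    PRIMARY KEY (merge_commit, revision)
);
""")

def add_rev(rid):
    cur = conn.execute("INSERT INTO revision (revision_id) VALUES (?)", (rid,))
    return cur.lastrowid

a, b, b1 = add_rev("rev-a"), add_rev("rev-b"), add_rev("rev-b.1")
# rev-b merged rev-b.1, so landing rev-b introduces two dotted revnos.
conn.executemany("INSERT INTO dotted_revno VALUES (?, ?, ?)",
                 [(a, a, "1"), (b, b, "2"), (b, b1, "1.1.1")])

def revno_map(mainline_db_ids):
    """Union the per-revision rows for every mainline revision given."""
    rows = conn.execute(
        "SELECT r.revision_id, d.dotted_revno FROM dotted_revno d"
        " JOIN revision r ON r.db_id = d.revision"
        " WHERE d.merge_commit IN (%s)" % ",".join("?" * len(mainline_db_ids)),
        mainline_db_ids)
    return dict(rows)
```

A second branch whose mainline also contains rev-a and rev-b would pass the same db_ids and reuse the same rows; only its own new mainline revisions need fresh entries.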
The main downside is that you can't query it trivially for a given
branch, because you now need the intersection of mainline_ancestry(tip)
with the dotted revnos. So the next observation was a possible method
for making mainline_ancestry(tip) reasonably efficient: record ranges
of ancestry (say, 100 revs at a time), which is *also* overlapping
between branches.
This turns out to work quite well. It also turns out to enable
optimization of a lot of queries that I didn't expect at first. See the
attached file for descriptions of the various data tables, and a few
more details about data structure and why.
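The range trick above can be sketched in a few lines: chop the mainline into fixed-size ranges so walking a long mainline takes one lookup per range instead of one per revision, with older ranges shared by every branch that includes them. Again, this is a hypothetical schema for illustration, not the plugin's exact tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mainline_parent_range (
    range_id INTEGER PRIMARY KEY,
    head INTEGER,   -- newest revision db_id in the range
    tail INTEGER    -- head of the next-older range (NULL at the origin)
);
CREATE TABLE mainline_parent (
    range_id INTEGER,
    idx INTEGER,        -- position within the range, newest first
    revision INTEGER,
    PRIMARY KEY (range_id, idx)
);
""")

RANGE_SIZE = 100

def import_mainline(revs):
    """revs: mainline db_ids, newest first; chop them into ranges."""
    for start in range(0, len(revs), RANGE_SIZE):
        chunk = revs[start:start + RANGE_SIZE]
        nxt = start + RANGE_SIZE
        tail = revs[nxt] if nxt < len(revs) else None
        cur = conn.execute(
            "INSERT INTO mainline_parent_range (head, tail) VALUES (?, ?)",
            (chunk[0], tail))
        conn.executemany("INSERT INTO mainline_parent VALUES (?, ?, ?)",
                         [(cur.lastrowid, i, r) for i, r in enumerate(chunk)])

def mainline(tip):
    """Walk range-at-a-time from tip back to the origin."""
    result, head = [], tip
    while head is not None:
        range_id, tail = conn.execute(
            "SELECT range_id, tail FROM mainline_parent_range WHERE head = ?",
            (head,)).fetchone()
        result.extend(r for (r,) in conn.execute(
            "SELECT revision FROM mainline_parent WHERE range_id = ?"
            " ORDER BY idx", (range_id,)))
        head = tail
    return result

revs = list(range(250, 0, -1))   # 250 mainline revisions, newest first
import_mainline(revs)
```

A 250-revision mainline is then reconstructed in three range lookups rather than 250 parent lookups, and a second branch that diverged recently would add only one new range of its own.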
Some observations of what I have so far:
1) Import is not fast enough to be 'live' for first import, but it's
decent from there on.
a) 'import bzr.dev' takes ~5s on my laptop
b) 'update w/ bzr/2.0' takes <2s on my laptop
c) 1.3s is taken up just grabbing KnownGraph.merge_sort() at the
beginning (which is O(all_ancestry)). If we already have a cache,
and we know what part of the cache we can use, computing the rest
should be O(changes). I've already written some of the code to do
so, but originally wrote it trying to only have an in-memory cache
(and lazily number the rest), which didn't work very well.
2) Using SQL really makes data storage convenient. Just define some
tables, throw on some indexes, etc.
3) SQLite isn't particularly compact. I imported all the branches in my
bzr repo (which includes a bunch of plugins). For 47k revisions, 60k
parent relations, and 80k dotted revnos, it takes up 15MB.
For comparison, .bzr/repository/indices is 15MB, and that includes
all of .[cirst]ix for 31 pack files (155 index files).
My whole bzr repo is 95MB in size. So the cache is a rather large
fraction of the data. And it isn't indexing anything about texts,
just the revision graph.
4) This is similar in principle to what 'historycache' was trying to do.
However, there is still the 'primary observation' difference, which
means the data scales to *multiple* branches, rather than just
holding a cache for one branch. It also means that the cache doesn't
need to be rebuilt from scratch when a branch changes, etc.
5) SQLite does seem pretty fast to start up. TIMEIT says starting just
python is 113ms; importing sqlite and doing a count(*) query on the
revision table is 145ms. (127ms to import, and 127ms to import and
connect, so ~18ms for the count(*).)
6) The basic storage of the 'dotted_revno' table could certainly be done
in a single BTreeGraphIndex. However, for queries you want to 'be
fast', you then need another index on them, with all the associated
disk-overhead. Also, indexes currently aren't structured to point to
each other, so you would store the data you really wanted in a pack
file, and then have multiple pointers into that.
7) SQLite's database lookups are faster than BTreeGraphIndex. I'm not
sure whether that is because it is pure C, because it isn't zlib
compressing the pages, because it isn't handling multiple write-once
indices, etc. However, the tables are also defined in terms of
db-specific integers; once you map them back to revision-ids, some
bits get pretty slow.
8) Some numbers for 'branch.revision_history()':
646ms using bzrlib, and a pack repo with 20 pack files
406ms a fresh repo with only bzrlib in it
296ms using the database, and queries that join the
revision_id=>db_id map (so you query it with revision_ids)
And walking the 'parent' table
243ms using the db, using raw db_id queries, walking the parent
table
18ms using the db, using db_id queries, walking the
'mainline_parent_range' table.
Not terribly surprising that using a table that denormalizes the data
is faster than using the fully normalized form. However, for ~8 bzr
branches in the database, there are only 5661 rows in
mainline_parent, which is a small growth over the 5131 to hold just
bzr. (Again it will depend on how 'close' the branches are, etc.)
9) Some numbers for 'branch.repository.ancestry()':
1429ms using branch.repo.graph().iter_ancestry() w/ 20 packs
573ms using branch.repository.revisions.get_known_graph_ancestry()
1740ms using database, and queries using raw revision-ids, stepping
one-revision at a time
200ms using db, and queries having:
SELECT parent FROM parent WHERE child in (?*)
And passing up to 100 db_ids at a time. (sqlite has a limit of
999 parameters, but I found 100 to be reasonable in among
1,2,5,10,20,50,100,200,500).
163ms using db, and mainline_parent to find the mainline first, then
a similar by-100 loop to fill in the rest of the ancestors.
50ms intersecting the mainline_parent table against the
dotted_revno table.
So there is a difference of about 30x between the best and worst
methods. Also, you have to have imported a given branch before the
last method will actually return correct results, whereas the
parent-table walk will give correct results for all possible
revisions.
10) There are probably also other factors at play worth investigating.
Specifically, the way I generate the revision_id => db_id mapping
means that there is probably a fair amount of locality. (numbers are
assigned as you walk the merged-sorted graph.) This is similar in
principle to what Vincent wanted from a (gdfo, revision_id) index.
Though I think the locality there is that revisions end up close to
the revision that merged them, not close to their parents.
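The batched parent lookup from (9) is easy to reproduce: walk the ancestry breadth-first, passing up to 100 children per statement instead of one. This is a self-contained sketch against a toy parent table, not the plugin's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (child INTEGER, parent INTEGER)")
# A mostly-linear graph (n's parent is n-1) plus one merged side branch.
conn.executemany("INSERT INTO parent VALUES (?, ?)",
                 [(n, n - 1) for n in range(2, 501)]
                 + [(500, 1000), (1000, 250)])

BATCH = 100   # well under sqlite's 999-parameter limit

def ancestry(tips):
    """Find all ancestors, querying up to BATCH children per statement."""
    seen, pending = set(tips), list(tips)
    while pending:
        batch, pending = pending[:BATCH], pending[BATCH:]
        rows = conn.execute(
            "SELECT parent FROM parent WHERE child IN (%s)"
            % ",".join("?" * len(batch)), batch)
        for (p,) in rows:
            if p not in seen:
                seen.add(p)
                pending.append(p)
    return seen
```

The step-one-revision-at-a-time variant is the same loop with BATCH = 1; the 100x batching is where the 1740ms-to-200ms difference comes from.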
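And the fastest method, intersecting mainline_parent against dotted_revno, reduces to a single join: every ancestor appears in exactly one dotted_revno row, keyed by the mainline revision that merged it. This sketch flattens mainline_parent to (tip, revision) pairs for brevity; the real table is range-based as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mainline_parent (tip INTEGER, revision INTEGER);
CREATE TABLE dotted_revno (merge_commit INTEGER, revision INTEGER);
""")
# Mainline of tip 3 is [3, 2, 1]; revision 3 also merged revision 10.
conn.executemany("INSERT INTO mainline_parent VALUES (?, ?)",
                 [(3, 3), (3, 2), (3, 1)])
conn.executemany("INSERT INTO dotted_revno VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 3), (3, 10)])

def ancestry_via_dotted(tip):
    """Full ancestry as one join: mainline rows pull in their merged revs."""
    return {r for (r,) in conn.execute(
        "SELECT d.revision FROM dotted_revno d"
        " JOIN mainline_parent m ON m.revision = d.merge_commit"
        " WHERE m.tip = ?", (tip,))}
```

No graph walking happens at query time at all, which is why it wins, and also why it only works for branches whose dotted_revno rows have already been imported.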
Anyway, some next steps are to try it with stuff like mysql, and to add
more queries against the data structure to see what works and what
doesn't. I think the basic concept has a lot of merit, especially for
stuff like loggerhead.
John
=:->
[a] If you fully populate the dotted_revno table, then you end up with
one row per revision in the ancestry times the number of revisions
merged into that tip. So the limit is O(N_ancestors * N_avg_merged).
N_avg_merged is certainly capped at N_ancestors, though that would
lead to O(N^2), which isn't great. However, there are mitigating
(and complicating) factors.
i) Average size of merged often depends on the size of feature
branches, but you have to consider people merging trunk back into
their feature branch as well.
ii) Still, I would expect it to be a small number *on average*. With
many revisions having no merges, and a small handful hitting
into the 1000s.
iii) You don't have to fully populate the dotted_revno for every
possible branch ancestry. Probably just for the 'currently
active' branches.
Attachment: history_denormalization.txt
https://lists.ubuntu.com/archives/bazaar/attachments/20100402/6eb3a517/attachment-0001.txt