Caching 'dotted_revnos' and merge_sort
John Arbash Meinel
john at arbash-meinel.com
Sat Apr 3 00:27:49 BST 2010
I've been toying around with some ideas about how to make history
management a bit better in bzr. Specifically focusing on dotted_revnos
and some other aspects that are currently 'whole ancestry' operations.
I've started experimenting with it by defining some SQL schemas that we
could use, and prototyping the ideas in an sqlite database. I think most
of this could be extended into bzr-specific btree indices if we want to
avoid using sqlite for it.
I've gotten far enough to determine that some of the tricks I had
thought about really do seem to work. The plugin is currently available
here:
lp:~jameinel/+junk/bzr-history-db
The primary observation that motivated the work was:
The merge_sort / dotted_revno graph is stable based on tip_revision.
All children of a given revision share that part of the revision_id =>
dotted_revno map.
If we record the 'new' dotted_revnos for each revision (vs its
left-hand parent), then we should be able to 'efficiently'[a] cache
this graph between many branches.
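To make the "shared between branches" property concrete, here is a minimal sqlite3 sketch of the idea: dotted revnos are keyed by the mainline revision that introduced them, so any branch whose mainline includes that revision can reuse the same rows. The table and column names here are illustrative guesses, not the plugin's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (db_id INTEGER PRIMARY KEY, revision_id TEXT UNIQUE);
-- One row per (mainline revision, merged revision): the dotted revnos
-- that become 'new' when merge_commit lands on a mainline.
CREATE TABLE dotted_revno (
    merge_commit INTEGER REFERENCES revision(db_id),
    revision     INTEGER REFERENCES revision(db_id),
    dotted_revno TEXT,
    PRIMARY KEY (merge_commit, revision)
);
""")

def add_rev(rid):
    cur = conn.execute("INSERT INTO revision (revision_id) VALUES (?)", (rid,))
    return cur.lastrowid

a, b, b1 = add_rev("rev-a"), add_rev("rev-b"), add_rev("rev-b.1")
# rev-b merged rev-b.1, so landing rev-b introduces two dotted revnos.
conn.executemany("INSERT INTO dotted_revno VALUES (?, ?, ?)",
                 [(a, a, "1"), (b, b, "2"), (b, b1, "1.1.1")])

def revno_map(mainline_db_ids):
    """Union the per-revision rows for every mainline revision given."""
    rows = conn.execute(
        "SELECT r.revision_id, d.dotted_revno FROM dotted_revno d"
        " JOIN revision r ON r.db_id = d.revision"
        " WHERE d.merge_commit IN (%s)" % ",".join("?" * len(mainline_db_ids)),
        mainline_db_ids)
    return dict(rows)
```

A second branch whose mainline also contains rev-a and rev-b would pass the same db_ids and reuse the same rows; only its own new mainline revisions need fresh entries.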
The main downside is that you can't query it trivially for a given
branch, because you now need the intersection of mainline_ancestry(tip)
with the dotted revnos. So the next observation was a possible method
for making mainline_ancestry(tip) reasonably efficient: record ranges
of ancestry (say, 100 revs at a time), which is *also* overlapping
between branches.
This turns out to work quite well. It also turns out to enable
optimization of a lot of queries that I didn't expect at first. See the
attached file for descriptions of the various data tables, and a few
more details about data structure and why.
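The range trick above can be sketched in a few lines: chop the mainline into fixed-size ranges so walking a long mainline takes one lookup per range instead of one per revision, with older ranges shared by every branch that includes them. Again, this is a hypothetical schema for illustration, not the plugin's exact tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mainline_parent_range (
    range_id INTEGER PRIMARY KEY,
    head INTEGER,   -- newest revision db_id in the range
    tail INTEGER    -- head of the next-older range (NULL at the origin)
);
CREATE TABLE mainline_parent (
    range_id INTEGER,
    idx INTEGER,        -- position within the range, newest first
    revision INTEGER,
    PRIMARY KEY (range_id, idx)
);
""")

RANGE_SIZE = 100

def import_mainline(revs):
    """revs: mainline db_ids, newest first; chop them into ranges."""
    for start in range(0, len(revs), RANGE_SIZE):
        chunk = revs[start:start + RANGE_SIZE]
        nxt = start + RANGE_SIZE
        tail = revs[nxt] if nxt < len(revs) else None
        cur = conn.execute(
            "INSERT INTO mainline_parent_range (head, tail) VALUES (?, ?)",
            (chunk[0], tail))
        conn.executemany("INSERT INTO mainline_parent VALUES (?, ?, ?)",
                         [(cur.lastrowid, i, r) for i, r in enumerate(chunk)])

def mainline(tip):
    """Walk range-at-a-time from tip back to the origin."""
    result, head = [], tip
    while head is not None:
        range_id, tail = conn.execute(
            "SELECT range_id, tail FROM mainline_parent_range WHERE head = ?",
            (head,)).fetchone()
        result.extend(r for (r,) in conn.execute(
            "SELECT revision FROM mainline_parent WHERE range_id = ?"
            " ORDER BY idx", (range_id,)))
        head = tail
    return result

revs = list(range(250, 0, -1))   # 250 mainline revisions, newest first
import_mainline(revs)
```

A 250-revision mainline is then reconstructed in three range lookups rather than 250 parent lookups, and a second branch that diverged recently would add only one new range of its own.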
Some observations of what I have so far:
1) Import is not fast enough to be 'live' for first import, but it's
decent from there on.
a) 'import bzr.dev' takes ~5s on my laptop
b) 'update w/ bzr/2.0' takes <2s on my laptop
c) 1.3s is taken up just grabbing KnownGraph.merge_sort() at the
beginning (which is O(all_ancestry)). If we already have a cache,
and we know what part of the cache we can use, computing the rest
should be O(changes). I've already written some of the code to do
so, but originally wrote it trying to only have an in-memory cache
(and lazily number the rest), which didn't work very well.
2) Using SQL really makes data storage convenient. Just define some
tables, throw on some indexes, etc.
3) SQLite isn't particularly compact. I imported all the branches in my
bzr repo (which includes a bunch of plugins). For 47k revisions, 60k
parent relations, and 80k dotted revnos, it takes up 15MB.
For comparison, .bzr/repository/indices is 15MB, and that includes
all of .[cirst]ix for 31 pack files (155 index files).
My whole bzr repo is 95MB in size. So the cache is a rather large
fraction of the data. And it isn't indexing anything about texts,
just the revision graph.
4) This is similar in principle to what 'historycache' was trying to do.
However, there is still the 'primary observation' difference, which
means the data scales to *multiple* branches, rather than just
holding a cache for one branch. It also means that the cache doesn't
need to be rebuilt from scratch when a branch changes, etc.
5) SQLite does seem pretty fast to start up. TIMEIT says starting just
python is 113ms; importing sqlite and doing a count(*) query on the
revision table is 145ms. (127ms to import, and 127ms to import and
connect, so ~18ms for the count(*).)
6) The basic storage of the 'dotted_revno' table could certainly be done
in a single BTreeGraphIndex. However, for queries you want to 'be
fast', you then need another index on them, with all the associated
disk-overhead. Also, indexes currently aren't structured to point to
each other, so you would store the data you really wanted in a pack
file, and then have multiple pointers into that.
7) SQLite's database lookups are faster than BTreeGraphIndex. I'm not
sure whether that is because it is pure C, because it isn't zlib
compressing the pages, because it isn't handling multiple write-once
indices, etc. However, the tables are also defined in terms of
db-specific integers; once you map them back to revision-ids, some
bits get pretty slow.
8) Some numbers for 'branch.revision_history()':
646ms using bzrlib, and a pack repo with 20 pack files
406ms a fresh repo with only bzrlib in it
296ms using the database, and queries that join the
revision_id=>db_id map (so you query it with revision_ids)
And walking the 'parent' table
243ms using the db, using raw db_id queries, walking the parent
table
18ms using the db, using db_id queries, walking the
'mainline_parent_range' table.
Not terribly surprising that using a table that denormalizes the data
is faster than using the fully normalized form. However, for ~8 bzr
branches in the database, there are only 5661 rows in
mainline_parent, which is a small growth over the 5131 to hold just
bzr. (Again it will depend on how 'close' the branches are, etc.)
9) Some numbers for 'branch.repository.ancestry()':
1429ms using branch.repo.graph().iter_ancestry() w/ 20 packs
573ms using branch.repository.revisions.get_known_graph_ancestry()
1740ms using database, and queries using raw revision-ids, stepping
one-revision at a time
200ms using db, and queries having:
SELECT parent FROM parent WHERE child in (?*)
And passing up to 100 db_ids at a time. (sqlite has a limit of
999 parameters, but I found 100 to be reasonable in among
1,2,5,10,20,50,100,200,500).
163ms using db, and mainline_parent to find the mainline first, then
a similar by-100 loop to fill in the rest of the ancestors.
50ms intersecting the mainline_parent table against the
dotted_revno table.
So there is a difference of about 30x between the best and worst
methods. Also, you have to have imported a given branch before the
last method will actually return correct results, whereas the
parent-table walk will give correct results for all possible
revisions.
10) There are probably also other factors at play worth investigating.
Specifically, the way I generate the revision_id => db_id mapping
means that there is probably a fair amount of locality. (numbers are
assigned as you walk the merged-sorted graph.) This is similar in
principle to what Vincent wanted from a (gdfo, revision_id) index.
Though I think the locality there is that revisions end up close to
the revision that merged them, not close to their parents.
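The batched parent lookup from (9) is easy to reproduce: walk the ancestry breadth-first, passing up to 100 children per statement instead of one. This is a self-contained sketch against a toy parent table, not the plugin's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (child INTEGER, parent INTEGER)")
# A mostly-linear graph (n's parent is n-1) plus one merged side branch.
conn.executemany("INSERT INTO parent VALUES (?, ?)",
                 [(n, n - 1) for n in range(2, 501)]
                 + [(500, 1000), (1000, 250)])

BATCH = 100   # well under sqlite's 999-parameter limit

def ancestry(tips):
    """Find all ancestors, querying up to BATCH children per statement."""
    seen, pending = set(tips), list(tips)
    while pending:
        batch, pending = pending[:BATCH], pending[BATCH:]
        rows = conn.execute(
            "SELECT parent FROM parent WHERE child IN (%s)"
            % ",".join("?" * len(batch)), batch)
        for (p,) in rows:
            if p not in seen:
                seen.add(p)
                pending.append(p)
    return seen
```

The step-one-revision-at-a-time variant is the same loop with BATCH = 1; the 100x batching is where the 1740ms-to-200ms difference comes from.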
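And the fastest method, intersecting mainline_parent against dotted_revno, reduces to a single join: every ancestor appears in exactly one dotted_revno row, keyed by the mainline revision that merged it. This sketch flattens mainline_parent to (tip, revision) pairs for brevity; the real table is range-based as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mainline_parent (tip INTEGER, revision INTEGER);
CREATE TABLE dotted_revno (merge_commit INTEGER, revision INTEGER);
""")
# Mainline of tip 3 is [3, 2, 1]; revision 3 also merged revision 10.
conn.executemany("INSERT INTO mainline_parent VALUES (?, ?)",
                 [(3, 3), (3, 2), (3, 1)])
conn.executemany("INSERT INTO dotted_revno VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 3), (3, 10)])

def ancestry_via_dotted(tip):
    """Full ancestry as one join: mainline rows pull in their merged revs."""
    return {r for (r,) in conn.execute(
        "SELECT d.revision FROM dotted_revno d"
        " JOIN mainline_parent m ON m.revision = d.merge_commit"
        " WHERE m.tip = ?", (tip,))}
```

No graph walking happens at query time at all, which is why it wins, and also why it only works for branches whose dotted_revno rows have already been imported.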
Anyway, some next steps are to try it with stuff like mysql, and to add
more queries against the data structure to see what works and what
doesn't. I think the basic concept has a lot of merit, especially for
stuff like loggerhead.
John
=:->
[a] If you fully populate the dotted_revno table, then you end up with
one row per revision in the ancestry times the number of revisions
merged into that tip. So the limit is O(N_ancestors * N_avg_merged).
N_avg_merged is certainly capped at N_ancestors, though that would
lead to O(N^2), which isn't great. However, there are mitigating
(and complicating) factors.
i) Average size of merged often depends on the size of feature
branches, but you have to consider people merging trunk back into
their feature branch as well.
ii) Still, I would expect it to be a small number *on average*. With
many revisions having no merges, and a small handful hitting
into the 1000s.
iii) You don't have to fully populate the dotted_revno for every
possible branch ancestry. Probably just for the 'currently
active' branches.
Attachment: history_denormalization.txt
https://lists.ubuntu.com/archives/bazaar/attachments/20100402/6eb3a517/attachment-0001.txt