history-db, now speeding up 'bzr log'

John Arbash Meinel john at arbash-meinel.com
Tue Apr 13 17:33:33 BST 2010


Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
> 
>     jam> So I've finally gotten to the point with bzr-history-db that I
>     jam> hook into bzrlib internals, and can see an actual effect on
>     jam> commands.  lp:~jameinel/+junk/bzr-history-db
> 
> Great!
> 
> <snip/>
> 
>     jam> At this point, I now hook into:
> 
>     jam>   Branch._do_dotted_revno_to_revision_id
>     jam>   Branch._do_revision_id_to_dotted_revno
>     jam>   Branch.iter_merge_sorted_revisions
>     jam>   Branch.hooks['post_change_branch_tip']
> 
> It would be nice to use real hooks here for all the features the plugin
> provides. That would ensure we provide the right hooks too :)
> 

Sort of. Providing alternative data storage isn't something I generally
consider 'hookable'. We can certainly think about it, but providing a
registry of where people can find history information doesn't seem to
quite fit.

Also, these methods were introduced (IIRC) by Ian specifically so that
they could be overridden. Which means we already have a 'hook' point:
override these methods. :)
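
For concreteness, here is a minimal sketch of what 'override these
methods' plus the real hook can look like from a plugin. Everything
except the names listed above (i.e. the two helper functions and the
hook label) is hypothetical:

  from bzrlib import branch as _mod_branch

  _original = _mod_branch.Branch._do_dotted_revno_to_revision_id

  def _cached_dotted_to_revision_id(self, revno):
      # Hypothetical cache lookup; fall back to the stock method when
      # the history-db has no answer for this branch.
      result = _lookup_in_history_db(self, revno)  # hypothetical
      if result is not None:
          return result
      return _original(self, revno)

  _mod_branch.Branch._do_dotted_revno_to_revision_id = (
      _cached_dotted_to_revision_id)

  def _on_tip_change(params):
      # params.branch is part of the real post_change_branch_tip API;
      # incrementally import its new revisions into the cache.
      _import_new_revisions(params.branch)  # hypothetical

  _mod_branch.Branch.hooks.install_named_hook(
      'post_change_branch_tip', _on_tip_change, 'history-db importer')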


>     jam> The plugin is currently quite verbose, though we can easily
>     jam> decrease this by commenting out 'trace.note()' calls, or
>     jam> turning them into mutter().
> 
> Or a config variable with verbosity level ?

Sure. Or -Dhistorydb or ... Get it working, get it right, get it fast :)
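
For example, gating the chatter on a debug flag would look something
like this (a sketch; the 'historydb' flag name is just an assumption):

  from bzrlib import debug, trace

  def note_timing(msg, *args):
      # Chatter on the console only when the user passed -Dhistorydb;
      # otherwise just mutter() into ~/.bzr.log.
      if 'historydb' in debug.debug_flags:
          trace.note(msg, *args)
      else:
          trace.mutter(msg, *args)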

> 
>     jam> Results:
> 
>     jam> 1) We could probably get rid of a bunch of the special casing logic in
>     jam>    'log.py' that was trying to avoid loading the merge_sorted graph.
> 
> Amen to that.
> 
> Yet, I consider the experiment to still be in progress and I'd like some
> clarifications first:
> 
> - how big is the database so far (and how easy is it to set it up so
>   that it's shared between branches, I know you thought about it, I'd
>   like a summary as of today),

It depends on how fully populated your db is. --expand-all is still
expensive, but it is unlikely to be done often.

My shared db with bzr, mysql, and emacs trunk branches is ~60MB. Just
bzr.dev is about 5MB. It might grow a bit, because I think I need
another column to ensure that the results are returned in the correct
order. (Right now insert order == select order, but I don't think you
can guarantee that.)
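
The kind of column I mean would make the order explicit instead of
relying on insert order; roughly (a sketch with a hypothetical 'dist'
column, not the actual schema change):

  import sqlite3

  conn = sqlite3.connect('history.db')
  # 'dist' = the row's position in the merge_sorted output, so an
  # explicit ORDER BY replaces the reliance on insert order.
  conn.execute("ALTER TABLE dotted_revno ADD COLUMN dist INTEGER")
  tip_db_id = 1  # placeholder: db_id of the branch tip being queried
  rows = conn.execute(
      "SELECT merged_revision, revno, merge_depth, end_of_merge"
      "  FROM dotted_revno WHERE tip_revision = ? ORDER BY dist",
      (tip_db_id,)).fetchall()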

However, since this is a proof-of-concept that may later switch to
BTreeIndexes, we have more flexibility in what data we store and how we
store it.

> 
> - can you backport some or all of your work to core already (you
>   mentioned an order of magnitude optimisation in a previous mail, is
>   the plugin strictly required for that and if not what is missing in
>   core),

Generally, everything requires the cache in order to work. The
partial-merge-sort operations assume you can find out what a previous
revision merged. Put in graph terms:

  A
  |\
  B C
  |/|
  D E
  |/
  F

When I go to number E, I assume that I can 'cheaply' find that D merged
C with dotted revno 1.1.1.
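
In cache terms, 'cheaply' means a single indexed query against the
dotted_revno table described below (a sketch; the db ids are
placeholders, and filtering on merge_depth > 0 to mean "merged, not
mainline" is an assumption):

  import sqlite3

  conn = sqlite3.connect('history.db')
  d_db_id = 4  # placeholder: db_id of revision D
  merged = conn.execute(
      "SELECT merged_revision, revno FROM dotted_revno"
      " WHERE tip_revision = ? AND merge_depth > 0",
      (d_db_id,)).fetchall()
  # => [(db_id_of_C, '1.1.1')], so numbering E never re-walks what D
  #    already numbered.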

The major optimizations involved 'incremental-merge-sort', which is
something core doesn't support at all. Right now, merge_sorting the
entire bzr.dev history is <=40ms; I don't think we can do much better
than the existing algorithm.

> 
>     jam> 2) 'time bzr log -n0 -r -10..-1 bzr.dev' is
>     jam>    1.210s => 0.771s
> 
>     jam> 3) 'time bzr log -n0 -r -10..-1 mysql-6.0' is
>     jam>    2.312s => 0.959s
> 
>     jam> 4) 'time bzr log -n0 -r -1000..-990 mysql-6.0' is
>     jam>    2.552s => 1.243s
> 
>     jam>   In looking at the verbose commentary, you can also see this:
>     jam>     history_db rev=>dotted took 0.032s, 0.026s to init, 0.006s to query
>     jam>     history_db iter_merge took 0.050s (0.050s query, 0.000s filter)
> 
>     jam>   Which means that iter_merge + dotted_revno lookups are really only
>     jam>   about 80ms out of that 1200ms overall time.
> 
>     jam> 5) 'time bzr log -n0 -r -100..-1 emacs'
>     jam>    3.890s => 0.784s
> 
>     jam>    Note, however that:
>     jam>    'time bzr log -n0 -r -10..-1 emacs'
>     jam>    0.614s => 0.769s
> 
>     jam>    However, this seems to be an artifact of using --no-plugins. 'bzr
>     jam>    rocks' on my machine is 150ms faster with --no-plugins.
> 
> All these results sound very promising. I'd like to correlate them with
> the space costs though.

Here are some size numbers versus the amount of data in the tables.

 8.8MB  all of bzr.dev, 30,575 revs
        = 301 bytes per revision
52.0MB  after expanding 4.7k branches, adding 794k new entries
        = 9.4kB per branch
        = 57 bytes per dotted_revno entry

20.0MB  mysql-6.0, 68,814 revs
        = 304 bytes per rev

33.0MB  emacs trunk, 99.8k revisions
        = 346 bytes per revision
38.0MB  after expanding all 48 branches, adding 77k new entries
        = 104kB per branch
        = 66 bytes per entry


> 
>     jam> From what I can tell, the big time lost is now that 'cmd_log'
>     jam> uses 'in_history(b)' rather than 'as_revision_id(b)'. Which
>     jam> means that Branch.revision_history() always gets called.
> 
> Chapter and verse? Is this fixable?

This was an incorrect early analysis. The time spent seems to be mostly
bzr startup overhead, where running with or without plugins has a large
effect: 'bzr rocks' is >500ms, and saves 150ms with --no-plugins.

So with 'bzr log' at 700ms overall, all of the actual
loading-data-and-logging work is under 200ms.

> 
>     jam> I can also say that the impact on other operations for updating
>     jam> the history-db cache is pretty minimal. Usually fully accounted
>     jam> for by other operations that we are doing.
> 
> Not a big surprise but it's good to confirm.

Without having incremental update, I don't think this would have been
the case, but I'm not positive. (get_known_graph can be 1-2s, though.)

> 
>     jam> As an example, merging bzr.dev into an old branch, and then
>     jam> doing 'commit' takes 15s but only 0.400s to import. So about
>     jam> 2.6% overhead.  Now, doing the same in a heavyweight checkout
>     jam> seems to take >3s. My guess is that the import is being
>     jam> triggered on the remote branch, rather than the local one
>     jam> (because its branch tip is changing first). I don't have a
>     jam> great answer for that yet.
> 
>     jam> 'bzr push' is pretty fast. It adds 1 round-trip to check the remote
>     jam> branch.conf to figure out if the db is enabled, and then sees that the
>     jam> revisions are already imported.
> 
>     jam> Anyway, with the current results, I'm pretty sure I've proven
>     jam> 'proof-of-concept'.
> 
> From an external point of view, yes you did :)
> 
>     jam> There is still some polishing and edge-case work to
>     jam> do. However, I think it has at least shown what I hoped to
>     jam> show.
> 
> A summary will be welcome then, mentioning:
> 
> - are you targeting lp and loggerhead mainly,

That was the original goal, though I was certainly hoping to learn from
this experiment, and hopefully some of that can be brought into core.

> 
> - do you plan to land this plugin in core or keep it for experiments and
>   land bits and pieces without adding a db dependency,

I really don't like the db dependency, especially for something in core.
However, BTreeGraphIndex isn't a great solution, either. SQL has some
really nice bits, like being able to express your queries at a high
level, and someone else has written well-tuned code to pull together
data from multiple indexes and data tables.

I'm also curious how much something like mapping revision_id => INTEGER
affects the results. I'm pretty sure it would have a large effect on
total data size: a 4-byte integer versus an arbitrary-length (~60-byte)
string. There is also your idea about using (gdfo, rev_id) to get
clustering. By mapping rev_id => INTEGER, we can choose any clustering
we like. (Currently rows are clustered based on merge_sort results.)
I'd like to try forcing an order (sorted(revision_ids)) and see whether
that affects performance much.
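
The mapping itself is the usual string-interning trick; roughly (a
sketch, not the plugin's exact DDL):

  import sqlite3

  conn = sqlite3.connect('history.db')
  # Store each ~60-byte revision_id string exactly once; every other
  # table refers to the small integer db_id instead.
  conn.execute(
      "CREATE TABLE IF NOT EXISTS revision ("
      " db_id INTEGER PRIMARY KEY AUTOINCREMENT,"
      " revision_id TEXT NOT NULL, UNIQUE (revision_id))")

  def ensure_db_id(revision_id):
      # Return the integer key for revision_id, inserting if needed.
      row = conn.execute(
          "SELECT db_id FROM revision WHERE revision_id = ?",
          (revision_id,)).fetchone()
      if row is not None:
          return row[0]
      return conn.execute(
          "INSERT INTO revision (revision_id) VALUES (?)",
          (revision_id,)).lastrowid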

> 
> Also, it would be very useful to quantify your algorithms with respect
> to:
> - the number of nodes processed,
> 

Some of the data is already present in the '_stats' members of Querier
and Importer, some of which gets mutter()ed.

It sort of depends which algorithms you are most interested in. I can
certainly describe the SQL queries :), and there are some tricks there.
For example, I don't have to read in any of the dotted_revno data until
I've found a tip revision I care about, and I can read in large ranges
of data using the 'mainline_parent_range' tables.

For example, pulling in the merge_sorted data, once we've found a tip
revision we are interested in, works like this:

1) Check the 'mainline_parent_range' table to see if we get a range with
   ~100 mainline nodes.
   SELECT pkey, tail FROM mainline_parent_range WHERE head = ?

2a) If not present, do a single select on the dotted revno table, and
    step the parent pointer. (Join against the 'revision' table so that
    you get revision_ids back.)

      SELECT db_id, revision_id, merge_depth, revno, end_of_merge
        FROM dotted_revno, revision
       WHERE tip_revision = ? AND db_id = merged_revision

    Step the parent pointer:
      SELECT parent FROM parent WHERE child = ? AND parent_idx = 0
    (essentially get_parent_map([c])[c][0])

2b) If present, use mainline_parent to pull in 100 parents, and grab the
    dotted revno information for them

      SELECT db_id, revision_id, merge_depth, revno, end_of_merge
        FROM dotted_revno, revision, mainline_parent
       WHERE db_id = merged_revision
         AND tip_revision = mainline_parent.revision
         AND mainline_parent.range = ?

    The new tip and range were found in step 1.
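
Put together, those steps look roughly like this (a control-flow
sketch around the queries above, not the plugin's exact code):

  def iter_merge_sorted(conn, tip_db_id):
      # Walk the mainline from the tip, preferring cached ~100-node
      # ranges (step 2b), falling back to one-node steps (step 2a).
      cur = tip_db_id
      while cur is not None:
          row = conn.execute(
              "SELECT pkey, tail FROM mainline_parent_range"
              " WHERE head = ?", (cur,)).fetchone()
          if row is not None:                # step 2b
              range_key, tail = row
              for r in conn.execute(
                  "SELECT db_id, revision_id, merge_depth, revno,"
                  "       end_of_merge"
                  "  FROM dotted_revno, revision, mainline_parent"
                  " WHERE db_id = merged_revision"
                  "   AND tip_revision = mainline_parent.revision"
                  "   AND mainline_parent.range = ?", (range_key,)):
                  yield r
              cur = tail
          else:                              # step 2a
              for r in conn.execute(
                  "SELECT db_id, revision_id, merge_depth, revno,"
                  "       end_of_merge"
                  "  FROM dotted_revno, revision"
                  " WHERE tip_revision = ?"
                  "   AND db_id = merged_revision", (cur,)):
                  yield r
              row = conn.execute(
                  "SELECT parent FROM parent"
                  " WHERE child = ? AND parent_idx = 0",
                  (cur,)).fetchone()
              cur = row[0] if row else None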

This 1/2 step is repeated in a variety of queries. For example, when
doing 'bzr log -r X..Y' we do a quick step to find Y, which doesn't
actually bring in the dotted data. Specifically:

     SELECT 1 FROM dotted_revno, mainline_parent
      WHERE merged_revision = ?
        AND mainline_parent.range = ?
        AND tip_revision = mainline_parent.revision

Essentially, you are just looking for the existence of a (tip_revision,
merged_revision) pair.

Also note that there is a 'UNIQUE' constraint on (tip_revision,
merged_revision), which means there is an INDEX on that pair as well. As
such, we should be able to read the 100 revisions from the
mainline_parent table, look them up in dotted_revno by tip_revision, and
quickly filter on whether 'merged_revision' is present.
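
In DDL terms that constraint is just (a sketch of the relevant part of
the table, not the full schema):

  import sqlite3

  conn = sqlite3.connect('history.db')
  conn.execute(
      "CREATE TABLE IF NOT EXISTS dotted_revno ("
      " tip_revision INTEGER REFERENCES revision,"
      " merged_revision INTEGER REFERENCES revision,"
      " revno TEXT, merge_depth INTEGER, end_of_merge BOOLEAN,"
      " UNIQUE (tip_revision, merged_revision))")  # implies an index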

Another place this shows up is the "dotted_revno => revision_id" mapping
lookup, where we do:

   SELECT revision_id, revno
     FROM dotted_revno, mainline_parent, revision
    WHERE tip_revision = mainline_parent.revision
      AND merged_revision = revision.db_id
      AND mainline_parent.range = ?
      AND revno IN (?, ?, ?, ?)

If you want to do revision_id => dotted_revno, we just change that last
line:

   SELECT revision_id, revno
     FROM dotted_revno, mainline_parent, revision
    WHERE tip_revision = mainline_parent.revision
      AND merged_revision = revision.db_id
      AND mainline_parent.range = ?
      AND merged_revision IN (?, ?, ?, ?)
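
The '?' lists are sized to match the number of keys being looked up;
e.g. (a sketch; the function name is hypothetical):

  def lookup_revision_ids(conn, range_key, revnos):
      # Build an IN (?, ?, ...) list matching len(revnos).
      qmarks = ', '.join('?' * len(revnos))
      return conn.execute(
          "SELECT revision_id, revno"
          "  FROM dotted_revno, mainline_parent, revision"
          " WHERE tip_revision = mainline_parent.revision"
          "   AND merged_revision = revision.db_id"
          "   AND mainline_parent.range = ?"
          "   AND revno IN (%s)" % (qmarks,),
          [range_key] + list(revnos)).fetchall()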

I did notice that the dotted => revision_id takes ~50ms, while
revision_id => dotted takes about 5ms. I'm pretty sure it is because of
the (tip_revision, merged_revision) index.

I wonder whether sqlite knows how to use just part of an index, so it
can still look up tip_revision efficiently even though it then has to
read the whole page to see whether 'revno' matches.
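
sqlite can be asked directly: EXPLAIN QUERY PLAN (unlike bare EXPLAIN)
reports index usage at a readable level. A sketch:

  import sqlite3

  conn = sqlite3.connect('history.db')
  plan = conn.execute(
      "EXPLAIN QUERY PLAN"
      " SELECT revision_id, revno"
      "   FROM dotted_revno, mainline_parent, revision"
      "  WHERE tip_revision = mainline_parent.revision"
      "    AND merged_revision = revision.db_id"
      "    AND mainline_parent.range = ?"
      "    AND revno IN (?, ?)", (1, '10', '10.1.1')).fetchall()
  # A plan line mentioning the UNIQUE constraint's automatic index
  # (named like sqlite_autoindex_dotted_revno_1) would confirm that
  # (tip_revision, merged_revision) is being used for the tip_revision
  # half of the lookup.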


> - the number of physical IOs (excluding cache trickery but showing how
>   you group nodes in your queries),
> 
> Congrats again,
> 
>          Vincent
> 

I'm not entirely sure how you would want to measure this. There is
"EXPLAIN SELECT ...", which gives you a step-by-step account of how
sqlite will generate the results. (Personally, I find the postgresql
output much more understandable; IIRC sqlite just gives you the VM
bytecode.)

John
=:->