bzr-revnocache can now cache the merge graph

Wed Jan 21 20:36:45 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> The trunk of bzr-revnocache now lets you configure what
> gets cached: "just" the rev-id=>revno map or the full
> (sorted) merge graph. If you're interested in helping
> to test out the latter, grab the latest code and edit
> the top of __init__.py. (FWIW, you can also disable
> caching altogether if you want as well.)
> 
> Note that the trigger for storing the merge graph is
> currently still Branch.get_revision_id_to_revno_map().
> There is no Branch.iter_merge_revisions() API that
> I can override yet. A MERGE/RFC is in the review queue
> for this.
> 
> As a next step, I'd like to try out caching "mergelines"
> in addition to or instead of the full merge graph. I
> expect a mergeline record to look something like:
> 
>  mergerevno depth length last_rev_id join_rev_id [first_rev_id split_rev_id]
> 
> See my comment in https://bugs.launchpad.net/bzr/+bug/125773
> for an explanation of these terms. If I'm wasting my time going
> down this track, please let me know!

I'm not entirely sure I understand what you are caching. An example
might help. This is the last 3 revisions of bzr.dev:

revno: 3949
revision-id: pqm at pqm.ubuntu.com-20090120210300-641tutf1rkdn8a3n

    revno: 3943.4.5
    revision-id: john at arbash-meinel.com-20090120201745-gkvsfuhc9tf7m1bm

    revno: 3943.4.4
    revision-id: john at arbash-meinel.com-20090116223613-sywet9fbx0589a3z

    revno: 3943.4.3
    revision-id: john at arbash-meinel.com-20090116223224-hspiby4drt1ng7uy

    revno: 3943.4.2
    revision-id: john at arbash-meinel.com-20090116222734-9mgzsmtcvv9qr2mt

    revno: 3943.4.1
    revision-id: john at arbash-meinel.com-20090116221614-k8su03l5d22tq6iv
- ------------------------------------------------------------
revno: 3948
revision-id: pqm at pqm.ubuntu.com-20090120044335-pwr2rshr1yu6vzti

    revno: 3946.1.2
    revision-id:ian.clatworthy at canonical.com-20090120040338-yxg0r2ab45lqtks9
- ------------------------------------------------------------
revno: 3947
revision-id: pqm at pqm.ubuntu.com-20090120032136-alahvfk4g7y8iczn

    revno: 3946.1.1
    revision-id:
ian.clatworthy at canonical.com-20090120021235-n8bd5kzz3s624aq5

    revno: 3943.3.4
    revision-id: amanic at gmail.com-20090118014815-hmowk5wx102lwquz

    revno: 3943.3.3
    revision-id: amanic at gmail.com-20090118011839-qanqndkfqrkek7e1

    revno: 3943.3.2
    revision-id: amanic at gmail.com-20090118010721-2s62eobbgk51qqn9

    revno: 3943.3.1
    revision-id: amanic at gmail.com-20090118004202-nh1b5bpr91ty2rxq

So what would end up in the cache, something like:

3943.4 1 5 john at arbash-meinel.com-20090120201745-gkvsfuhc9tf7m1bm
  john at arbash-meinel.com-20090116221614-k8su03l5d22tq6iv

I guess my question is "mergerevno" the revno this 'line' was merged
into (3949), or is it the prefix on the revisions themselves (3943.4)

What is last, join, first and split?

Also, notice that for 3947, we have a double merge, but both are at
depth 1. We basically have this graph (time flows down):

 A
 |\
 | B
 : |
 | C
 |  \
 D   |
 |\ /
 | E
 |/
 F

A, D, F are the mainline of bzr, B, C was done as amanic, and E is ian's
branch of bzr.dev which then merges amanic's work and submits it.
I believe the depth of B & C are still 1, because E didn't have any
other commits. Contrast that with:

 A
 |\
 | B
 : |
 | C
 |  \
 D   |
 |\  |
 | E |
 | |/
 | F
 |/
 G

The 'merge sort' way of graphing this is actually:

 A
 |\
 D \
 |\ \
 | E |
 | | |
 | | B
 | | |
 | | C
 | |/
 | F
 |/
 G

> 
> At a minimum, I expect this cache will:
> 
> * be a small fraction of the other caches size wise
>   and therefore be orders of magnitude faster to load

I think this could be a good mix of "cache enough to be useful" but
still make use of the parent_map to avoid redundant information. It is
unfortunate that it is difficult to use this to speed up revision_id =>
dotted_revno.

I suppose if you read the whole cache, and track what revision_ids are
present, you might be able to start at a revision, and then walk back
until you find a revision in the cache. Which should give you one end of
the map.

> 
> * speed up lookups of dotted_revno -> rev_id because
>   of this.
> 
> I also wonder whether we can reconstitute the sorted merge
> graph and/or rev_id=>revno map from *just* this information
> efficiently? Thoughts? Sample code to help me write
> 
>   def iter_merge_lines(branch, merge_graph_iter)?
> 
> Perhaps that API together with something like
> 
>   def mergeline_revision_info(branch, mergerevno)
> 
> would be useful in other parts of bzrlib or plugins?
> I suspect knowing where each mergeline got joined into
> the graph (i.e. a child pointer) must be useful in
> some graph walking algorithms, yes?
> 
> In summary, bzr-revnocache continues to grow as a
> proof-of-concept. I think it could nicely complement
> John's lazy-revno calculation branch which focusses
> on efficiently going the other direction, i.e.
> rev-id to revno conversion. As branch is extended
> with APIs like iter_merge_revisions(), performance
> improvements ought to gradually ripple though the
> various commands as the code starts using them (and
> users install this plugin).
> 
> Ian C.
> 
> 

I think this is worth exploring. I'm not sure that it is exactly the way
we want to go, but it might.

Also, I think you need to consider things like the emacs' repository,
which has about 95k revisions along the mainline. Would one of the
"mergelines" be the mainline itself? Doesn't it seem like an mid-sized
cap on the distance would be warranted? (Include every 500th revision
along a line, etc.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkl3h10ACgkQJdeBCYSNAAOSVACePKPckeGkPTVOP8UpF9+VMVxJ
WZsAoLzKCq1AYPWzz/7/0iVExeABIIIa
=+TQ5
-----END PGP SIGNATURE-----