HPSS graph-difference algorithm

Thu Apr 5 22:45:41 BST 2007

Robert Collins wrote:
> One common task we have is to determine the revisions that the server
> has for a graph, that the client does not.
> 
> I've prototyped an approach that I believe will do very well in high
> latency situations: From my house to London, for the bzr repository, I
> think it can deliver a compact description of the needed revisions in 4
> round trips; sending under 64Kb of data (the revision.kndx is 1.2Mb for
> me). It takes my code about 5 seconds to do the processing of the data
> to achieve this.
> 
> The traffic breakdown looks like:
> request  sent bytes  recieved bytes
> 0                57           16211
> 1                94           27745
> 2             11516            1191
> 3               221              50
> 
> Using lftp to get revisions.kndx on the hpss branch takes me:
> 1403989 bytes transferred in 34 seconds (40.3K/s) 
> And thats after connecting :(.

Well, it certainly is an improvement. I'm curious if we can do better
than 5s, though. Are you saying it is startup + processing time for 5s,
or is that including the round-trips, or what?

> 
> I'm skipping most of the details of the algorithm in this email - its
> detailed in 
> http://codebrowse.launchpad.net/~lifeless/bzr/graphdelta/annotate/robertc%40robertcollins.net-20070403074302-fo3x80f961hh7e7f?file_id=server.txt-20070401020531-rqa7d77i22idnj78-2
> 

I read that in your commit messages, and the messaging protocol section
wasn't very clear.

Specifically, I don't understand things like:
Server Graph: {A:[], B:[A], C:[A], D:[B, C]}
Client Graph: {A:[], B:[A], E:[B]}

Client message 1:
get_missing
response:
unknown=(HEADS=[], STOPS=[])
sample=(0, [A, D])
client message 2:
get_missing((HEADS=[], STOPS=[]), (0, [True, False]))
response:
unknown=(HEADS=[B, C], STOPS=[A])
sample=(0, [B, C])
client message 3:
get_missing((HEADS=[B,C], STOPS=[A]), (0, [True, False]))
response:
'complete', missing=(HEADS=[], STOPS=[B, A])

Mostly I don't understand the (0, [True, False])

What is the '0'? I'm guessing it is saying:
you returned HEADS=[B, C], STOPS=[A], and requested a sample of (0, [B, C]).
Of which I have B but not C.

So I'm returning the HEADS and STOPS that you sent (so you can compute
the 'sample' again, and know what I do and don't have.

...

> I'm interested in peer review of this; so far Martin and Andrew have had
> me bend their ear with detailed blow-by-blow accounts; and their good
> questions and suggestions have influenced it a lot. I dont think having
> multiple folk coding on this particular feature will help deliver it
> much faster; folk that want to help should help Andrew with the merge of
> the bulk of the hpss branch; which this work will depend on.
> 
> My goal for 0.16 with this is to make pull and push with with this to
> the extent that determining what revision ids to pull is much faster;
> for 0.17 we'll want a dedicated bundle format for this sort of scenario.
> 
> -Rob
> 

I'm certainly trying to follow along. I've been a bit confused about
what HEADS, STOPS, and sample are.

I've tried going through your test cases to understand GraphDelta, but I
think one side had

f - e - c - b - a
      \   /
        d

(f being the first commit, and a being the newest)

And the other graph had

f - e - c

And you gave heads == c, and stops == f

I would have thought that 'e' would be an important node. (Moreso than 'f').

I tried to sort things out with graphviz, and graphs like:
digraph merge_end {
  a -> b
  b -> {c d}
  c -> e
  d -> e
  e -> f

  { edge [color="#ff0000"]
          c -> e
          e -> f
  }

  c [style=filled, fillcolor="#00ff00"]
  f [style=filled, fillcolor="#0000ff"]
}

(One of my big complaints was that I wasn't able to get graphviz to
invert the direction. And I'm used to having children at the bottom, not
the top. But rankdir=TB did the same thing as rankdir=BT (even though
they seem to claim different things.

I'm not sure if you can explain it in a non-interactive session, though.
So maybe we should just plan for a phone call or at least irc chat.

John
=:->

PS> I think having code to make graph discovery fast is a great idea. I
know we talked about it a bit in Sydney, and I think it is really good
for this sort of thing.

The other thing I would mention, though. Is that most of the time you
will have an entry point into the graph. (Because you are opening a
Branch, and you want to get the revisions for that Branch that you don't
have).

Branch6 doesn't include the 'revision-history' (which is really a good
thing), but it still means that your first request doesn't have to be:

get_missing

it can be

get_missing(tip=A)

John
=:->