HPSS graph-difference algorithm
John Arbash Meinel
john at arbash-meinel.com
Thu Apr 5 22:45:41 BST 2007
Robert Collins wrote:
> One common task we have is to determine the revisions that the server
> has for a graph, that the client does not.
>
> I've prototyped an approach that I believe will do very well in high
> latency situations: From my house to London, for the bzr repository, I
> think it can deliver a compact description of the needed revisions in 4
> round trips; sending under 64Kb of data (the revision.kndx is 1.2Mb for
> me). It takes my code about 5 seconds to do the processing of the data
> to achieve this.
>
> The traffic breakdown looks like:
> request sent bytes recieved bytes
> 0 57 16211
> 1 94 27745
> 2 11516 1191
> 3 221 50
>
> Using lftp to get revisions.kndx on the hpss branch takes me:
> 1403989 bytes transferred in 34 seconds (40.3K/s)
> And thats after connecting :(.
Well, it certainly is an improvement. I'm curious if we can do better
than 5s, though. Are you saying it is startup + processing time for 5s,
or is that including the round-trips, or what?
>
> I'm skipping most of the details of the algorithm in this email - its
> detailed in
> http://codebrowse.launchpad.net/~lifeless/bzr/graphdelta/annotate/robertc%40robertcollins.net-20070403074302-fo3x80f961hh7e7f?file_id=server.txt-20070401020531-rqa7d77i22idnj78-2
>
I read that in your commit messages, and the messaging protocol section
wasn't very clear.
Specifically, I don't understand things like:
Server Graph: {A:[], B:[A], C:[A], D:[B, C]}
Client Graph: {A:[], B:[A], E:[B]}
Client message 1:
get_missing
response:
unknown=(HEADS=[], STOPS=[])
sample=(0, [A, D])
client message 2:
get_missing((HEADS=[], STOPS=[]), (0, [True, False]))
response:
unknown=(HEADS=[B, C], STOPS=[A])
sample=(0, [B, C])
client message 3:
get_missing((HEADS=[B,C], STOPS=[A]), (0, [True, False]))
response:
'complete', missing=(HEADS=[], STOPS=[B, A])
Mostly I don't understand the (0, [True, False])
What is the '0'? I'm guessing it is saying:
you returned HEADS=[B, C], STOPS=[A], and requested a sample of (0, [B, C]).
Of which I have B but not C.
So I'm returning the HEADS and STOPS that you sent (so you can compute
the 'sample' again, and know what I do and don't have.
...
> I'm interested in peer review of this; so far Martin and Andrew have had
> me bend their ear with detailed blow-by-blow accounts; and their good
> questions and suggestions have influenced it a lot. I dont think having
> multiple folk coding on this particular feature will help deliver it
> much faster; folk that want to help should help Andrew with the merge of
> the bulk of the hpss branch; which this work will depend on.
>
> My goal for 0.16 with this is to make pull and push with with this to
> the extent that determining what revision ids to pull is much faster;
> for 0.17 we'll want a dedicated bundle format for this sort of scenario.
>
> -Rob
>
I'm certainly trying to follow along. I've been a bit confused about
what HEADS, STOPS, and sample are.
I've tried going through your test cases to understand GraphDelta, but I
think one side had
f - e - c - b - a
\ /
d
(f being the first commit, and a being the newest)
And the other graph had
f - e - c
And you gave heads == c, and stops == f
I would have thought that 'e' would be an important node. (Moreso than 'f').
I tried to sort things out with graphviz, and graphs like:
digraph merge_end {
a -> b
b -> {c d}
c -> e
d -> e
e -> f
{ edge [color="#ff0000"]
c -> e
e -> f
}
c [style=filled, fillcolor="#00ff00"]
f [style=filled, fillcolor="#0000ff"]
}
(One of my big complaints was that I wasn't able to get graphviz to
invert the direction. And I'm used to having children at the bottom, not
the top. But rankdir=TB did the same thing as rankdir=BT (even though
they seem to claim different things.
I'm not sure if you can explain it in a non-interactive session, though.
So maybe we should just plan for a phone call or at least irc chat.
John
=:->
PS> I think having code to make graph discovery fast is a great idea. I
know we talked about it a bit in Sydney, and I think it is really good
for this sort of thing.
The other thing I would mention, though. Is that most of the time you
will have an entry point into the graph. (Because you are opening a
Branch, and you want to get the revisions for that Branch that you don't
have).
Branch6 doesn't include the 'revision-history' (which is really a good
thing), but it still means that your first request doesn't have to be:
get_missing
it can be
get_missing(tip=A)
John
=:->
More information about the bazaar
mailing list