RFC: parallelisable, resumable fetching
Robert Collins
robert.collins at canonical.com
Sun Jul 4 00:44:27 BST 2010
This is a bit of a far-out idea, but having had it, I've turned it
over in my head a couple of times and it still seems plausible.
I'm sending this mail mainly to get reactions from folk: I've no
immediate plans to do a spike on this.
We now have, thanks to some excellent work by Andrew, the idea of
*suspended* insertions into pack databases (0.92 up to and including
2a, but only active for the stackable versions).
The way this works is simple: you write a pack and its indices to
disk, hash it, and then rather than insert it, you stop. The hash is
returned to the controlling logic and can be supplied as a dependency
for a later insertion.
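For concreteness, here is roughly how I picture that looking from the
repository API side. This is only a sketch, and the names and signatures
(start_write_group, suspend_write_group, insert_record_stream and so on)
are from memory, so treat them as illustrative rather than gospel:

    def write_suspended_pack(repo, record_stream):
        """Write one pack plus its indices, hash it, and stop short of
        insertion, returning the suspend tokens."""
        repo.lock_write()
        try:
            repo.start_write_group()
            # This writes a pack and its indices into the upload area.
            repo.texts.insert_record_stream(record_stream)
            # Instead of commit_write_group(), suspend: the pack stays
            # on disk, and the returned tokens name it so a later
            # insertion can be given it as a dependency.
            return repo.suspend_write_group()
        finally:
            repo.unlock()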
If we wrote N suspended packs concurrently, we could parallelise all
but the final sanity-check-of-suspended-packs.
This would involve:
 - the server running N backends and multiplexing the record streams;
 - the client farming out the validation of the received stream
   (perhaps? this might not be an early bottleneck) and then inserting
   them all at once, roughly as in the sketch below.
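To make that last step concrete: assuming each parallel worker hands
back the tokens from its own suspended write group, the controlling
process could (hypothetically) finish with a single resume and commit,
leaving the sanity check of the suspended packs as the only serial step:

    def insert_all_suspended_packs(repo, per_worker_tokens):
        # per_worker_tokens: one list of suspend tokens per worker.
        all_tokens = [t for tokens in per_worker_tokens for t in tokens]
        repo.lock_write()
        try:
            # One resume + one commit covers every suspended pack.
            # Whether resume_write_group is happy being handed tokens
            # from several separate suspensions at once is an
            # assumption that would need verifying.
            repo.resume_write_group(all_tokens)
            repo.commit_write_group()
        finally:
            repo.unlock()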
Where this would benefit us is a bit of a murky question to answer,
because we're currently bottlenecked on CHK analysis, not on raw text
transmission, so we'd really want to get that analysis parallelising too.
If we made the list of pack files received during the fetch something
we journal, we could in principle make fetching resumable, a
long-requested feature.
E.g. we could have an 'in progress' file for 'branch' and 'checkout'
operations which lists the original parameters, source tip, URLs and so
on; if it all matches, the operation picks up the list of
already-transmitted packs and runs with it.
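Purely as illustration, that journal could be something very small; the
path, field names and format below are all invented, but it shows how
little state we actually need to carry:

    import json
    import os

    # Hypothetical location and format for the 'in progress' journal.
    JOURNAL_PATH = '.bzr/repository/fetch-in-progress'

    def save_fetch_journal(source_url, tip_revid, suspend_tokens):
        data = {'source': source_url, 'tip': tip_revid,
                'packs': suspend_tokens}
        with open(JOURNAL_PATH, 'w') as f:
            json.dump(data, f)

    def resumable_packs(source_url, tip_revid):
        # Only reuse the journal if the original parameters still match.
        if not os.path.exists(JOURNAL_PATH):
            return []
        with open(JOURNAL_PATH) as f:
            data = json.load(f)
        if data['source'] == source_url and data['tip'] == tip_revid:
            return data['packs']
        return []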
-Rob