Smart revision fetching update
Andrew Bennetts
andrew at canonical.com
Thu Aug 9 07:27:57 BST 2007
Hi all,
Thanks to a dose of the flu, I didn't get support for transferring revision
data efficiently over the smart server protocol ready in time for 0.19 (0.90).
The good news is that it will have most of the 0.91 cycle to get settled and
nicely polished.
Of the code in my http://people.ubuntu.com/~andrew/bzr/repo-refactor branch,
about two-thirds has been extracted into independent branches and reviewed
already, so I'll land that as soon as 0.90 opens. The remaining third
basically adds a Repository.fetch_revisions smart method and teaches the
client side to use it, and I'll send that to the list within a day.
The work to be done from there:
* Add a specialised smart method for the initial pull case. At the moment
with my code the initial pull of a branch retrieves the branch's ancestry,
and then sends a big Repository.fetch_revisions call explicitly listing
every revision. Branching bzr that way sends over 12000 revision IDs
across the wire, the same ones the client just downloaded. So probably
I'll add a “Repository.fetch_all_revisions” smart request that takes just
the tip revision ID instead (first sketch below). This will make this code
a strict improvement over the current tarball hack.
* Write a smart method for pushing revisions. This is basically symmetrical
with the pull case: a “Repository.add_revisions” request whose body is a
revision data stream (second sketch below).
* Write fallbacks for transfers between repositories of different formats,
where the raw knit data can't just be blindly copied. Perhaps we need a
new set of parameterised tests for this (third sketch below)?
* Measure, and probably fix, memory consumption. I expect that at the
moment my code buffers the entire request/response bodies in memory, which
is Not Good. It would be nice to have something similar to the benchmark
test suite that measures the memory high-water mark of various operations.
Or perhaps just set a limit with ulimit/setrusage, and then feed in a data
set larger than the limit: if the operation trips the limit, the memory
consumption needs fixing (last sketch below). A quick and dirty hack would
be to limit requests to e.g. 100 revisions at a time, but I think we can
fix this properly.
And, of course, test this as much as possible in real usage! :)
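To make the first item concrete, here is a rough sketch of what the server
side of a “Repository.fetch_all_revisions” verb might look like, assuming
the SmartServerRepositoryRequest framework in bzrlib.smart.repository. The
class name and the serialise_revision_data helper are placeholders, not
code from the branch:

    from bzrlib.smart.repository import SmartServerRepositoryRequest
    from bzrlib.smart.request import SuccessfulSmartServerResponse

    class SmartServerRepositoryFetchAllRevisions(SmartServerRepositoryRequest):
        """Stream every revision in the ancestry of a single tip.

        The client sends only the tip revision ID; the server walks the
        ancestry itself, so thousands of revision IDs never have to cross
        the wire a second time.
        """

        def do_repository_request(self, repository, tip_revision_id):
            # get_ancestry returns a list whose first element is None.
            revision_ids = repository.get_ancestry(tip_revision_id)[1:]
            # serialise_revision_data stands in for whatever encoding the
            # revision data stream ends up using.
            body = serialise_revision_data(repository, revision_ids)
            return SuccessfulSmartServerResponse(('ok',), body)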
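The push direction would mirror that. In the smart request framework a
request that expects a body returns None from do() and receives the bytes
in do_body(); again, the names below are only illustrative:

    class SmartServerRepositoryAddRevisions(SmartServerRepositoryRequest):
        """Accept a revision data stream pushed by the client."""

        def do_repository_request(self, repository, *args):
            # The revision data arrives as the request body, so just
            # remember the repository and wait for do_body.
            self._repository = repository
            return None

        def do_body(self, body_bytes):
            # deserialise_and_add_revision_data is again a placeholder
            # for decoding the stream and inserting it into the repository.
            deserialise_and_add_revision_data(self._repository, body_bytes)
            return SuccessfulSmartServerResponse(('ok',))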
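For the cross-format fallbacks, the parameterised tests might look roughly
like the case below, multiplied over every (source, target) format pair
that can't share raw knit data. The scenario machinery is elided and the
format names are only examples:

    from bzrlib.tests import TestCaseWithTransport

    class TestCrossFormatFetch(TestCaseWithTransport):
        """One scenario of a parameterised cross-format fetch test."""

        # These would be filled in per scenario by the test multiplier.
        source_format = 'weave'
        target_format = 'knit'

        def test_fetch_converts_data(self):
            source = self.make_branch_and_tree('source',
                format=self.source_format)
            source.commit('a revision', rev_id='rev-1')
            target = self.make_repository('target',
                format=self.target_format)
            # Fetching across formats must convert the data, not blindly
            # copy the raw knits.
            target.fetch(source.branch.repository, revision_id='rev-1')
            self.assertTrue(target.has_revision('rev-1'))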
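And for the memory question, the ulimit/setrusage idea can be as small as
this helper, sketched with Python's resource module (the function name and
the 50MB cap are arbitrary):

    import resource

    def run_with_memory_cap(operation, limit_bytes=50 * 1024 * 1024):
        """Run operation() with the process address space capped.

        Feed the operation a data set larger than the cap: if it buffers
        entire request/response bodies in memory it will hit MemoryError,
        which is exactly the failure we want a test to surface.
        """
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
        try:
            return operation()
        finally:
            # Restore the original limit so later tests are unaffected.
            resource.setrlimit(resource.RLIMIT_AS, (soft, hard))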
After that, I expect we'll want to take a close look at logs of various
operations with the -Dhpss flag on, and see if there's other low-hanging fruit
to fix.
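For reference, collecting such a log is just (the URL is a stand-in):

    bzr -Dhpss pull bzr://example.com/some/branch
    less ~/.bzr.log    # each smart request/response is logged here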
People who are keen are welcome to check out the “repo-refactor” branch
above and start testing now. It should already make a noticeable
difference to pulling, although I haven't yet measured how much.
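Something along these lines should do; the pull URL is only an example,
and timing the same pull with a released bzr makes a useful comparison:

    bzr branch http://people.ubuntu.com/~andrew/bzr/repo-refactor
    cd repo-refactor
    time ./bzr pull bzr+ssh://example.com/some/branch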
-Andrew.