[MERGE/RFC] partial delta generation

Tue Feb 3 05:31:35 GMT 2009

For logging directories and multiple files, the key
technical challenge is fast generation of partial deltas.
A partial delta for a given set of files and directories
includes those items, their parents and their children.

Partial deltas are desirable/needed for two reasons:

* an empty partial delta means a revision doesn't match
* display to the user when the -v flag is given.

There are at least 2 ways of generating a partial delta:

1. Create a full delta and filter it (as bzr.dev does now)
2. Create partial revision trees and compare those

And at least two ways of building a partial revision tree:

a. get the full inventory and filter it
b. only deserialise the interesting bit of the inventory.

Here are the times taken for the 3 algorithms on my laptop
for "log --short -r-100..-1 ChangeLog" on the emacs-merges
trunk:

1.  40.7s
2a. 24.6s
2b. 13.4s

And for the last 500 mainline revisions, the times are:

1.  3m05.6s
2a. 1m48.9s
2b. 0m55.3s

Keeping in mind that the Emacs trunk is deep (100k revisions)
but not particularly wide (~3k files), my expectation is that
these performance differences across the various partial
delta generation algorithms will be even larger for huge
trees like Mozilla (~50k files) and OOo.org (~87k files).
I hope to run some tests on OOo.org later this week to
confirm this.

FWIW, the per-file-graph algorithm currently used in bzr.dev
takes 32.9s and 34.5 secs respectively for the last 100 and
500 tests above. So for ranges up to around 250 revisions,
matching against partial deltas is faster. (It's also more
accurate in that it includes removes, kind changes, etc.
while, IIUIC, the per-file graph algorithm only finds
modifications and additions?) It clearly doesn't scale as
well as per-file graph though, so I'm planning to keep using
per-file graph for the single file - that isn't a directory -
use case. (When the split inventory format arrives, we can
hopefully get the delta really quickly so that will change
the timings and perhaps the balance here yet again.)

Anyhow, the bottom line is that I feel algorithm 2b is the
way to go, at least until the split inventory format lands
and some optimisations are added to get back a TreeDelta
object directly from the lower layers.

The attached patch implements 2b (and falls back to 2a if
it has to). I added tests at multiple layers: the xml
serialisation layer, the inventory layer and the top
layer, namely repository.get_revision_delta(). I *could*
add more tests for the numerous public repository methods
that have gained an extra parameter but I'm not sure if
that would add any real value. Reviewers may disagree,
of course.

Finally, I've made this an RFC because I'm certainly no
expert in this area and I'm not sure if my serialisation
code needs tweaking for rich-roots, etc. It probably does
and I probably need more tests for that?

Ian C.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: partial-deltas.patch
Type: text/x-diff
Size: 43987 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20090203/9b0d456d/attachment-0001.bin