Introduction to history deltas

Robert Collins robertc at robertcollins.net
Tue Dec 6 23:13:31 GMT 2005


On Tue, 2005-12-06 at 08:58 -0500, Aaron Bentley wrote:
> 
> Jan Hudec wrote:
> > On Tue, Dec 06, 2005 at 01:10:13 -0500, Aaron Bentley wrote:
> >>That is similar to knits.  I believe knits could be used to represent
> >>the diffs for history deltas, if using a weaveish format was desired.
> > 
> > 
> > There is one interesting point in the original suggestion: Instead of
> > storing all chunks for a file together, it stores all chunks for a
> > revision together.
> 
> Right.  So one way of looking at it is as a hunk organization method.
> You would either organize hunks by file (like weaves do) or by revision
> (history deltas).
> 
> > That would speed up remote transactions, because
> > you'd not have to copy all the files to create their new versions
> > atomically, but instead atomically install one file. For reading, index
> > files are needed in either case.
> 
> It's not a clear win.  With history deltas, your latency scales
> per-revision; with file-knits, your latency scales per-file.  So for
> operations like branch, you'd rather have per-file latency, (because you
> usually have more revisions than files) but for operations like pulling
> 3 revisions, you'd rather have per-revision latency.
> 
> With bzr.dev, the crossover point is 196 revisions.  Downloading 197
> revisions with history deltas would be 197 downloads, but with per-file
> knits, it would be 196.  (That is, if you have revno 1276, and you're
> updating to the current 1473.)

That's an awesome statistic. I think the critical thing to consider is
this crossover.
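
To put that crossover in concrete terms, here is a rough model of the
arithmetic (my own sketch; the function names are made up, and I'm
reading Aaron's 196 as roughly the number of per-file knits a full
fetch touches):

  def downloads_with_history_deltas(revisions_pulled):
      # one delta file to fetch per revision pulled
      return revisions_pulled

  def downloads_with_file_knits(knits_touched):
      # one knit to fetch per file touched, however many revisions
      # are being pulled
      return knits_touched

  # Aaron's bzr.dev figures: pulling 197 revisions (1276 -> 1473)
  # involves roughly 196 knits, the first point at which per-file
  # storage needs fewer downloads.
  assert downloads_with_history_deltas(197) > downloads_with_file_knits(196)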

One thing that occurs to me is that tracking a heavily merged-into
branch affects the crossover - it will accrue relatively few revisions
itself, but each of those merges hauls in 20 or 30 or more revisions.
That means only 6 or 7 commits on the 'mainline' before pulling from
multiple knits is more efficient (in the bzr.dev case).
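
Rough arithmetic behind that, with assumed numbers:

  revisions_per_mainline_commit = 30.0   # assumed: "20 or 30 or more"
  crossover_revisions = 196              # Aaron's bzr.dev figure
  # 196 / 30 = 6.5..., so the seventh mainline commit pushes a pull
  # past the crossover point
  print(crossover_revisions / revisions_per_mainline_commit)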

I've got a very bad feeling about grouping the data by revision until
someone does some hard statistics on the mean and standard deviation of
the crossover point for different projects. We currently have a
significant latency problem with data grouped by file id. Arch had a
huge latency problem with data grouped by revision. And for remote
access - such as determining the history of a single file - that really
bites in Arch, as every revision has to be downloaded. If our data is
grouped by revision we have to download every revision that touched the
file; if our data is grouped by file id we can pull down just one file.
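
To illustrate what each grouping means for that case (only a sketch,
not bzrlib code; the transport.get calls and file names are made up):

  def file_history_grouped_by_fileid(transport, file_id):
      # one knit holds every version of the file: a single fetch
      return [transport.get(file_id + '.knit')]

  def file_history_grouped_by_revision(transport, revisions_touching_file):
      # every revision that touched the file has to be fetched just to
      # reconstruct that one file's history
      return [transport.get(rev_id + '.delta')
              for rev_id in revisions_touching_file]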

> ..That suggests that it's an advantage to optimize for
> single file access, locally.

And I agree with this 100%.


My summary is this:
We should trade off the actual data storage and transmission
requirements so that:
* local operations are very fast
* the data transmitted for push and pull is proportional to the
changes, not to the total history size.

We should treat latency optimisation as a problem orthogonal to those.
There are many possible solutions:
 * smart servers that stream data
 * async behaviour in the core library
 * dedicated, optimised routines for hot spots that are async
internally but encapsulate it (sketched below)
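
As a strawman for that last item (not real bzrlib code; the transport
object and helper name are made up, and threads stand in for real
async I/O), the shape might be something like:

  from concurrent.futures import ThreadPoolExecutor

  def fetch_many(transport, names, max_in_flight=10):
      # Issue many small fetches concurrently so the per-request
      # latency overlaps, while callers still see a plain blocking
      # call that returns the results in order.
      with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
          return list(pool.map(transport.get, names))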

Let's not compromise the two constraints above simply because we have
not implemented any of the three latency-specific solutions ;)


Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.