Early numbers on multi-parent diffs

Aaron Bentley aaron.bentley at utoronto.ca
Wed Apr 11 19:21:00 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Aaron Bentley wrote:
>> Aaron Bentley wrote:
>>> I've written an implementation of multi-parent diffs, and early numbers
>>> do show significant space wins.

Okay, I've now added annotations, snapshots and gzipping.  I think the
outputs are equivalent.  I've even forced snapshotting to use the same
revisions as the knit.  So I think this is, at last, a fair comparison.

But the size advantange seems to have dropped significantly.

On my box, the knit for builtins.py is 2.6M, and the pseudo-knit is
2.2M.  Uncompressed, they are 15M and 16M.  It seems like what knit
diffs lack in redundancy-reduction, they almost make up for in pure
terseness.

            GZIPPED       UNCOMPRESSED
            OLD  NEW      OLD  NEW
builtins.py 2.6M 2.2M 83% 16M  14M  86%
README      14K  16K  86% 58K  51K  88%
NEWS        1.8M 1.5M 82% 6.8M 5.8M 85%
inventory   18M  13M  69% 62M  42M  68%

> One of the ones I'm most interested in would be inventory.knit. Care
> to try it there? (I realize that will take the longest, but might also
> see the largest gain).

Yeah, I had to disable caching to get it to complete in a reasonable
timeframe.  For some strange reason, the time to retrieve a given
version is slower for MultiParent than for standard knit: 10.2s vs 5.2s.
 I'll have to investigate further, but my guess is that the snapshots
are suboptimally placed, causing us to go far past 25 revisions
following the righthand parent.

>> More news: they're also fast on read.  I implemented a non-naive text
>> regeneration routine, and it can regenerate a revision of NEWS 1000
>> times in  1.81 seconds.  This compares with KnitVersionedFile, which
>> takes 20.69 seconds, with caching enabled.  (i.e. 11x slower).

> I know our current knit code is pretty sub-optimal about extracting
> texts. Partially because it tries to optimize for the (uncommon) case of
> extracting multiple texts at the same time.

I guess it depends what you mean by "uncommon".  Log does it, and that
was what I was optimizing, when I tweaked the code.  Diff doesn't do it,
but probably should.

I doubt those optimizations had much effect on single-version retrieval,
but the existing implementation was relatively naive, so an approach
like this one would probably improve single-version retrieval.

> I'm curious what your timings would be if you did a similar work on a
> regular knit.

A regular knit doesn't expose information about the relative positions
of common text sections, so that would involve either pre-processing the
data or changing the algorithm.  That's certainly possible, but it may
make sense to change our storage to support our operations better.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGHScM0F+nu1YWqI0RAmZ8AJ4zeaOJ4oI8FX/MerGrA3aGfA0a8ACfezMW
fmAF0xXoYjT4Csj8FQr49aw=
=RkkT
-----END PGP SIGNATURE-----



More information about the bazaar mailing list