Early numbers on multi-parent diffs
Aaron Bentley
aaron.bentley at utoronto.ca
Wed Apr 11 19:21:00 BST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John Arbash Meinel wrote:
> Aaron Bentley wrote:
>> Aaron Bentley wrote:
>>> I've written an implementation of multi-parent diffs, and early numbers
>>> do show significant space wins.
Okay, I've now added annotations, snapshots and gzipping. I think the
outputs are equivalent. I've even forced snapshotting to use the same
revisions as the knit. So I think this is, at last, a fair comparison.
But the size advantange seems to have dropped significantly.
On my box, the knit for builtins.py is 2.6M, and the pseudo-knit is
2.2M. Uncompressed, they are 15M and 16M. It seems like what knit
diffs lack in redundancy-reduction, they almost make up for in pure
terseness.
GZIPPED UNCOMPRESSED
OLD NEW OLD NEW
builtins.py 2.6M 2.2M 83% 16M 14M 86%
README 14K 16K 86% 58K 51K 88%
NEWS 1.8M 1.5M 82% 6.8M 5.8M 85%
inventory 18M 13M 69% 62M 42M 68%
> One of the ones I'm most interested in would be inventory.knit. Care
> to try it there? (I realize that will take the longest, but might also
> see the largest gain).
Yeah, I had to disable caching to get it to complete in a reasonable
timeframe. For some strange reason, the time to retrieve a given
version is slower for MultiParent than for standard knit: 10.2s vs 5.2s.
I'll have to investigate further, but my guess is that the snapshots
are suboptimally placed, causing us to go far past 25 revisions
following the righthand parent.
>> More news: they're also fast on read. I implemented a non-naive text
>> regeneration routine, and it can regenerate a revision of NEWS 1000
>> times in 1.81 seconds. This compares with KnitVersionedFile, which
>> takes 20.69 seconds, with caching enabled. (i.e. 11x slower).
> I know our current knit code is pretty sub-optimal about extracting
> texts. Partially because it tries to optimize for the (uncommon) case of
> extracting multiple texts at the same time.
I guess it depends what you mean by "uncommon". Log does it, and that
was what I was optimizing, when I tweaked the code. Diff doesn't do it,
but probably should.
I doubt those optimizations had much effect on single-version retrieval,
but the existing implementation was relatively naive, so an approach
like this one would probably improve single-version retrieval.
> I'm curious what your timings would be if you did a similar work on a
> regular knit.
A regular knit doesn't expose information about the relative positions
of common text sections, so that would involve either pre-processing the
data or changing the algorithm. That's certainly possible, but it may
make sense to change our storage to support our operations better.
Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGHScM0F+nu1YWqI0RAmZ8AJ4zeaOJ4oI8FX/MerGrA3aGfA0a8ACfezMW
fmAF0xXoYjT4Csj8FQr49aw=
=RkkT
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list