Mutating history in Subversion and Bazaar
John Arbash Meinel
john at arbash-meinel.com
Thu Aug 31 18:54:04 BST 2006
Aaron Bentley wrote:
...
> But if you use a clever fetcher that works by slinging knit deltas
> around, then yes, it's conceivable to corrupt the knit. Knits store
> sha1 hashes, so the corruption would be easy to detect.
>
> I don't know whether we check sha-1's when copying deltas from one knit
> to another. We could do that, or we could make sure that the sha-1 of
> the parent in the target knit matches the sha-1 of the parent in the
> source knit. So for knits, the knit itself contains enough data to
> verify that you're not creating a version that cannot be constructed.
>
Unfortunately we don't. We check the sha-1 sums when we extract the
texts, but checking them at 'join()' time means that we have to extract
the content, applying it to the previous text if it is a delta, which
may mean extracting all of the other previous texts, and so on.
Robert explicitly changed the design of knit hunks to use the raw
revision id as the annotation marker (rather than an index, or something
like that), because it allows us to copy the gzip hunk without doing
anything but verifying that the first part of the hunk is correct.
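Roughly, the difference between that cheap copy-time check and a real
verification looks something like this (a Python sketch; the helper
names and record layout are illustrative, not the actual knit API):

import gzip
import hashlib
from io import BytesIO

def quick_check_hunk(raw_hunk, expected_revision_id):
    # Cheap copy-time check: decompress only the header line of the hunk
    # and confirm it names the version we asked for.  This does NOT
    # rebuild the fulltext or verify the sha1.
    header = gzip.GzipFile(fileobj=BytesIO(raw_hunk)).readline()
    return expected_revision_id.encode('utf-8') in header

def full_check_hunk(knit, revision_id, expected_sha1):
    # Real verification: rebuild the fulltext (possibly walking the whole
    # delta chain back to the last stored fulltext) and compare sha1s.
    # 'knit.get_fulltext' is an illustrative accessor, not the real API.
    text = knit.get_fulltext(revision_id)
    return hashlib.sha1(text).hexdigest() == expected_sha1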
The only thing that verifies everything before applying it is Bundles,
and they are horribly slow. I'm sure we can do a lot better than what we
have (by caching some of the intermediates, etc). But part of the issue
is our design.
We merge knit contents one knit at a time, not one revision at a time.
But our validation is across the whole revision (whether it is checking
the inventory => knit sha1, or if we want to extract a complete testament).
Right now we do:
1) download inventories to find out what file ids are modified; keep
   these cached in RAM.
2) for each file id found in (1):
   join the remote knit to the local knit (data + index),
   for a given set of revisions
3) join the remote inventory.knit and .kndx to the local inventory.knit
   (the downloaded contents should still be cached in RAM)
4) join the remote revisions.knit and .kndx to the local revisions.knit.
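In rough pseudo-Python (all of the names here are illustrative, not the
actual fetch code), that flow is:

def fetch(remote_repo, local_repo, revision_ids):
    # 1) read the remote inventories to learn which file ids changed;
    #    keep them cached in RAM for the later steps.
    inventories = remote_repo.get_inventories(revision_ids)
    modified_file_ids = collect_modified_file_ids(inventories)

    # 2) per-file knit joins (data + index) for just those revisions.
    for file_id in modified_file_ids:
        local_repo.file_knit(file_id).join(
            remote_repo.file_knit(file_id), revision_ids)

    # 3) join the remote inventory.knit/.kndx into the local one.
    local_repo.inventory_knit.join(remote_repo.inventory_knit, revision_ids)

    # 4) join the remote revisions.knit/.kndx into the local one.
    local_repo.revision_knit.join(remote_repo.revision_knit, revision_ids)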
Now, if we made some steps more explicit, we could do a little bit more
integrity checking. The first thing that I would add is:
2b) for each file id in (1):
    download the knit data
    extract each hunk into a fulltext and assert that the sha1
    value matches the sha1 recorded in the inventory from (1)
    copy the compressed hunk into the local data
That would at least ensure that our inventory agrees with our knits,
and it doesn't require us to cache the whole downloaded set, just enough
to build the full texts that are being checked.
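A rough sketch of what 2b could look like (again, illustrative names,
not the real knit API):

import hashlib

def join_file_knit_checked(remote_knit, local_knit, file_id,
                           revision_ids, inventory_sha1s):
    # Before copying each compressed hunk, rebuild its fulltext and check
    # the sha1 against the value recorded in the inventory from step (1).
    for revision_id in revision_ids:
        raw_hunk = remote_knit.get_raw_record(revision_id)
        text = remote_knit.get_fulltext(revision_id)
        expected = inventory_sha1s[(file_id, revision_id)]
        if hashlib.sha1(text).hexdigest() != expected:
            raise AssertionError("sha1 mismatch for %s in %s"
                                 % (revision_id, file_id))
        # The hunk checked out, so the compressed bytes can be copied
        # into the local knit as-is.
        local_knit.add_raw_record(revision_id, raw_hunk)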
To really get integrity, we should also do:
5) Change all of the previous steps to only store the value for the
   index in memory (don't write the index to disk yet).
   After (4), create a testament from the data in the inventory.knit and
   revisions.knit (you don't need the actual texts because we already
   checked that the sha1 sum in the inventory was correct).
   Once you have validated all the testaments (including checking any gpg
   signatures at this point), commit the .kndx records to disk.
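Sketched out (again with illustrative names, not real bzrlib calls):

def fetch_with_deferred_index(remote_repo, local_repo, revision_ids):
    # Append the knit data right away, but keep the new index records
    # only in memory until every testament (and any gpg signature on it)
    # has been checked.
    pending = []
    for knit, records in copy_all_knit_data(remote_repo, local_repo,
                                            revision_ids):
        pending.append((knit, records))   # .kndx entries not written yet

    for revision_id in revision_ids:
        testament = make_testament(local_repo, revision_id)
        check_signature_if_present(local_repo, revision_id, testament)

    # Only after validation do the new records become visible.  If we
    # are interrupted before this point, the unindexed data is ignored.
    for knit, records in pending:
        knit.write_index_records(records)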
We already cache the contents of knit indexes in memory, so this doesn't
add any extra memory consumption. It does add a little bit of bloat in
the case that 1-4 happened but someone hit ^C at step 5. But it adds
integrity, because if there isn't an index entry, we ignore the data in
the knit file itself.
It also gives us a reasonable time to validate all of the gpg signatures.
Probably the big performance cost would be the time to extract all of
the full texts, but if you have just extracted a full text and are
stacking a new one on top of it, we should be able to do that quite fast.
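Something like this (illustrative; 'apply_delta' and the hunk attributes
are made up for the sketch):

def build_fulltexts_in_order(hunks):
    # Hunks arrive parents-before-children, so each delta is applied to
    # the fulltext we just finished building for its parent.
    fulltexts = {}
    for hunk in hunks:
        if hunk.parent is None:
            fulltexts[hunk.revision_id] = hunk.lines    # stored fulltext
        else:
            parent = fulltexts[hunk.parent]             # already in memory
            fulltexts[hunk.revision_id] = apply_delta(parent, hunk.delta)
    return fulltexts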
...
> I think it sounds pretty good. Unfortunately, to be really sure there
> are no discrepancies between two repositories, you have to compare every
> common revision, because the discrepancy may be some time long in the past.
The other problem with knits is that the sha1 sum is not stored in the
index; it is stored as part of the first line in the gzip'd data chunk.
So you can't just read the two indexes and compare all of their sha1 sums.
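So a comparison tool would have to do something like this for every
record (the header layout is assumed from the above, not authoritative):

import gzip
from io import BytesIO

def sha1_from_record(data_file, offset, size):
    # The sha1 is not in the .kndx; it lives on the first line inside
    # the gzip'd hunk, so we have to open each record to read it.
    data_file.seek(offset)
    raw = data_file.read(size)
    header = gzip.GzipFile(fileobj=BytesIO(raw)).readline()
    return header.split()[-1]   # assume the digest is the last field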
>
>>> Violating the integrity of a distributed database is certainly not a
>>> nice thing to do, but I hope that we can find a way to control the
>>> splash damage enough to make transparent interoperability with other
>>> systems a reliable proposition.
>
> I think we can prevent splash damage. I'm not sure what we do when
> we've discovered that a revision's data is inconsistent.
Mostly just puke, because it is better than getting corruption.
>
> Aaron
John
=:->