[merge][#288751] avoid creating text deltas spanning repository stacks

Martin Pool mbp at canonical.com
Wed Nov 19 07:49:28 GMT 2008

On Sat, Nov 15, 2008 at 2:30 AM, Aaron Bentley <aaron at aaronbentley.com> wrote:
> bb:tweak
> Fixing this is an immediate benefit.  I don't think this does any actual
> harm, but there are a bunch of issues worth looking at and possibly cleaning up.

Thanks for the speedy review.

> Martin Pool wrote:
>> This patch fixes a bug which causes some trouble with
>> stacked repositories.  The design was that text deltas would
>> not span repository stacks, for the sake of robustness, and
>> so that we can upgrade the inventory representation in
>> underlying repositories without causing trouble to the
>> stacked repository.  However, this was not checked properly.
> This looks fine, so far.  I think with these changes, "bzr pack" would
> fix external references.  Should bzr recommend this if it encounters
> deltas that span repos?

If pack fixed them, it would make sense to recommend that people run
it, but as John says that is not at present the case.  It could
probably be changed to do so.

>> This adds has_key(key) to some index(like) objects.  I
>> realize we don't generally want people checking for one key
>> at a time, but in some cases that's all the caller
>> reasonably has, and it's better than making them repeat
>> boilerplate to use get_parent_map.
> I don't mind the existence of the API, but I'm not sure I agree with the
> rationale.
> Most of our APIs should be dealing with multiple texts, so their
> implementations should be

I agree.  However, if the code in a particular place deals with one
key at a time, I think it's clearer to have it say so directly, rather
than using an API like get_parent_map that doesn't really express the
right intention, for a few reasons:

 * If we're looking for places to make things work on larger batches,
it's arguably easier to find them when call sites are explicitly
single-key, rather than calling a multi method with one value.  For
instance, you can grep for has_key() calls.

 * Simpler code is better.

 * Repetitive boilerplate risks errors.  For instance, the code I
changed was checking the returned dictionary in a way that was
approximately but not quite the right assertion.  If we change the
semantics of these methods (eg John's recent thread about returning
extra values) it can break call sites like this.

Is this a better rationale?
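To make the rationale concrete, here is a minimal sketch -- the class and data are invented for illustration, not bzrlib's real index types -- of has_key built on top of a multi-key get_parent_map, next to the boilerplate every call site otherwise has to repeat:

```python
class FakeIndex:
    """Toy stand-in for an index-like object; not a real bzrlib class."""

    def __init__(self, parent_map):
        self._parent_map = parent_map

    def get_parent_map(self, keys):
        # Multi-key API: returns a dict containing only the present keys.
        return dict((k, self._parent_map[k]) for k in keys
                    if k in self._parent_map)

    def has_key(self, key):
        """Check if this index has one key.

        If it's possible to check for multiple keys at once through
        calling get_parent_map that should be faster.
        """
        return key in self.get_parent_map([key])


index = FakeIndex({('rev-1',): ()})

# With only the multi method, every call site repeats this boilerplate:
key = ('rev-1',)
present = key in index.get_parent_map([key])

# The single-key form states the same intent directly:
assert index.has_key(key)
assert not index.has_key(('rev-2',))
```

The single-key form is also the thing you can grep for when hunting for call sites to batch up later.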

>> +    def has_key(self, key):
>> +        """Check if this index has one key.
>> +
>> +        If it's possible to check for multiple keys at once through
>> +        calling get_parent_map that should be faster.
>> +        """
>> +        return (key in self.get_parent_map([key]))
> I'm surprised to see so many identical implementations of has_key.
> Perhaps we could have one implementation and just assign it to the class?

You're right, that was kind of lazy.
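Something like the following would do it -- a sketch (class names invented) of defining the function once at module level and assigning it to each index-like class, so there is a single implementation rather than several identical bodies:

```python
def _mod_has_key(self, key):
    """Check if this index has one key.

    If it's possible to check for multiple keys at once through
    calling get_parent_map that should be faster.
    """
    return key in self.get_parent_map([key])


class IndexA:
    """Hypothetical index-like class sharing the common has_key."""

    def __init__(self, keys):
        self._keys = set(keys)

    def get_parent_map(self, keys):
        return dict((k, ()) for k in keys if k in self._keys)

    has_key = _mod_has_key


class IndexB:
    """Another index-like class; same function object assigned."""

    def __init__(self, keys):
        self._keys = set(keys)

    def get_parent_map(self, keys):
        return dict((k, ()) for k in keys if k in self._keys)

    has_key = _mod_has_key
```

Both classes end up with the identical function object, so a fix to the shared implementation applies everywhere at once.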

>> @@ -1330,7 +1339,17 @@
>>              # Raise an error when a record is missing.
>>              if record.storage_kind == 'absent':
>>                  raise RevisionNotPresent([record.key], self)
>> -            if record.storage_kind in knit_types:
>> +            elif ((record.storage_kind in knit_types)
>> +                  and (not parents
>> +                       or self._index.has_key(parents[0])
>> +                       or not self._fallback_vfs
>> +                       or not self.has_key(parents[0]))):
>> +                # we can insert the knit record literally if either it has no
>> +                # compression parent OR we already have its basis in this kvf
>> +                # OR the basis is not present even in the fallbacks.
> Since has_key costs a lot more than bool(self._fallback_vfs), that
> should come first.

This comment made me question whether the logic here is exactly
correct.  You're right that (X or Y) === (Y or X), but the reordered
form didn't match the intuition expressed in the comment.

We can insert the record literally if it's a knit record type and
either it has no compression parent, or the compression parent is in
this kvf, or the compression parent is not in any fallback kvf.  So we
can in fact commute the expression: if there are no fallback vfs, it's
always ok to insert, and then either the parent is present, or we'll
insert the parent later on in the stream, or we'll fail at the end.

I should also add a note that calling self.has_key is inefficient,
because it queries this vfs index again, when we only want to check
all the fallbacks.
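The commuted expression, with the cheap fallback check first, can be sketched like this -- a standalone function with hypothetical helper callables, not the real KnitVersionedFiles method:

```python
def can_insert_literally(parents, index_has_key, fallback_vfs,
                         combined_has_key):
    """Return True if a knit delta record can be inserted as-is.

    parents           -- compression parents of the record (possibly empty)
    index_has_key(k)  -- key present in this kvf's own index
    fallback_vfs      -- list of fallback versioned-files (may be empty)
    combined_has_key(k) -- key present here or in any fallback
    """
    if not parents:
        return True   # no compression parent at all
    if not fallback_vfs:
        return True   # not stacked: cheap check, so it comes first
    if index_has_key(parents[0]):
        return True   # basis is physically in this kvf already
    if not combined_has_key(parents[0]):
        return True   # basis nowhere yet: it may arrive later in the stream
    return False      # basis only in a fallback: must expand to fulltext
```

Note that in the real code, calling self.has_key for the last test is slightly wasteful, since it queries this kvf's own index again when only the fallbacks need checking.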

>> +            elif ((record.storage_kind in knit_types)
>> +                  and (not parents
>> +                       or not self._fallback_vfs
>> +                       or self._index.has_key(parents[0])
>> +                       or not self.has_key(parents[0]))):
>>                  # In the
>> +                # last case it will either turn up later in the stream and all
>> +                # will be well, or it won't turn up at all and we'll raise an
>> +                # error at the end.
> It would be nice to raise here, though.  Perhaps some time we can revise
> the stream format so we get all its metadata, or at least a list of
> keys, up-front.

That would be nice.  There's a fundamental tension in wanting to
stream everything on both the sender and receiver; the sender doesn't
want to generate a list of everything up front but it would certainly
help the receiver.  We had a similar discussion about it being useful
to know when all the data for one revision has been sent, so that the
receiver could eg checkpoint that into the repository by finishing a
write group.

> It also seems like this misses the case where the key is present in the
> basis *and* in the stream.  Consider keys A B and C, where C has basis
> B, and B has basis A.  Say we get a stream with C(delta), B(delta),
> A(fulltext), in that order.  This logic would convert C and B into
> fulltexts.

Right, this is the same case that John raised.

I was trying to avoid holding on to more than one record at a time --
I think the point of this interface is to avoid holding the whole
stream in memory.  That seems to imply that we must insert each record
as we see it either by expanding to a fulltext or inserting it as a
delta.  If we hold on to them to decide later, it may blow out memory.
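A sketch of that one-record-at-a-time policy (the record and stream shapes here are invented for illustration): each record is inserted as seen, expanded to a fulltext only when its basis lives solely in a fallback.  It also reproduces the C/B/A case Aaron describes, since C and B get expanded when their bases are fallback-only:

```python
def insert_record_stream(stream, local_keys, fallback_keys):
    """Insert (key, basis) records one at a time, never buffering.

    basis is the compression parent key, or None for a fulltext.
    Returns the list of (key, storage kind) actually inserted.
    """
    inserted = []
    for key, basis in stream:
        if basis is None or basis in local_keys or basis not in fallback_keys:
            # Safe to keep as a delta: the basis is local, or absent
            # everywhere (it may still turn up later in the stream).
            inserted.append((key, 'delta' if basis else 'fulltext'))
        else:
            # Basis only in a fallback: expand rather than hold the
            # record in memory, keeping peak memory bounded.
            inserted.append((key, 'fulltext'))
        local_keys.add(key)
    return inserted
```

With an empty fallback the C/B/A stream stays as deltas; with A and B already in a fallback, both C and B get expanded even though their bases arrive later in the same stream -- exactly the over-expansion Aaron points out.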

>>                  if record.storage_kind not in native_types:
>>                      try:
>>                          adapter_key = (record.storage_kind, "knit-delta-gz")
>> @@ -1358,15 +1377,19 @@
>>                  index_entry = (record.key, options, access_memo, parents)
>>                  buffered = False
>>                  if 'fulltext' not in options:
>> -                    basis_parent = parents[0]
> Is this change because you feel parents[0] is clearer than basis_parent?

It's because I need to start using parents[0] inside the if()
expression where I can't assign to a variable, and I wanted to make it
consistent throughout the function.

>> +                    # Not a fulltext, so we need to make sure the parent
>> +                    # versions will also be present.
>>                      # Note that pack backed knits don't need to buffer here
>>                      # because they buffer all writes to the transaction level,
>>                      # but we don't expose that difference at the index level. If
>>                      # the query here has sufficient cost to show up in
>>                      # profiling we should do that.
>> -                    if basis_parent not in self.get_parent_map([basis_parent]):
>> +                    #
>> +                    # They're required to be physically in this
>> +                    # KnitVersionedFiles, not in a fallback.
>> +                    if not self.has_key(parents[0]):
> if not self.missing_keys(parents) would also work nicely here.

Presumably you mean missing_keys([parents[0]]), and that's kind of an
example of the hazards of only having multi methods.

>> === modified file 'bzrlib/repofmt/knitrepo.py'
>> --- bzrlib/repofmt/knitrepo.py        2008-09-08 14:00:22 +0000
>> +++ bzrlib/repofmt/knitrepo.py        2008-11-14 05:09:33 +0000
>> @@ -124,9 +124,10 @@
>>          self._commit_builder_class = _commit_builder_class
>>          self._serializer = _serializer
>>          self._reconcile_fixes_text_parents = True
>> -        self._fetch_uses_deltas = True
>>          self._fetch_order = 'topological'
>> +    _fetch_uses_deltas = True
>> +
> What motivated this change?  Why change only _fetch_uses_deltas, not the
> other members?

It's left over from when I thought I'd fix this by forcing
_fetch_uses_deltas to False on stacked repositories.  I'm not sure if
it makes sense to be set per-instance in only some cases, but I'll put
it back.

>> === modified file 'bzrlib/tests/interrepository_implementations/test_fetch.py'
>> --- bzrlib/tests/interrepository_implementations/test_fetch.py        2008-08-10 07:00:59 +0000
>> +++ bzrlib/tests/interrepository_implementations/test_fetch.py        2008-11-14 07:44:57 +0000
>> @@ -101,6 +101,11 @@
>>          # generally do).
>>          try:
>>              to_repo.fetch(tree.branch.repository, 'rev-two')
>> +        except errors.BzrCheckError, e:
>> +            # Can also just raise a generic check errors; stream insertion
>> +            # does this to include all the missing data
>> +            self.assertRaises((errors.NoSuchRevision, errors.RevisionNotPresent),
>> +                              to_repo.revision_tree, 'rev-two')
> Why revision_tree, not get_revision?

It's copied from the clause just below; I should actually just combine
them into one except block.

Martin <http://launchpad.net/~mbp/>
