[merge][#288751] avoid creating text deltas spanning repository stacks
Aaron Bentley
aaron at aaronbentley.com
Wed Nov 19 15:51:42 GMT 2008
Martin Pool wrote:
> On Sat, Nov 15, 2008 at 2:30 AM, Aaron Bentley <aaron at aaronbentley.com> wrote:
>>> stacked repository. However, this was not checked properly.
>> This looks fine, so far. I think with these changes, "bzr pack" would
>> fix external references. Should bzr recommend this if it encounters
>> deltas that span repos?
>
> If pack fixes them, then it would make sense to recommend that people
> run it, but as John says this is not at present the case. It might be
> possible for it to be fixed to do so.
At minimum, we should implement this on reconcile, then. Possibly on
pack/autopack too. I guess I misread the patch.
>> Most of our APIs should be dealing with multiple texts, so their
>> implementations should be
>
> I agree. However, if the code in a particular place deals with one
> key at a time, I think it's clearer to have it say so directly, rather
> than using an API like get_parent_map that doesn't really express the
> right intention.
No disagreement here, BUT in those cases, we should also take a hard
look at WHY it's dealing with one key at a time, and fix it to deal with
multiple keys.
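Roughly, I mean the difference between these two shapes (the names
here are made up, purely for illustration):

    # Single-key style: one index round trip per key.
    for key in keys:
        if vf.has_key(key):
            process(key)

    # Batched style: one round trip for the whole set.
    # get_parent_map returns entries only for present keys.
    for key in vf.get_parent_map(keys):
        process(key)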
> * If we're looking for places to make things work on larger batches,
> it's arguably easier to do this if they're explicitly doing so,
"explicitly doing so" = explicitly operating on single entries?
> rather
> than calling a multi method with one value. For instance, you can
> grep for has_key() calls.
>
> * Simpler code is better.
>
> * Repetitive boilerplate risks errors; for instance, the thing I
> changed was checking the result of the returned dictionary, which is
> approximately but not quite the right assertion. If we change the
> semantics of these methods (e.g. John's recent thread about returning
> extra values), it can break things like this.
>
> Is this a better rationale?
Sure.
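To spell the boilerplate hazard out (sketch with made-up names):

    # Dict-membership form: approximately but not quite the
    # assertion we mean, and fragile if the method's return
    # value grows extra entries.
    if key not in vf.get_parent_map([key]):
        raise RevisionNotPresent([key], vf)

    # Direct form: says exactly what it means.
    if not vf.has_key(key):
        raise RevisionNotPresent([key], vf)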
>>> @@ -1330,7 +1339,17 @@
>>> # Raise an error when a record is missing.
>>> if record.storage_kind == 'absent':
>>> raise RevisionNotPresent([record.key], self)
>>> - if record.storage_kind in knit_types:
>>> + elif ((record.storage_kind in knit_types)
>>> + and (not parents
>>> + or self._index.has_key(parents[0])
>>> + or not self._fallback_vfs
>>> + or not self.has_key(parents[0]))):
>>> + # we can insert the knit record literally if either it has no
>>> + # compression parent OR we already have its basis in this kvf
>>> + # OR the basis is not present even in the fallbacks.
>> Since has_key costs a lot more than bool(self._fallback_vfs), that
>> should come first.
>
> This comment made me question whether the logic here is exactly
> correct. You're right in terms of X or Y === Y or X, but it didn't
> match my intuitive sense expressed in the comment.
I had the same concerns and I spent quite a while thinking the whole
condition through.
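For reference, the reordering amounts to moving the cheap attribute
test ahead of the index lookups; since the tests are side-effect-free
and chained with `or`, this changes only the cost, not the result:

    elif (record.storage_kind in knit_types
          and (not parents
               # cheap local attribute test first:
               or not self._fallback_vfs
               # then the potentially expensive lookups:
               or self._index.has_key(parents[0])
               or not self.has_key(parents[0]))):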
>> It also seems like this misses the case where the key is present in the
>> basis *and* in the stream. Consider keys A B and C, where C has basis
>> B, and B has basis A. Say we get a stream with C(delta), B(delta),
>> A(fulltext), in that order. This logic would convert C and B into
>> fulltexts.
>
> Right, this is the same case that John raised.
>
> I was trying to avoid holding on to more than one record at a time --
> I think the point of this interface is to avoid holding the whole
> stream in memory. That seems to imply that we must insert each record
> as we see it either by expanding to a fulltext or inserting it as a
> delta. If we hold on to them to decide later, it may blow out memory.
I'm not sure what to do about this. At first, I thought it was an
impossibly unlikely corner case, and now I'm not sure. Ideas?
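The only half-idea I have is to buffer just the records whose basis
hasn't been seen yet, which bounds memory to the out-of-order part of
the stream rather than the whole thing. Completely untested sketch
(insert() and stream are stand-ins):

    pending = {}  # missing basis key -> records waiting on it
    seen = set()

    def flush(key):
        # Insert everything that was waiting on `key`, transitively.
        for rec in pending.pop(key, []):
            insert(rec)
            seen.add(rec.key)
            flush(rec.key)

    for record in stream:
        if record.parents:
            basis = record.parents[0]
        else:
            basis = None
        if basis is not None and basis not in seen and not self.has_key(basis):
            pending.setdefault(basis, []).append(record)
        else:
            insert(record)
            seen.add(record.key)
            flush(record.key)
    # Anything still in `pending` here has a basis that is truly
    # absent, so it must be expanded to a fulltext (or rejected).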
>
>>> + # Not a fulltext, so we need to make sure the parent
>>> + # versions will also be present.
>>> # Note that pack backed knits don't need to buffer here
>>> # because they buffer all writes to the transaction level,
>>> # but we don't expose that difference at the index level. If
>>> # the query here has sufficient cost to show up in
>>> # profiling we should do that.
>>> - if basis_parent not in self.get_parent_map([basis_parent]):
>>> + #
>>> + # They're required to be physically in this
>>> + # KnitVersionedFiles, not in a fallback.
>>> + if not self.has_key(parents[0]):
>> if not self.missing_keys(parents) would also work nicely here.
>
> Presumably you mean missing_keys([parents[0]]), and that's kind of an
> example of the hazards of only having multi methods.
No, I meant what I wrote. You're doing: "Does this have a compression
parent? Is it present?". I'm doing: "Are all compression parents
present?" For current formats, they are equivalent, but mine is more
amenable to multi-parent formats.
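In code terms (assuming missing_keys() returns the subset of the given
keys that are absent, and with buffer_or_expand() standing in for
whatever the branch actually does):

    # Patch as written: assumes a single compression parent.
    if not self.has_key(parents[0]):
        buffer_or_expand(record)

    # Multi-parent form: any missing basis triggers the same path.
    if self.missing_keys(parents):
        buffer_or_expand(record)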
>>> === modified file 'bzrlib/tests/interrepository_implementations/test_fetch.py'
>>> --- bzrlib/tests/interrepository_implementations/test_fetch.py 2008-08-10 07:00:59 +0000
>>> +++ bzrlib/tests/interrepository_implementations/test_fetch.py 2008-11-14 07:44:57 +0000
>>> @@ -101,6 +101,11 @@
>>> # generally do).
>>> try:
>>> to_repo.fetch(tree.branch.repository, 'rev-two')
>>> + except errors.BzrCheckError, e:
>>> + # Can also just raise a generic check errors; stream insertion
>>> + # does this to include all the missing data
>>> + self.assertRaises((errors.NoSuchRevision, errors.RevisionNotPresent),
>>> + to_repo.revision_tree, 'rev-two')
>> Why revision_tree, not get_revision?
>
> It's copied from the clause just below; I should actually just combine
> them into one except block.
Cool.
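Presumably something like this (sketch; I'm guessing the neighbouring
clause catches errors.NoSuchRevision):

    try:
        to_repo.fetch(tree.branch.repository, 'rev-two')
    except (errors.BzrCheckError, errors.NoSuchRevision), e:
        # Either way the fetch noticed the missing data, so the
        # revision must not have made it into to_repo.
        self.assertRaises(
            (errors.NoSuchRevision, errors.RevisionNotPresent),
            to_repo.revision_tree, 'rev-two')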
Aaron