[merge][#288751] avoid creating text deltas spanning repository stacks
Aaron Bentley
aaron at aaronbentley.com
Wed Nov 19 15:51:42 GMT 2008
Martin Pool wrote:
> On Sat, Nov 15, 2008 at 2:30 AM, Aaron Bentley <aaron at aaronbentley.com> wrote:
>>> stacked repository. However, this was not checked properly.
>> This looks fine, so far. I think with these changes, "bzr pack" would
>> fix external references. Should bzr recommend this if it encounters
>> deltas that span repos?
>
> If pack fixes them, then it would make sense to recommend that people
> run it, but as John says this is not at present the case. It might be
> possible for it to be fixed to do so.
At minimum, we should implement this on reconcile, then. Possibly on
pack/autopack too. I guess I misread the patch.
>> Most of our APIs should be dealing with multiple texts, so their
>> implementations should be
>
> I agree. However, if the code in a particular place deals with one
> key at a time, I think it's clearer to have it say so directly, rather
> than using an API like get_parent_map that doesn't really express the
> right intention.
No disagreement here, BUT in those cases, we should also take a hard
look at WHY it's dealing with one key at a time, and fix it to deal with
multiple keys.
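Roughly, I mean the difference between these two shapes (the names
here are made up, purely for illustration):

    # Single-key style: one index round trip per key.
    for key in keys:
        if vf.has_key(key):
            process(key)

    # Batched style: one round trip for the whole set.
    # get_parent_map returns entries only for present keys.
    for key in vf.get_parent_map(keys):
        process(key)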
> * If we're looking for places to make things work on larger batches,
> it's arguably easier to do this if they're explicitly doing so,
"explicitly doing so" = explicitly operating on single entries?
> rather
> than calling a multi method with one value. For instance, you can
> grep for has_key() calls.
>
> * Simpler code is better.
>
> * Repetitive boilerplate risks errors; for instance, the thing I
> changed was checking the result of the returned dictionary, which is
> approximately but not quite the right assertion. If we change the
> semantics of these methods (e.g. John's recent thread about returning
> extra values), it can break things like this.
>
> Is this a better rationale?
Sure.
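To spell the boilerplate hazard out (sketch with made-up names):

    # Dict-membership form: approximately but not quite the
    # assertion we mean, and fragile if the method's return
    # value grows extra entries.
    if key not in vf.get_parent_map([key]):
        raise RevisionNotPresent([key], vf)

    # Direct form: says exactly what it means.
    if not vf.has_key(key):
        raise RevisionNotPresent([key], vf)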
>>> @@ -1330,7 +1339,17 @@
>>> # Raise an error when a record is missing.
>>> if record.storage_kind == 'absent':
>>> raise RevisionNotPresent([record.key], self)
>>> - if record.storage_kind in knit_types:
>>> + elif ((record.storage_kind in knit_types)
>>> + and (not parents
>>> + or self._index.has_key(parents[0])
>>> + or not self._fallback_vfs
>>> + or not self.has_key(parents[0]))):
>>> + # we can insert the knit record literally if either it has no
>>> + # compression parent OR we already have its basis in this kvf
>>> + # OR the basis is not present even in the fallbacks.
>> Since has_key costs a lot more than bool(self._fallback_vfs), that
>> should come first.
>
> This comment made me question whether the logic here is exactly
> correct. You're right in terms of X or Y === Y or X, but it didn't
> match my intuitive sense expressed in the comment.
I had the same concerns and I spent quite a while thinking the whole
condition through.
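For reference, the reordering amounts to moving the cheap attribute
test ahead of the index lookups; since the tests are side-effect-free
and chained with `or`, this changes only the cost, not the result:

    elif (record.storage_kind in knit_types
          and (not parents
               # cheap local attribute test first:
               or not self._fallback_vfs
               # then the potentially expensive lookups:
               or self._index.has_key(parents[0])
               or not self.has_key(parents[0]))):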
>> It also seems like this misses the case where the key is present in the
>> basis *and* in the stream. Consider keys A B and C, where C has basis
>> B, and B has basis A. Say we get a stream with C(delta), B(delta),
>> A(fulltext), in that order. This logic would convert C and B into
>> fulltexts.
>
> Right, this is the same case that John raised.
>
> I was trying to avoid holding on to more than one record at a time --
> I think the point of this interface is to avoid holding the whole
> stream in memory. That seems to imply that we must insert each record
> as we see it either by expanding to a fulltext or inserting it as a
> delta. If we hold on to them to decide later, it may blow out memory.
I'm not sure what to do about this. At first, I thought it was an
impossibly unlikely corner case, and now I'm not sure. Ideas?
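The only half-idea I have is to buffer just the records whose basis
hasn't been seen yet, which bounds memory to the out-of-order part of
the stream rather than the whole thing. Completely untested sketch
(insert() and stream are stand-ins):

    pending = {}  # missing basis key -> records waiting on it
    seen = set()

    def flush(key):
        # Insert everything that was waiting on `key`, transitively.
        for rec in pending.pop(key, []):
            insert(rec)
            seen.add(rec.key)
            flush(rec.key)

    for record in stream:
        if record.parents:
            basis = record.parents[0]
        else:
            basis = None
        if basis is not None and basis not in seen and not self.has_key(basis):
            pending.setdefault(basis, []).append(record)
        else:
            insert(record)
            seen.add(record.key)
            flush(record.key)
    # Anything still in `pending` here has a basis that is truly
    # absent, so it must be expanded to a fulltext (or rejected).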
>
>>> + # Not a fulltext, so we need to make sure the parent
>>> + # versions will also be present.
>>> # Note that pack backed knits don't need to buffer here
>>> # because they buffer all writes to the transaction level,
>>> # but we don't expose that difference at the index level. If
>>> # the query here has sufficient cost to show up in
>>> # profiling we should do that.
>>> - if basis_parent not in self.get_parent_map([basis_parent]):
>>> + #
>>> + # They're required to be physically in this
>>> + # KnitVersionedFiles, not in a fallback.
>>> + if not self.has_key(parents[0]):
>> if not self.missing_keys(parents) would also work nicely here.
>
> Presumably you mean missing_keys([parents[0]]), and that's kind of an
> example of the hazards of only having multi methods.
No, I meant what I wrote. You're doing: "Does this have a compression
parent? Is it present?". I'm doing: "Are all compression parents
present?" For current formats, they are equivalent, but mine is more
amenable to multi-parent formats.
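In code terms (assuming missing_keys() returns the subset of the given
keys that are absent, and with buffer_or_expand() standing in for
whatever the branch actually does):

    # Patch as written: assumes a single compression parent.
    if not self.has_key(parents[0]):
        buffer_or_expand(record)

    # Multi-parent form: any missing basis triggers the same path.
    if self.missing_keys(parents):
        buffer_or_expand(record)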
>>> === modified file 'bzrlib/tests/interrepository_implementations/test_fetch.py'
>>> --- bzrlib/tests/interrepository_implementations/test_fetch.py 2008-08-10 07:00:59 +0000
>>> +++ bzrlib/tests/interrepository_implementations/test_fetch.py 2008-11-14 07:44:57 +0000
>>> @@ -101,6 +101,11 @@
>>> # generally do).
>>> try:
>>> to_repo.fetch(tree.branch.repository, 'rev-two')
>>> + except errors.BzrCheckError, e:
>>> + # Can also just raise a generic check errors; stream insertion
>>> + # does this to include all the missing data
>>> + self.assertRaises((errors.NoSuchRevision, errors.RevisionNotPresent),
>>> + to_repo.revision_tree, 'rev-two')
>> Why revision_tree, not get_revision?
>
> It's copied from the clause just below; I should actually just combine
> them into one except block.
Cool.
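Presumably something like this (sketch; I'm guessing the neighbouring
clause catches errors.NoSuchRevision):

    try:
        to_repo.fetch(tree.branch.repository, 'rev-two')
    except (errors.BzrCheckError, errors.NoSuchRevision), e:
        # Either way the fetch noticed the missing data, so the
        # revision must not have made it into to_repo.
        self.assertRaises(
            (errors.NoSuchRevision, errors.RevisionNotPresent),
            to_repo.revision_tree, 'rev-two')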
Aaron