[merge][#288751] avoid creating text deltas spanning repository stacks
John Arbash Meinel
john at arbash-meinel.com
Wed Nov 19 16:24:29 GMT 2008
Aaron Bentley wrote:
> Martin Pool wrote:
>> On Sat, Nov 15, 2008 at 2:30 AM, Aaron Bentley <aaron at aaronbentley.com> wrote:
>
...
>
>>> It also seems like this misses the case where the key is present in the
>>> basis *and* in the stream. Consider keys A, B, and C, where C has basis
>>> B, and B has basis A. Say we get a stream with C(delta), B(delta),
>>> A(fulltext), in that order. This logic would convert C and B into
>>> fulltexts.
>> Right, this is the same case that John raised.
>>
>> I was trying to avoid holding on to more than one record at a time --
>> I think the point of this interface is to avoid holding the whole
>> stream in memory. That seems to imply that we must insert each record
>> as we see it, either by expanding it to a fulltext or inserting it as a
>> delta. If we hold on to them to decide later, it may blow out memory.
>
> I'm not sure what to do about this. At first, I thought it was an
> impossibly-unlikely corner case, and now I'm not sure. Ideas?
At this exact time, it is "slightly" common. The pack code generally
doesn't care about ordering, so it transmits things in whatever order it
considers 'optimal' in terms of number of round-trips, etc. It could do
slightly better and ask "what is optimal in terms of round-trips, lightly
constrained by the ancestry", rather than using the sort ordering of
dicts/sets. (I want to read bytes 10-20 from pack A, and 15-30 from pack
B; should I read pack A or pack B first?)
Because of the random ordering, we can probably easily hit cases where
deltas are in (reverse?) topological order.
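To make that concrete, here is a minimal, self-contained sketch of the
failure mode Aaron described (hypothetical names, not the actual bzrlib
code): when each record must be inserted as it arrives, a delta whose
compression parent is not yet present gets expanded to a fulltext.

  present = set()     # keys already physically in this repository

  def insert_record(key, compression_parent, kind):
      # Expand to a fulltext when the delta's basis isn't inserted yet.
      if kind == 'delta' and compression_parent not in present:
          kind = 'fulltext'
      present.add(key)
      return key, kind

  # Stream order chosen for round-trips, not ancestry.
  stream = [('C', 'B', 'delta'), ('B', 'A', 'delta'),
            ('A', None, 'fulltext')]
  print([insert_record(*rec) for rec in stream])
  # => [('C', 'fulltext'), ('B', 'fulltext'), ('A', 'fulltext')]
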
I'll also note that "bzr pack" doesn't change the sort order for
inventories or file texts, only for revisions (it puts the newest
revision first). At some point we would want 'bzr pack' to re-evaluate
the texts and insert them in the "best" order, which is likely to be
fulltext then deltas, constrained so that newer revisions come earlier
in the file (e.g. reverse topological overall, but topological in the
short term).
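As an illustration of that ordering (a sketch with made-up chain data,
not the pack code itself): "reverse topological overall" means newest
chains first, while "topological in the short term" means each fulltext
comes before the deltas built on it.

  chains = {
      'chain-1': ['A', 'B', 'C'],   # A is the fulltext basis
      'chain-2': ['D', 'E'],
  }
  newest = {'chain-1': 3, 'chain-2': 5}   # newest revision per chain

  def pack_order(chains, newest):
      order = []
      # Newest chains first, each chain emitted basis-first.
      for name in sorted(chains, key=newest.get, reverse=True):
          order.extend(chains[name])
      return order

  print(pack_order(chains, newest))   # ['D', 'E', 'A', 'B', 'C']
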
As I understand stacked branches, the idea is that we aren't pushing up
a lot of ancestry anyway. So we'll probably get it wrong in a few
places, and store more than is necessary, but not critically so.
I guess the other problem is that we preserve what we find in most
cases, so merging a stacked branch into trunk will bring in the texts
*as* fulltexts, and not shrink them, even though we have the information
to do so.
Again, this is a reason to make "bzr pack" do more work and allow it to
generate different deltas, etc., to make things more optimal.
>
>>>> + # Not a fulltext, so we need to make sure the parent
>>>> + # versions will also be present.
>>>> # Note that pack backed knits don't need to buffer here
>>>> # because they buffer all writes to the transaction level,
>>>> # but we don't expose that difference at the index level. If
>>>> # the query here has sufficient cost to show up in
>>>> # profiling we should do that.
>>>> - if basis_parent not in self.get_parent_map([basis_parent]):
>>>> + #
>>>> + # They're required to be physically in this
>>>> + # KnitVersionedFiles, not in a fallback.
>>>> + if not self.has_key(parents[0]):
>>> if not self.missing_keys(parents) would also work nicely here.
>> Presumably you mean missing_keys([parents[0]]), and that's kind of an
>> example of the hazards of only having multi methods.
>
> No, I meant what I wrote. You're doing: "Does this have a compression
> parent? Is it present?". I'm doing: "Are all compression parents
> present?" For current formats, they are equivalent, but mine is more
> amenable to multi-parent formats.
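For what it's worth, the difference between the two checks can be
sketched with standalone stand-ins (not the real KnitVersionedFiles
methods):

  present = set(['A', 'B'])     # keys physically in this store

  def has_key(key):
      return key in present

  def missing_keys(keys):
      return set(keys) - present

  parents = ('A', 'B')          # compression parents of some delta
  has_key(parents[0])           # True: checks only the first parent
  missing_keys(parents)         # set(): checks every compression parent
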
Actually, you brought up an interesting point, and both of you are
slightly wrong. The issue is that we have a separate graph with
"compression_parent", which is not identical to the left-hand file-graph
ancestor in all cases.
The only time they differ is during "bzr reconcile" (or Vincent's
direct patch to the mysql codebase). If reconcile finds that the
per-file ancestors are "swapped", it will swap them back without
changing the file's compression parent (since otherwise it would have to
extract the fulltext and insert it back).
I found a bug in the annotate code (it assumed compression_parent ==
parents[0]) and this seems to be reproducing the same bug.
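A tiny sketch of the divergence (illustrative record layout, not
bzrlib's actual objects): after reconcile swaps the per-file parents,
the logical parents no longer lead with the delta basis.

  record = {
      'key': 'C',
      'parents': ('B', 'A'),       # per-file graph, fixed by reconcile
      'compression_parent': 'A',   # delta basis, left unchanged on disk
  }
  # The annotate bug assumed record['parents'][0] was always the basis.
  assert record['compression_parent'] != record['parents'][0]
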
Unfortunately, looking at the KnitContentFactory() object, there is no
record of the compression parent as a separate value from "self.parents".
Likewise, KnitVersionedFile.get_record_stream() consults
get_build_details(), but then throws that information away by stripping
it down to just a "positions[key]" dictionary.
So it would seem that the generic fetch code causes corruption on
reconciled trees, because it will fetch the knit delta and then change
the compression parent (because it doesn't transmit the actual value).
That is a separate bug, but a serious one nonetheless.
Anyway, I was hoping to say "use compression_parent", and multi-parent
formats will just have a list instead of a single reference. (Arguably
we should upgrade it to a list now, and just allow it to hold a single
value.)
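Something along these lines is what I have in mind (names are
illustrative, not the actual bzrlib classes): the record carries its
compression parents explicitly, as a list so multi-parent formats fit.

  class RecordFactory(object):
      def __init__(self, key, parents, compression_parents):
          self.key = key
          self.parents = parents    # logical file-graph parents
          # Delta basis (or bases); [] would mean a fulltext.
          self.compression_parents = compression_parents

  rec = RecordFactory('C', ('B', 'A'), ['A'])   # plain knit delta
  # A multi-parent format would pass several keys instead of one.
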
It does seem that our get/insert record stream needs to be fixed anyway.
John
=:->