stacking and fetching bugs, dev6 issues

John Arbash Meinel john at arbash-meinel.com
Sat May 9 01:32:48 BST 2009



...

>>             resume_tokens, missing_keys = self.sink.insert_stream(
>>                 stream, from_format, [])
>>             if self.to_repository._fallback_repositories:
>>                 missing_keys.update(
>>                     self._parent_inventories(search.get_keys()))
>>
>> However, from what I can tell it
>>   a) Always does this, even for non-stacked repositories, meaning we
>>      now add duplicate inventory texts. Even further, we now add
>>      duplicate *fulltexts* for these inventories.
>>      This isn't broken in the sense that you'll miss data when you
>>      are done, but it *is* broken in the sense that you now copy far
>>      more data than you need to. I'm guessing this actually might be
>>      the reason for some of the large fetches I've seen recently.
>>      (3 small revs now also fetch the complete fulltext chain for all
>>      ancestors, even though all of those revisions are already
>>      present in the target repo.)
> 
> That would be a bug. Note that missing_keys is set by the return value
> from insert_stream, and that it gets its value by asking the knits for
> unsatisfied deltas and missing adjacent inventories. Neither should
> return anything in an unstacked repo.

Well, I missed the part where it says
"self.to_repository._fallback_repositories", but it still remains that
RepoFetcher updates missing_keys directly, rather than having
'StreamSink.insert_stream' return them.

So I guess that with stacked repos it will just do the wrong thing on
the second push, while for non-stacked repos the parent inventories
simply aren't filled in.
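
To make (a) concrete, the small change I have in mind is roughly the
guard below. This is only a sketch: I'm assuming the missing_keys
entries are ('inventories', rev_id) tuples, and that get_parent_map()
on the inventories VersionedFiles works as a presence check (it would
also have to consult only the repo's own packs, not its fallbacks, for
the stacking invariant to hold):

    # Sketch: only request parent inventories the stacked target is
    # actually missing, so a second push doesn't re-copy fulltexts.
    if self.to_repository._fallback_repositories:
        parent_inv_keys = self._parent_inventories(search.get_keys())
        # get_parent_map only returns entries for keys it can find, so
        # anything absent from the result is genuinely missing.
        present = self.to_repository.inventories.get_parent_map(
            [(rev_id,) for _, rev_id in parent_inv_keys])
        missing_keys.update(key for key in parent_inv_keys
                            if (key[1],) not in present)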


> 
>>   b) It depends on the first 'insert_stream' to not actually commit
>>      the incomplete repository (because it expects there will be
>>      missing keys due to compression parents). However, for dev6
>>      repositories there will be no incomplete texts. Which means that
>>      instead we commit the required texts for the given revisions,
>>      and then 'come back later', and fill in the fulltexts for other
>>      inventories.
> 
> No, it doesn't depend on that; adjacent inventories are not incomplete
> in 1.9 repos either; all it depends on is suspending the write group.
> 

But the write group is *only* suspended if there are missing compression
parents. Otherwise at the end of 'insert_stream' it will have committed
the new pack.

And since the _parent_inventories check is done outside of
'insert_stream' ....
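
Schematically, what I believe the sink ends up doing today. This is my
paraphrase, not the literal code; the method and helper names here are
stand-ins:

    def insert_stream_end(self):
        # Paraphrase of how StreamSink.insert_stream finishes today,
        # as I understand it:
        missing_keys = self._missing_compression_parents()  # stand-in name
        if missing_keys:
            # Incomplete texts: keep the write group pending, and hand
            # back resume tokens so the caller can fill in the gaps.
            resume_tokens = self.target_repo.suspend_write_group()
            return resume_tokens, missing_keys
        # No missing compression parents -- always the case for dev6 --
        # so the new pack is committed right here, *before* RepoFetcher
        # ever asks about parent inventories.
        self.target_repo.commit_write_group()
        return [], set()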


>> (a) is something that I think we just need a generic fix for. I
>> think it could be something as simple as:
> ...
>> I'd like Robert or Andrew to confirm, though.
> 
> Please write a test/code we can play with to demonstrate it. In theory
> you are wrong, but in practice you may be right. We should confirm it
> before adding code that isn't needed.
> 
>> for (b) this is a bit trickier, but it also seems like a generic
>> problem for current formats. It just 'happens' that we rarely will
>> commit the data before filling in the extra inventories. But
>> certainly one could construct a situation where the first insert
>> doesn't have any missing compression parents, and then a user does
>> ^C at the right time. (Consider the case of a merge commit that
>> doesn't introduce anything new, and ends on a delta boundary for the
>> inventory. The new inventory is a fulltext, but it then has to
>> upload the fulltext for both merge parents, which have long delta
>> chains, giving plenty of time for ^C.)
> 
> Again, we don't commit, so the data isn't visible.
> 

Yes, we do. Now, if the "missing_keys.update(_parent_inventories())"
call were inside 'StreamSink.insert_stream', and it chose that moment
to 'suspend_write_group', then I would agree with you. But that, I
believe, is exactly what I'm saying we need for (b). (a) then becomes
just a small change to avoid copying extra inventories for stacked
branches that already have some data.
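
Roughly, the shape of the change I mean. Again just a sketch, and per
Robert's constraints below the sink can't be handed the revision list
up front, so revision_keys would have to be accumulated from the
stream itself:

    def insert_stream_end(self, revision_keys):
        # Sketch of the (b) fix: decide about parent inventories
        # *inside* insert_stream, before choosing between suspend and
        # commit. revision_keys was accumulated while inserting.
        missing_keys = self._missing_compression_parents()  # stand-in name
        if self.target_repo._fallback_repositories:
            missing_keys.update(self._parent_inventories(revision_keys))
        if missing_keys:
            # Never commit a pack that is missing the parent
            # inventories a stacked repository is required to carry.
            resume_tokens = self.target_repo.suspend_write_group()
            return resume_tokens, missing_keys
        self.target_repo.commit_write_group()
        return [], set()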

I'm not really sure how to test this, since the various layers hide
the doubled data... Any quick ideas?

>> Now for stacking dev6, I can take advantage of
>> 'get_stream_for_missing_keys()' to change the logic around inventories
>> to copy all referenced chk pages. So at least I have a way to detect
>> when I need to transmit all references, versus just the minimal set.
> 
> Yes. Feel free to improve the interface to meet your needs. (Note that
> you may need to make network API changes in that case, and Andrew or I
> will be delighted to describe how and where to do that).
> 
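
For the dev6 side, I picture the source hook looking something like
this. Purely a sketch: _all_chk_pages_for is a hypothetical helper
that would walk the id_to_entry and parent_id_basename roots down to
their leaf pages, and the substream names just mirror what
insert_stream already dispatches on:

    def get_stream_for_missing_keys(self, missing_keys):
        # Sketch: for dev6, a missing ('inventories', rev_id) key means
        # we must send every chk page that inventory references, not
        # just a minimal delta set.
        inv_keys = [(key[1],) for key in missing_keys
                    if key[0] == 'inventories']
        yield ('inventories',
               self.from_repository.inventories.get_record_stream(
                   inv_keys, 'unordered', True))
        chk_keys = self._all_chk_pages_for(inv_keys)  # hypothetical
        yield ('chk_bytes',
               self.from_repository.chk_bytes.get_record_stream(
                   chk_keys, 'unordered', True))
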
>> However, it certainly feels like some of this should be pushed down
>> into
>> StreamSource/StreamSink, rather than existing in RepoFetcher.
> 
> SS/SS/RF are a cluster working together; feel free to move stuff
> around as needed. The design constraints are:
> 
> SSource cannot access the RF or the SSink, and may only know static
> request parameters (such as the SSink format, revisions wanted, etc.).
> 
> SSink cannot access the RF or the SSource, and may only know static
> parameters, and not anything that we may be streaming to find (it
> can't know the full list of revisions a priori).
> 
> SSource and SSink may be remoted. RF is always local. So any methods
> that RF calls on either need to be suitable for calling over the
> network - which means that the repositories they are called on must
> be presumed to be 'raw' - that is, never configured as stacking.
> 
> -Rob

K, I'll try to keep that in mind.
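
As I read those constraints, in skeleton form (class names real,
docstrings my own summary):

    class StreamSource(object):
        """May be remoted. Sees only static request parameters (the
        sink's format, the revisions wanted); never touches RepoFetcher
        or the StreamSink directly."""

    class StreamSink(object):
        """May be remoted. Sees only static parameters; it cannot know
        the full revision list a priori -- that arrives in the
        stream."""

    class RepoFetcher(object):
        """Always local. Anything it calls on the source or sink must
        be network-safe and must treat the repositories as 'raw', i.e.
        never configured as stacking."""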

John
=:->