stacking and fetching bugs, dev6 issues
Robert Collins
robert.collins at canonical.com
Fri May 8 22:45:39 BST 2009
On Fri, 2009-05-08 at 14:30 -0500, John Arbash Meinel wrote:
>
> Note, even though I pushed this up with bzr.dev, I'm getting:
>
>     ErrorFromSmartServer: Error received from smart server:
>     ('NoSuchRevision',)
>
> when trying to branch it. So we still have some issues with stacking
> on Launchpad.
That's because Launchpad hasn't applied the cherrypicks yet, AFAIK. In
particular, the NoSuchRevision one is
https://bugs.edge.launchpad.net/launchpad-code/+bug/360791.
> 1) Groupcompress doesn't allow deltas between groups, much less
> between pack files, and even less so between repositories. That's a
> good thing as far as stacking is generally concerned, but it means
> that these repositories don't have missing compression parents to
> indicate they are not complete. Which means certain stacking tests
> fail, because they expect you to be able to create an uncommittable
> stream. So for now, I added:
>
>     RepositoryFormat._deltas_across_repos
>
> to indicate when the format doesn't have the capability for deltas
> across repos, so we can skip those tests. Does this seem reasonable?
I think that flag is misnamed, as 1.9 doesn't delta across repos
either: those tests are ensuring that no format does that. Rather, I
think the tests should be altered to detect that the repo refused to be
put into that situation, and skip/return when it does. We should have a
single 'it can be put into that situation' test outside the interface
tests.
> 2) The ability to 'fill in' parent inventories is actually being
> handled in fetch.RepoFetcher._fetch_everything_for_search. Namely it
> is doing:
>
>     resume_tokens, missing_keys = self.sink.insert_stream(
>         stream, from_format, [])
>     if self.to_repository._fallback_repositories:
>         missing_keys.update(
>             self._parent_inventories(search.get_keys()))
>
> However, from what I can tell it
> a) Always does this, even for non-stacked repositories, meaning we
>    now add duplicate inventory texts. Even further, we now add
>    duplicate *fulltexts* for these inventories.
>    This isn't broken in the sense that you'll miss data when you are
>    done, but it *is* broken in the sense that we now copy far more
>    data than we need to. I'm guessing this actually might be the
>    reason for some of the large fetches I've seen recently. (3 small
>    revs now also fetch the complete fulltext chain for all
>    ancestors, even though all of those revisions are already present
>    in the target repo.)
That would be a bug. Note that missing_keys is set by the return value
from insert_stream, and that it gets its value by asking the knits for
unsatisfied deltas and missing adjacent inventories. Neither should
return anything in an unstacked repo.
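For reference, the sink computes missing_keys roughly like this
(paraphrasing StreamSink.insert_stream from memory, where repo is the
target repository; treat the details as an approximation rather than
the exact code):

    # After inserting the stream, ask each versioned-files store which
    # compression parents it is still waiting for; an unstacked 1.9
    # fetch should leave all of these empty.
    missing_keys = set()
    for prefix, versioned_files in (
            ('texts', repo.texts),
            ('inventories', repo.inventories),
            ('revisions', repo.revisions),
            ('signatures', repo.signatures),
            ):
        missing_keys.update((prefix,) + key for key in
            versioned_files.get_missing_compression_parent_keys())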
> b) It depends on the first 'insert_stream' to not actually commit
>    the incomplete repository (because it expects there will be
>    missing keys due to compression parents). However, for dev6
>    repositories there will be no incomplete texts. Which means that
>    instead we commit the required texts for the given revisions, and
>    then 'come back later' and fill in the fulltexts for other
>    inventories.
No, it doesn't depend on that; adjacent inventories are not incomplete
in 1.9 repos either; all it depends on is suspending the write group.
> (a) is something that I think we just need a generic fix for. I
> think it could be something as simple as:
...
>
> I'd like Robert or Andrew to confirm, though.
Please write a test/code we can play with to demonstrate it. In theory
you are wrong, but in practice you may be right. We should confirm it
before adding code that isn't needed.
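Something along these lines might be a starting point for that
demonstration. It is only a sketch, and the use of the internal
_pack_collection attribute is an assumption about pack repository
internals:

    from bzrlib import tests

    class TestRefetchCopiesNothing(tests.TestCaseWithTransport):

        def test_second_fetch_creates_no_pack(self):
            tree = self.make_branch_and_tree('source', format='1.9')
            revid = tree.commit('one')
            target = self.make_repository('target', format='1.9')
            target.fetch(tree.branch.repository, revision_id=revid)
            target.lock_read()
            try:
                before = target._pack_collection.names()
            finally:
                target.unlock()
            # Everything is already present, so this second fetch
            # should be a no-op; if it writes a new pack, we are
            # re-copying data we already have.
            target.fetch(tree.branch.repository, revision_id=revid)
            target.lock_read()
            try:
                after = target._pack_collection.names()
            finally:
                target.unlock()
            self.assertEqual(before, after)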
> for (b) this is a bit trickier, but it also seems like a generic
> problem for current formats. It just 'happens' that we rarely will
> commit the data before filling in the extra inventories. But
> certainly one could construct a situation where the first insert
> doesn't have any missing compression parents, and then a user does
> ^C at the right time. (Consider the case of a merge commit that
> doesn't introduce anything new, and ends on a delta boundary for the
> inventory. The new inventory is a fulltext, but it then has to
> upload the fulltext for both merge parents, which have long delta
> chains, giving plenty of time for ^C.)
Again, we don't commit, so the data isn't visible.
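In other words, the safety comes from the write group protocol, not
from compression parents. A rough sketch of the suspend/resume dance
(the stream variables here are placeholders, not code from fetch.py):

    # First pass: insert what we have, then suspend rather than
    # commit, so nothing becomes visible to readers.
    target.lock_write()
    target.start_write_group()
    try:
        target.texts.insert_record_stream(text_stream)
        tokens = target.suspend_write_group()
    finally:
        target.unlock()

    # Later, possibly after a ^C and retry: resume, fill in the
    # parent inventories, and only then commit, making the whole
    # fetch visible atomically.
    target.lock_write()
    try:
        target.resume_write_group(tokens)
        target.inventories.insert_record_stream(parent_inv_stream)
        target.commit_write_group()
    finally:
        target.unlock()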
>
> Now for stacking dev6, I can take advantage of
> 'get_stream_for_missing_keys()' to change the logic around inventories
> to copy all referenced chk pages. So at least I have a way to detect
> when I need to transmit all references, versus just the minimal set.
Yes. Feel free to improve the interface to meet your needs. (Note that
you may need to make network API changes in that case, and Andrew or I
will be delighted to describe how and where to do that).
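For instance, a StreamSource subclass for the chk formats might look
roughly like this; the page-walking helper is hypothetical, since the
real traversal would live in bzrlib.chk_map:

    from bzrlib.repository import StreamSource

    class CHKStreamSource(StreamSource):

        def get_stream_for_missing_keys(self, missing_keys):
            # Pass everything through as before...
            for kind, substream in StreamSource\
                    .get_stream_for_missing_keys(self, missing_keys):
                yield kind, substream
            # ...then also stream every chk page the filled-in
            # inventories reference.
            inv_keys = [key[1:] for key in missing_keys
                        if key[0] == 'inventories']
            if inv_keys:
                chk_keys = self._chk_pages_for(inv_keys)
                yield ('chk_bytes',
                       self.from_repository.chk_bytes.get_record_stream(
                           chk_keys, 'unordered', True))

        def _chk_pages_for(self, inventory_keys):
            # Hypothetical helper: walk each inventory's id_to_entry
            # and parent_id_basename maps, accumulating page keys.
            raise NotImplementedError(self._chk_pages_for)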
> However, it certainly feels like some of this should be pushed down
> into StreamSource/StreamSink, rather than existing in RepoFetcher.
SSource/SSink/RF are a cluster working together; feel free to move
stuff around as needed. The design constraints are:

 * SSource cannot access the RF or the SSink, and may only know static
   request parameters (such as the SSink format, revisions wanted,
   etc).
 * SSink cannot access the RF or the SSource, and may only know static
   parameters, and not anything that we may be streaming to find (it
   can't know the full list of revisions a priori).
 * SSource and SSink may be remoted; RF is always local. So any
   methods that RF calls on either need to be suitable for calling
   over the network, which means that the repositories they are called
   on must be presumed to be 'raw', that is, never configured for
   stacking.
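Under those constraints the fetch logic reduces to roughly this shape
(illustrative only, using the current private helper names from
memory):

    # RF runs locally; source and sink may each be remote proxies, so
    # only static parameters and streams cross this boundary.
    def fetch_everything(from_repository, to_repository, search):
        source = from_repository._get_source(to_repository._format)
        sink = to_repository._get_sink()
        stream = source.get_stream(search)
        resume_tokens, missing_keys = sink.insert_stream(
            stream, from_repository._format, [])
        if missing_keys:
            stream = source.get_stream_for_missing_keys(missing_keys)
            sink.insert_stream(
                stream, from_repository._format, resume_tokens)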
-Rob