stacking and fetching bugs, dev6 issues

John Arbash Meinel john at arbash-meinel.com
Fri May 8 20:30:58 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm trying to work out the 'correct' way to handle some of the stacking
requirements for --dev6 repositories. In the process, I think I've
uncovered some issues with existing fetch code, and I wanted to check
with people.

For starters, I have a branch here, which adds
'supports_external_references=True' for dev6 and a passing test suite:
  lp:~jameinel/bzr/1.15-gc-stacking

Note, even though I pushed this up with bzr.dev, I'm getting:
ErrorFromSmartServer: Error received from smart server: ('NoSuchRevision',)
when trying to branch it. So we still have some issues with stacking on
Launchpad.

1) Groupcompress doesn't allow deltas between groups, much less between
pack files, and even less so between repositories. That's a good thing as
far as stacking is generally concerned, but it means that these
repositories never have missing compression parents to indicate they are
incomplete. As a result, certain stacking tests fail, because they expect
to be able to create an uncommittable stream. So for now, I added:

  RepositoryFormat._deltas_across_repos

To indicate when the format doesn't have the capability for deltas
across repos, so we can skip those tests. Does this seem reasonable?
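To make the idea concrete, here is a minimal sketch of how such a
capability flag could be consulted by the test suite. The class names and
the helper function are stand-ins for illustration, not real bzrlib code;
only the `_deltas_across_repos` attribute name comes from the proposal
above:

```python
# Hypothetical sketch: a format capability flag consulted by tests.
# RepositoryFormatDev6 and should_run_uncommittable_stream_test are
# illustrative stubs, not actual bzrlib classes.

class RepositoryFormat(object):
    # Existing formats can delta against texts outside the repository,
    # so an insert may leave dangling compression parents.
    _deltas_across_repos = True


class RepositoryFormatDev6(RepositoryFormat):
    # Groupcompress never deltas outside a group, so a stream can
    # always be committed without missing compression parents.
    _deltas_across_repos = False


def should_run_uncommittable_stream_test(fmt):
    # Tests that rely on creating an uncommittable stream only make
    # sense when cross-repository deltas are possible.
    return fmt._deltas_across_repos
```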


2) The ability to 'fill in' parent inventories is actually being handled
in fetch.RepoFetcher._fetch_everything_for_search. Namely it is doing:
            resume_tokens, missing_keys = self.sink.insert_stream(
                stream, from_format, [])
            if self.to_repository._fallback_repositories:
                missing_keys.update(
                    self._parent_inventories(search.get_keys()))

However, from what I can tell it
  a) Always does this, even for non-stacked repositories, meaning we now
     add duplicate inventory texts. Even worse, we now add duplicate
     *fulltexts* for these inventories.
     This isn't broken in the sense that you'll be missing data when you
     are done, but it *is* broken in the sense that you now copy far more
     data than you need to. I'm guessing this may actually be the reason
     for some of the large fetches I've seen recently. (Fetching 3 small
     revs now also fetches the complete fulltext chain for all ancestors,
     even though all of those revisions are already present in the
     target repo.)
  b) It depends on the first 'insert_stream' to not actually commit the
     incomplete repository (because it expects there will be missing
     keys due to compression parents). However, for dev6 repositories
     there will be no incomplete texts. Which means that instead we
     commit the required texts for the given revisions, and then 'come
     back later', and fill in the fulltexts for other inventories.

(a) is something that I think we just need a generic fix for. I think it
could be something as simple as:

=== modified file 'bzrlib/fetch.py'
--- bzrlib/fetch.py     2009-05-01 07:59:02 +0000
+++ bzrlib/fetch.py     2009-05-08 19:19:31 +0000
@@ -181,6 +181,11 @@
         map(parents.update, parent_maps.itervalues())
         parents.discard(NULL_REVISION)
         parents.difference_update(revision_ids)
+        # Now we have the set of parents that are just at the edge for all
+        # of these entries. Cull out parents that are already present in
+        # the target
+        present_parents = self.to_repository.get_parent_map(parents)
+        parents.difference_update(present_parents)
         missing_keys = set(('inventories', rev_id) for rev_id in parents)
         return missing_keys
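The culling step works because get_parent_map only returns entries for
revisions the target actually has, so its keys can be subtracted directly
from the candidate set. A self-contained sketch of the logic, with
TargetRepo as a stub standing in for the real to_repository:

```python
# Illustrative stand-in for the culling logic in the diff above.
# TargetRepo and missing_parent_inventories are stubs for this example.

NULL_REVISION = 'null:'


class TargetRepo(object):
    def __init__(self, present):
        self._present = set(present)

    def get_parent_map(self, revision_ids):
        # The real API returns {rev_id: parent_ids}; revisions missing
        # from the repository are simply absent from the result.
        return dict((r, ()) for r in revision_ids if r in self._present)


def missing_parent_inventories(target, revision_ids, parents):
    parents = set(parents)
    parents.discard(NULL_REVISION)
    # Drop the revisions being fetched themselves.
    parents.difference_update(revision_ids)
    # Cull parents whose inventories the target already has.
    parents.difference_update(target.get_parent_map(parents))
    return set(('inventories', rev_id) for rev_id in parents)
```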


Now, for a stacked repository, this actually is still missing something.
Because in that case, we only need to transmit inventories that aren't
present, and we may have already filled in some of them in previous
passes (but did not transmit the revisions themselves).

In that case, we need to map it back into keys, and use:

present_inventory_keys = self.to_repository.inventories.get_parent_map(
    [(p,) for p in parents])
parents.difference_update([k[0] for k in present_inventory_keys])
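The wrinkle here is just the key-tuple round trip: VersionedFiles keys are
1-tuples, so revision ids have to be wrapped before the lookup and
unwrapped afterwards. A minimal sketch, with InventoryStore as a stub for
to_repository.inventories:

```python
# Sketch of the tuple-key round trip. InventoryStore is a stub for this
# example, not the real bzrlib VersionedFiles implementation.

class InventoryStore(object):
    def __init__(self, present):
        self._present = set(present)

    def get_parent_map(self, keys):
        # Keys are 1-tuples of revision ids; only present keys appear
        # in the result, mirroring the real API's behaviour.
        return dict((k, ()) for k in keys if k[0] in self._present)


def cull_present_inventories(store, parents):
    parents = set(parents)
    # Wrap ids into 1-tuple keys for the lookup...
    present = store.get_parent_map([(p,) for p in parents])
    # ...and unwrap the returned keys before subtracting.
    parents.difference_update(k[0] for k in present)
    return parents
```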

I'd like Robert or Andrew to confirm, though.

For (b), this is a bit trickier, but it also seems like a generic problem
for current formats. It just happens that we rarely commit the data
before filling in the extra inventories. But one could certainly
construct a situation where the first insert has no missing compression
parents, and then a user does ^C at the right time.
(Consider the case of a merge commit that doesn't introduce anything
new, and ends on a delta boundary for the inventory. The new inventory
is a fulltext, but we then have to upload the fulltexts for both merge
parents, which have long delta chains, giving plenty of time for ^C.)


Now for stacking dev6, I can take advantage of
'get_stream_for_missing_keys()' to change the logic around inventories
to copy all referenced chk pages. So at least I have a way to detect
when I need to transmit all references, versus just the minimal set.


However, it certainly feels like some of this should be pushed down into
StreamSource/StreamSink, rather than existing in RepoFetcher.

Thoughts?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoEiHIACgkQJdeBCYSNAAPX+ACg0CKZNY1RZpcJiX/4sC8aqr9M
peIAoI5lQSeA18Jhw71bmbcnf0SdePd3
=t79A
-----END PGP SIGNATURE-----
