[MERGE] Allow stacking for --dev6

Tue May 12 22:11:24 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>> This does:
>>
>>   keys = [key for key, parents in graph.iter_ancestry(heads)
>>           if key != NULL_REVISION]
> 
> (Specifically, PendingAncestryResult's get_keys does this.  SearchResult's
> get_keys just has a pre-calculated list that it was constructed with.)
> 
>> However, that ignores the fact that 'iter_ancestry()' returns ghosts.
>> You can find them by checking "parents is None".
> 
> Yes, I've been surprised by this too.  I think it would be reasonable to add
> “and parents is not None” to that filter.  In fact I'm fairly sure I
> considered doing that at one point, I can't remember why I didn't... perhaps
> just because I didn't need to?

So I've done this. And for now, I'm just adding a call to
get_parent_map() rather than tracking 'excluded_keys' in the Result
objects. However, I've run into yet-another thing.

Specifically, Andrew's new _KeyRefs and associated
get_missing_parent_inventories code.

His code uses _KeyRefs to find the 'edge' inventories. The inventories
*in* the stacked repository, whose parent revisions are not in the
repository, and their inventories are not in the repository. All of that
seems to have been translated across fine.

His next step is to then call 'fileids_altered_by_revision_ids' on
'edge' revisions. And if there are any text keys which are mentioned
that are not present, then we must attempt to fetch the inventories.

However, there are a few assumptions made, which I don't think are
actually relevant. The biggest one, is that
'Repository.fileids_altered_by_revision_ids'  (for XML inventories) will
return fileids that *aren't* altered by the revision ids you pass, if
there happens to be a fulltext there.

Consider the case:

rev-1	add and commit foo-id
rev-2   add bar-id
rev-3   modify bar-id
	# The inventory is a fulltext here, because the size of the
	# delta chain was bigger than the size of a fulltext.

If we call "fileids_altered_by_revision_ids(['rev-3'])" for XML
repositories it will return:

 {'bar-id': ['rev-3'], 'foo-id': ['rev-1']}

My personal issue is that ['rev-3'] did *not* modify 'foo-id'. It turns
out that the CHK implementation explicitly does:

 for entry in inv.iter_just_entries():
    if entry.revision != inv.revision_id:
      continue

In other words, it walks every inventory, and all file-ids, and only
entries that are *modified in that revision* are included. Which seems
to perfectly follow the contract of 'fileids_altered_by_revision_ids'.

Looking at it a different way, if 'rev-3' was a delta we could be
missing the compression parent, but the
'get_missing_parent_inventories()' path would actually say that you
don't need to fill in the parent inventory. Though that key would be
filled in by a different code path. But it isn't relevant for CHK
repositories, as they don't (yet) have compression parents that can be
'missing'.

So I think we need to rethink how Andrew's "fix" works. I believe the
idea was:

1) We need to have enough inventory data, so that we can compare the set
of parent revisions versus the revision-to-transmit, and figure out what
text keys also need to be transmitted.

2) As a shortcut to (1) we just ensure that the inventories of all
parents of transmitted revisions are inserted. It started failing
because of entries that are genuine 'ghosts', because we can't transmit
inventories for those.

3) So Andrew's idea was to change it around a bit. Do the set difference
of file keys ourselves (using fileids_altered_by...) and see if our
guess at (1) would include any texts that we don't actually have. If it
does, then he assumes that filling in extra inventories will resolve that.

4) I think I have a test case that would break his code. Namely:

builder = self.make_branch_builder('source')
builder.build_snapshot('A-id', None, [
	(# add root),
	(# add f-id), (# add g-id)])
builder.build_snapshot('B-id', ['A-id'], [
	(# modify f-id)])
builder.build_snapshot('C-id', ['B-id', 'ghost-id'], [
	(# modify g-id)])
source_branch = builder.get_branch()
branch_base = self.make_branch('stack-on')
branch_base.pull(source_branch, stop_revision='B-id')
branch_stacked = self.make_branch('stacked')
branch_stacked.set_stacked_location('../stack-on')
branch_stacked.pull(source_branch, stop_revision='C-id')

The issue is in this final 'pull'. Specifically, it should want to
transmit just 'C-id' to 'stacked', and then fill in the inventory for B-id.

However, when it does the check for missing parents, it will see 'B-id'
*and* 'ghost-id'. When it then checks for missing text keys, it will see
that ('f-id', 'B-id') is not present [and *shouldn't* be]. It will then
request that inventory for both 'f-id', *and* 'ghost-id' be transmitted.
And it will fail to transmit 'ghost-id' for obvious reasons.

(Of course, this is assuming the XML behavior for fileids_altered...
which includes texts not actually in those revs.)

5) Now, the good of what Andrew wrote is that 'fileids_altered...' is
the code that determines what texts to send for XML repositories. Thus
it makes sense to use it here, to see if we would try to send texts that
we can't. The main problem is actually just the 'all-or-nothing' of
inventory texts. Because that fails when you have both genuinely missing
parents and ghosts in the same fetch. Then again, maybe the
get_stream_for_missing_keys is allowed to not transmit an requested
entry, and the next 'what is missing' check will realize it didn't
really need that key after all. (From what I can tell, the stream will
be requested, and it will include an AbsentContentFactory, and something
between the request and the final insert will blow-up and die.)

I also remember now that Robert explicitly changed fileids_altered... to
handle 'ghosts', which is why it now includes keys that aren't actually
modified in the revisions in question. (This was done, IIRC, to handle
stuff like bzr-svn, where the last-modified revision of a file would be
from the bzr source, but that revision would not be present in the svn
repository.)

So perhaps it is more of a bug that CHK repositories don't have
fileids_altered... match what they would transmit. Except CHK
repositories (as currently implemented) require to have complete
inventories for the revisions present. To *do* this, we transmit just
the modified chk pages for the revisions requested, and then the
missing_keys code requests *all* chk pages of the parent inventories.

The issue is that until we have a complete inventory, we *can't* use the
CHK code for determining what texts to send, because it assumes you have
a complete inventory. (Which is why we are doing the filling in the
first place.) It is "safe" in the presence of ghosts, because the
original 'what chk pages are modified' will include pages that may have
originally belonged to a ghost, and the rest will be transmitted for the
'missing_keys' check...

I could update the CHK code to be safe in the present of absent CHK
pages, but I'd rather not do that.

So I feel like what I'm saying is that the code should say:

1) Insert the minimal amount of data into the stacked repository,
pretending that it is a full repository.

2) Ask for all keys that would make the stacked repository "complete
enough" (This is the compression parents of texts and basis inventories
to allow the text-key computation.)
This request is allowed to be incomplete, due to things like genuine ghosts.

3) Run another check to see if a 'get_stream()' request would succeed
for any of the new revisions. This is stuff like doing fileids_... to
check to see if all texts referenced by the 'present revisions' are also
present. If this fails, abort.

I guess the main change breaks down into:
a) Restore the request for all missing inventories
b) Change (a) so that it skips ghost entries
c) Move the check for missing text keys out of
get_missing_parent_inventories() into something that runs after (a/b)
has finished inserting the inventories of interest.

The important thing is that the check is done *after* filling things in.
As it stands the check assumes that either it has all needed inventories
or it has none, and can't handle the case of needing 1 out of 2 of the
missing inventories.

Does this seem like a reasonable reworking of the code?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoJ5fwACgkQJdeBCYSNAAPaYgCg1MOUibWDYnFDtfkSv8zZERCn
6jQAnj4RYr/LHt2IgnBNlz7JSF68MGDE
=FW7I
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 1.15-gc-stacking.patch
Url: https://lists.ubuntu.com/archives/bazaar/attachments/20090512/86112ed7/attachment-0001.diff