[MERGE] Allow stacking for --dev6

Wed May 13 05:19:03 BST 2009

On Tue, 2009-05-12 at 16:11 -0500, John Arbash Meinel wrote:
> 
> In other words, it walks every inventory, and all file-ids, and only
> entries that are *modified in that revision* are included. Which seems
> to perfectly follow the contract of 'fileids_altered_by_revision_ids'.

Thats a bug then.

We have inventories in bzr itself where 
inv['fileid'].revision != inv.revision
AND
inv['fileid'].revision not in [i['fileid'].revision for i in
inventories_of_parents_of_inv.revision]

Thats why the fileids_altered_by in packs now does a set difference. We
have to send everything thats newly referenced, and should ignore
the .revision except for comparing it to that ie in previous versions.

> Looking at it a different way, if 'rev-3' was a delta we could be
> missing the compression parent, but the
> 'get_missing_parent_inventories()' path would actually say that you
> don't need to fill in the parent inventory. Though that key would be
> filled in by a different code path. But it isn't relevant for CHK
> repositories, as they don't (yet) have compression parents that can be
> 'missing'.

This isn't about compression parents. Its about making sure we have
enough data to send a sensible stream, always, for all revisions in the
repository.

> So I think we need to rethink how Andrew's "fix" works. I believe the
> idea was:
> 
> 1) We need to have enough inventory data, so that we can compare the
> set
> of parent revisions versus the revision-to-transmit, and figure out
> what
> text keys also need to be transmitted.

Yes.

> 2) As a shortcut to (1) we just ensure that the inventories of all
> parents of transmitted revisions are inserted. It started failing
> because of entries that are genuine 'ghosts', because we can't
> transmit
> inventories for those.

It wasn't intended as a shortcut; for packs its the only sane
definition :).

> 3) So Andrew's idea was to change it around a bit. Do the set
> difference
> of file keys ourselves (using fileids_altered_by...) and see if our
> guess at (1) would include any texts that we don't actually have. If
> it
> does, then he assumes that filling in extra inventories will resolve
> that.

No, we do the set difference on client and server as insurance; see
previous mails in the thread.

> 4) I think I have a test case that would break his code. Namely:
> 
> builder = self.make_branch_builder('source')
> builder.build_snapshot('A-id', None, [
>         (# add root),
>         (# add f-id), (# add g-id)])
> builder.build_snapshot('B-id', ['A-id'], [
>         (# modify f-id)])
> builder.build_snapshot('C-id', ['B-id', 'ghost-id'], [
>         (# modify g-id)])
> source_branch = builder.get_branch()
> branch_base = self.make_branch('stack-on')
> branch_base.pull(source_branch, stop_revision='B-id')
> branch_stacked = self.make_branch('stacked')
> branch_stacked.set_stacked_location('../stack-on')
> branch_stacked.pull(source_branch, stop_revision='C-id')
> 
> The issue is in this final 'pull'. Specifically, it should want to
> transmit just 'C-id' to 'stacked', and then fill in the inventory for
> B-id.
> 
> However, when it does the check for missing parents, it will see
> 'B-id'
> *and* 'ghost-id'. When it then checks for missing text keys, it will
> see
> that ('f-id', 'B-id') is not present [and *shouldn't* be]. It will
> then
> request that inventory for both 'f-id', *and* 'ghost-id' be
> transmitted.
> And it will fail to transmit 'ghost-id' for obvious reasons.

I think you mean 'inventory for both B-id and ghost-id'. Which is fine.
And failing to send ghost-id is also fine.

> (Of course, this is assuming the XML behavior for fileids_altered...
> which includes texts not actually in those revs.)
> 
> 5) Now, the good of what Andrew wrote is that 'fileids_altered...' is
> the code that determines what texts to send for XML repositories. Thus
> it makes sense to use it here, to see if we would try to send texts
> that
> we can't. The main problem is actually just the 'all-or-nothing' of
> inventory texts. Because that fails when you have both genuinely
> missing
> parents and ghosts in the same fetch. Then again, maybe the
> get_stream_for_missing_keys is allowed to not transmit an requested
> entry, and the next 'what is missing' check will realize it didn't
> really need that key after all. (From what I can tell, the stream will
> be requested, and it will include an AbsentContentFactory, and
> something
> between the request and the final insert will blow-up and die.)

Thats right. The server advises us what it wants, and the client will
try and fail to find 'ghost-id's inventory. The server when it then
checks will see no missing texts, and ignore the fact that ghost-id's
inventory is not present.

> I also remember now that Robert explicitly changed fileids_altered...
> to
> handle 'ghosts', which is why it now includes keys that aren't
> actually
> modified in the revisions in question. (This was done, IIRC, to handle
> stuff like bzr-svn, where the last-modified revision of a file would
> be
> from the bzr source, but that revision would not be present in the svn
> repository.)

Actually, it helps bzr-svn, but the main thing was to fix 'bzr pull
bzr.dev' in a new dev6 repository. We have data that fails one of our
current constraints. And its simpler and safer to define it as 'all new
references' rather than 'references that claim to be introduced in
revisions SET'.

> So perhaps it is more of a bug that CHK repositories don't have
> fileids_altered... match what they would transmit.

I think so.

>  Except CHK
> repositories (as currently implemented) require to have complete
> inventories for the revisions present. To *do* this, we transmit just
> the modified chk pages for the revisions requested, and then the
> missing_keys code requests *all* chk pages of the parent inventories.

We can be finer grained there, if you consider two CHK inventories:
A -> [A1, A2], B ->[A1, B1], we don't need to send up the A1 page - just
sending up A's root and the A2 page will allow a complete delta to be
calculated. As long as we never read pages that are duplicated (which
iter_changes is safe on; I don't know about the similar code in the
repository layer - its another reason I want to delete the overlap.

For now though, as I've said previous - just send up the entire
inventory for the adjacent revisions. It's larger than ideal, but not
much worse than current pack stacking, if not actually better because of
page sharing.

> The issue is that until we have a complete inventory, we *can't* use
> the
> CHK code for determining what texts to send, because it assumes you
> have
> a complete inventory. (Which is why we are doing the filling in the
> first place.) It is "safe" in the presence of ghosts, because the
> original 'what chk pages are modified' will include pages that may
> have
> originally belonged to a ghost, and the rest will be transmitted for
> the
> 'missing_keys' check...
>
> I could update the CHK code to be safe in the present of absent CHK
> pages, but I'd rather not do that.
> 
> So I feel like what I'm saying is that the code should say:
> 
> 1) Insert the minimal amount of data into the stacked repository,
> pretending that it is a full repository.

This confuses me, because its not a full repository. If you mean
'transfer new data introduced by the commits being pushed' - sure.

> 2) Ask for all keys that would make the stacked repository "complete
> enough" (This is the compression parents of texts and basis
> inventories
> to allow the text-key computation.)
> This request is allowed to be incomplete, due to things like genuine
> ghosts.

This is precisely what the current sink return code does.

> 3) Run another check to see if a 'get_stream()' request would succeed
> for any of the new revisions. This is stuff like doing fileids_... to
> check to see if all texts referenced by the 'present revisions' are
> also
> present. If this fails, abort.

And this is what it does.

> 
> I guess the main change breaks down into:
> a) Restore the request for all missing inventories

Uh, I don't think its gone. Can you point me at some code?

> b) Change (a) so that it skips ghost entries
> c) Move the check for missing text keys out of
> get_missing_parent_inventories() into something that runs after (a/b)
> has finished inserting the inventories of interest.
> 
> The important thing is that the check is done *after* filling things
> in.
> As it stands the check assumes that either it has all needed
> inventories
> or it has none, and can't handle the case of needing 1 out of 2 of the
> missing inventories.
> 
> Does this seem like a reasonable reworking of the code?

I think you're misunderstanding something. I'd really like more
bandwidth with you to step through it. Andrew and I have been inspecting
all the changes we've made in this area with a keen eye towards bbc
support, and I kindof feel like there is something you're missing -
because we walked through bbc support explicitly when designing this.

-Rob
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20090513/968aa047/attachment.pgp