Dev6 + stacking

John Arbash Meinel john at arbash-meinel.com
Wed May 13 19:31:52 BST 2009


Sorry to break continuity, but due to power and internet outages, my
mail is a bit messed up right now. In the end, though, it is okay,
because I really wanted to start from scratch. (I also apologize if
the formatting is wonky, I'm sending via gmail, and I'm not sure how
'text mode' works with indenting for lists). Writing the email ended
up revealing some things, so I'll try to summarize properly.

For starters:

1) Robert recently changed 'fileids_altered_by' to be a set difference
between the revisions listed, and any parents of those revisions whose
inventories can be found. However, CHKInventoryRepository.fileids_...
was *not* updated to conform to this.
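To make the intended behaviour concrete, something like the following (illustrative only, not the real bzrlib code; all the helper names here are hypothetical) captures the set-difference computation described in (1):

```python
# Hypothetical sketch of the set-difference behaviour: texts introduced
# by the listed revisions, minus texts already referenced by any parent
# whose inventory we can actually find (ghosts are simply skipped).

def fileids_altered_by(revision_ids, get_inventory, get_parent_ids,
                       have_inventory):
    """Return (file_id, revision) keys introduced by revision_ids.

    get_inventory(rev_id)  -> {file_id: last_modified_revision}
    get_parent_ids(rev_id) -> list of parent revision ids
    have_inventory(rev_id) -> True if that inventory is present
    """
    introduced = set()
    for rev_id in revision_ids:
        introduced.update(get_inventory(rev_id).items())
    parent_refs = set()
    for rev_id in revision_ids:
        for parent_id in get_parent_ids(rev_id):
            if have_inventory(parent_id):
                parent_refs.update(get_inventory(parent_id).items())
    return introduced - parent_refs
```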

2) This shows that there probably are no direct tests of
fileids_altered_by... under 'ghost' circumstances. It may be
indirectly tested via fetch operations.

This doesn't break CHK repositories, because they don't use
fileids_altered_by to do fetch. The fetch code has inlined the
comparison: as it reads the chk pages to determine which texts (and
further chk pages) to send, it yields each record so that it can be
transmitted on the wire.

The actual transmission code does effectively the same as (1), so it
should be correct.
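Roughly, the inlined approach looks like this (a hedged sketch with made-up names, not the actual fetch code): first mark everything reachable from the 'uninteresting' roots, then walk the interesting roots, yielding each record as soon as it is known to be needed:

```python
# Sketch only: stream pages reachable from root_keys but not reachable
# from uninteresting_keys.  get_page(key) -> (page_bytes, child_keys).

def stream_interesting_pages(root_keys, uninteresting_keys, get_page):
    seen = set()
    # First walk the uninteresting side so we know what to skip.
    pending = list(uninteresting_keys)
    while pending:
        key = pending.pop()
        if key in seen:
            continue
        seen.add(key)
        _, children = get_page(key)
        pending.extend(children)
    # Now walk the interesting side, yielding records as we go, so
    # transmission can start before the whole comparison is finished.
    pending = list(root_keys)
    while pending:
        key = pending.pop()
        if key in seen:
            continue
        seen.add(key)
        page, children = get_page(key)
        yield key, page
        pending.extend(children)
```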

However, fileids_altered_by has not been updated to match the
behavior, because it wasn't actually being used anymore.

3) I *could* update fileids_altered... to be built on top of
'iter_interesting_nodes'. However, it will currently fail if attempted
on a stacked branch before we fill in missing parent inventories.
First I'll try to define some scenarios for XML code:

  a) We have not yet filled in missing parent inventories, however the
current inventory texts are fulltexts. Under this condition
fileids_altered_by will return all texts present in the given
revision.
  b) We have not filled in missing parents, and the current inventory
texts are deltas. Here, fileids_altered_by cannot give all text keys,
but can only give the text keys for things modified in this revision.
(all it has is a line-delta.)
     I think one could argue that this is an incorrect result. One can
also argue, though, that compression parents cannot be ghosts, so this
is implicitly doing a difference between the revision and the missing
parent, as though the parent were actually present and we then removed
its items.
     I also think that this could give incorrect results for cases
where one inventory has a file claiming a last modified revision, such
that the inventory for that revision does not also claim the file is
modified. 'inv_Y[file_id].revision == X' but 'inv_X[file_id].revision
!= X'. My quick investigation doesn't show this, so I could be wrong.
  c) We have filled in all parent inventories (sans ghosts). I believe
this walks the deltas for the new inventories, and subtracts out the
set of references from the full inventories of the parents.
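A toy comparison of the three scenarios (made-up inventories, purely to show how the outputs differ):

```python
# Illustrative data only.  A 'fulltext' is a complete
# {file_id: revision} inventory; a 'delta' lists only changed entries.

full_B = {'f1': 'A', 'f2': 'B', 'f3': 'B'}   # fulltext of revision B
delta_B = {'f2': 'B', 'f3': 'B'}             # line-delta of B vs A
full_A = {'f1': 'A', 'f2': 'A'}              # parent inventory A

# (a) no parents available, fulltext: every text key in the revision.
case_a = set(full_B.items())
# (b) no parents, delta only: just the entries the delta names --
# implicitly a difference against the absent parent.
case_b = set(delta_B.items())
# (c) parent filled in: an explicit set difference.
case_c = set(full_B.items()) - set(full_A.items())
```

Note that in this example (b) and (c) agree, which is the "implicit difference" argument above; (a) is strictly larger.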

Now if we consider this for CHK:
  a) This doesn't quite follow for CHK. All individual texts will be
complete, but the total inventory shape will not be. If we had all
pages for the revisions to transmit, but not the pages for parent
inventories, iter_interesting_nodes will compare 'interesting' with
set(), and will give all text keys present in the revision.
  b) This is closest to the case where we don't have parent inventory
pages, and we only have chk pages that the client determined were
'interesting' for this revision. In this case, iter_interesting_nodes
will break, as it doesn't handle absent references. We will compare
'interesting' with set(), and see a reference to a page we don't have.
  c) With the current fill code, we will fill in the complete chk tree
for parent nodes, which should also mean that we have complete chk
trees for 'transmitted' revisions.
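The failure mode in (b) can be seen in a small sketch (hypothetical structures, not the real chk_map code): a walker that follows chk page references blows up the moment it hits a reference to a page it doesn't have.

```python
# Sketch: collect all text keys reachable from a chk root.
# pages: {page_key: (child_page_keys, text_keys)}.
# A reference to an absent page raises KeyError, which is the
# behaviour case (b) runs into.

def walk_chk_pages(root_key, pages):
    texts = set()
    pending = [root_key]
    while pending:
        key = pending.pop()
        children, page_texts = pages[key]  # KeyError if page absent
        texts.update(page_texts)
        pending.extend(children)
    return texts
```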

So the problem really is in (b), in that if we don't have either a
complete target inventory, or enough pages from the parent inventory,
the walking code will fail. One could posit that it is equivalent to
the XML case where all we have is the delta. Under those circumstances
we could just treat 'absent' chk pages as 'implicitly matching some
absent parent', the way we do for XML deltas. Having the page missing
marks it as implicitly uninteresting, otherwise the client would have
uploaded it for us.

I'm hesitant to do this, as it trusts the person who sent me the data
to have given me everything I need, rather than validating it.
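For comparison, the "trust the sender" variant I'm hesitant about would amount to something like this (again a sketch, not real code): skip absent references instead of failing on them.

```python
# Sketch of treating an absent page as 'implicitly uninteresting':
# if the sender didn't upload it, assume it is common.  This encodes
# trust in the sender rather than validating the data.

def walk_chk_pages_lenient(root_key, pages):
    texts = set()
    pending = [root_key]
    while pending:
        key = pending.pop()
        if key not in pages:
            continue  # absent page: assumed common, not validated
        children, page_texts = pages[key]
        texts.update(page_texts)
        pending.extend(children)
    return texts
```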


4) There is a point that Robert and I seem to be disagreeing on,
namely the ordering of:

  a) Transmit the minimally correct set of revisions, inventories,
(chk pages), texts, signatures.
  b) Fill in missing parent inventories, (chk pages,) and compression
parents for inventories and texts.
  c) Check to see if there are any file texts which should be present
but are not. (Based on the difference between the revisions we claim
to have, and the parent inventories which we have available.)

I think we agree that this ordering is good. Where I disagree is that
the check (c) is being done between (a) and (b). The code is:

    insert_stream(get_stream #minimal)
        => missing_keys = self.target_repo.get_missing_parent_inventories()
            => get_parent_map(revisions).difference(get_parent_map(inventories))
            => check fileids_altered_by_revision_ids()
            <= return missing-inventories if fileids missing
        => + get_missing_compression_parents()
        <= return missing_keys
    missing_stream = get_stream_for_missing_keys(missing_keys)
    missing_keys = insert_stream(missing_stream)
            => get_parent_map(revisions).difference(get_parent_map(inventories))
            => check fileids_altered_by_revision_ids()
            <= return missing-inventories if fileids missing
        => + get_missing_compression_parents()
        <= return missing_keys
    if missing_keys: abort

The problem is that fileids_altered_by_revision_ids is done inside the
first insert_stream(), which triggers 3(b) from before.  *As written*,
the chk fetch code needs either:
    a) Enough parent pages to be able to filter out known-common texts
    b) All chk pages for a revision
(Because I don't handle absent pages, and I don't particularly *want* to.)

Now if we changed the code to:
    insert_stream(get_stream #minimal)
        => missing_keys = self.target_repo.get_missing_parent_inventories()
            => get_parent_map(revisions).difference(get_parent_map(inventories))
            <= return missing-inventories
        => + get_missing_compression_parents()
        <= return missing_keys
    missing_stream = get_stream_for_missing_keys(filter_absent(missing_keys))
    missing_keys = insert_stream(missing_stream)
            => get_parent_map(revisions).difference(get_parent_map(inventories))
            => check fileids_altered_by_revision_ids()
            <= return missing-inventories if fileids missing
        => + get_missing_compression_parents()
        <= return missing_keys
    if missing_keys: abort

Then we don't have a problem. fileids_altered_by should be updated to
use the set difference code, but we don't have to support absent pages
that we don't *know* are common.

There are 2 primary changes:
    1) Don't check for missing text keys until the missing inventory
texts have been added
    2) Don't attempt to transmit ghosts for missing inventories

(1) is necessary for the current chk_map.iter_interesting_nodes code,
rather than teaching it to ignore absent records.
(2) is necessary for the case of missing some parent inventories, some
fraction of which are ghosts. (With the current XML code, you can
still have missing texts until the inventories are filled in, but the
inventory request will fail with an AbsentContentFactory because one
of the parents is a real ghost.)
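The filter_absent step in the proposed flow could be as simple as this (hypothetical helper, sketched here only to show the intent of change (2)): before requesting the 'missing' parent inventories, drop any keys the source itself does not have, i.e. the real ghosts.

```python
# Sketch: keep only the missing keys the source can actually supply,
# so the missing-inventory request cannot fail on a real ghost.
# source_has(key) -> True if the source repository has that record.

def filter_absent(missing_keys, source_has):
    return set(k for k in missing_keys if source_has(k))
```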

John
=:->



More information about the bazaar mailing list