Incorrect tracking of missing parents?

John Arbash Meinel john at arbash-meinel.com
Fri May 8 17:26:20 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm looking at implementing stacking for -dev6 repositories, and I found
out that I need to implement _GCIndex.get_missing_parents, which is
implemented in Knits by setting a 'track parents' flag.

When I looked at it, I saw that it does this:
                if parent_refs is not None:
                    parent_refs.update(parents)
                    parent_refs.discard(key)

However, this is fairly likely to generate a lot of false positives.
Consider the case of:

 insert A => []
 insert B => [A]
 insert C => [B]
 insert D => [C]

In that case, you will discard A, but then add it back as part of adding
B. B is discarded right away (no-op), but then added when adding C.

Now, I know that the final result then does difference update:
    def get_missing_parents(self):
        """Return the keys of missing parents."""
        # We may have false positives, so filter those out.
        self._external_parent_refs.difference_update(
            self.get_parent_map(self._external_parent_refs))
        return frozenset(self._external_parent_refs)

Now, we have to do the final set difference, because we aren't checking
un-accessed indexes yet. (The stacked location could already have those
parents from a previous insert.)

I just get the feeling that the parent_refs.discard() from above isn't
really helping, and actually keeping track via 2 sets might be a lot
better. This is also true for

    def scan_unvalidated_index(self, graph_index):
...
        if self._external_parent_refs is not None:
            # Add parent refs from graph_index (and discard parent refs that
            # the graph_index has).
            for node in graph_index.iter_all_entries():
                self._external_parent_refs.update(node[3][0])
                self._external_parent_refs.discard(node[1])

^- Again, depending on the ordering here, the 'discard()' will be
discarding everything *before* it gets added. I think it would be better
to do:

  if self._external_parent_refs is not None:
    new_refs = set()
    existing_refs = set()
    for node in graph_index.iter_all_entries():
      new_refs.update(node[3][0])
      existing_refs.add(node[1])
   self._external_parent_refs.update(new_refs)
   self._external_parent_refs = \
        self._external_parent_refs.difference(existing_refs)
   # diff_update is probably fine, since we know new_refs is generally >
   # than existing_refs.


To 'get it done', I'm starting with the bad form of just walking
'iter_all_entries()' over everything, but I figure I should make sure I
understand the existing algorithm, and how to bring it across.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoEXSsACgkQJdeBCYSNAAPjEwCfSSaE4AvLrtLFkeCV0/7begyi
jjUAoIUBZ48vpfVXEjX2otBWOa5n2kSe
=4MIJ
-----END PGP SIGNATURE-----



More information about the bazaar mailing list