[MERGE] Packs. Kthxbye.

Thu Oct 18 06:43:42 BST 2007

Robert Collins wrote:
> I think the moment has come to get the pack repository format reviewed.

Thanks for the changes pushed earlier today. Some more comments below.

> -        for version in topo_sort(new_parents.items()):
> +        for version in TopoSorter(new_parents.items()).iter_topo_order():

IIUC, this is a performance enhancement quite separate to the pack
changes. Might be good to put this one up separately to get it through
faster, given there's a bit of reconciling going on right now.

> === modified file 'bzrlib/repository.py'

> -    def fileids_altered_by_revision_ids(self, revision_ids):
> -        """Find the file ids and versions affected by revisions.
> -
> -        :param revisions: an iterable containing revision ids.
> +    def _find_file_ids_from_xml_inventory_lines(self, line_iterator,
> +        revision_ids):
> +        """Helper routine for fileids_altered_by_revision_ids.
> +
> +        This performs the translation of xml lines to revision ids.
> +
> +        :param line_iterator: An iterator of lines
> +        :param revision_ids: The revision ids to filter for.

I'm good with this particular refactoring - it does make the code
clearer IMO.

> +        # TODO: jam 20070210 This should be an assert, not a translate
> +        revision_id = osutils.safe_revision_id(revision_id)

I think these two lines can go. I'm guessing this code was copied before
we went through and removed most of the safe_revision_id calls?

> +        self.count_copied = 0

This looks unnecessary. It might be harmless but it's not clear why it
required here given fetch.py is the one place that sets this attribute
for other InterRepo fetches.

> +    @needs_read_lock
> +    def missing_revision_ids(self, revision_id=None):
> +        """See InterRepository.missing_revision_ids()."""
> +        if revision_id is not None:
> +            source_ids = self.source.get_ancestry(revision_id)
> +            assert source_ids[0] is None
> +            source_ids.pop(0)
> +        else:
> +            source_ids = self.source.all_revision_ids()
> +        source_ids_set = set(source_ids)
> +        # source_ids is the worst possible case we may need to pull.
> +        # now we want to filter source_ids against what we actually
> +        # have in target, but don't try to check for existence where we know
> +        # we do not have a revision as that would be pointless.
> +        target_ids = set(self.target.all_revision_ids())
> +        actually_present_revisions = target_ids.intersection(source_ids_set)
> +        required_revisions = source_ids_set.difference(actually_present_revisions)
> +        required_topo_revisions = [rev_id for rev_id in source_ids if rev_id in required_revisions]
> +        return required_topo_revisions

It painful how "generic" this code looks and yet it varies is subtle
ways from other implementations for reasons that aren't clearly repo
specific. Comparing it with InterKnitRepo for example, why do we need to
do the possibly_present_revisions vs actually_present_revisions dance
there but not here? For ease of reference, here's the code snippet I'm
referring to:

        target_ids = set(self.target.all_revision_ids())
        possibly_present_revisions = target_ids.intersection(source_ids_set)
        actually_present_revisions =
set(self.target._eliminate_revisions_not_present(possibly_present_revisions))

>  class InterModel1and2(InterRepository):
>  
>      @classmethod
> @@ -2306,10 +2443,15 @@
>          """Be compatible with Knit1 source and Knit3 target"""
>          from bzrlib.repofmt.knitrepo import RepositoryFormatKnit3
>          try:
> -            from bzrlib.repofmt.knitrepo import RepositoryFormatKnit1, \
> -                    RepositoryFormatKnit3
> -            return (isinstance(source._format, (RepositoryFormatKnit1)) and
> -                    isinstance(target._format, (RepositoryFormatKnit3)))
> +            from bzrlib.repofmt.knitrepo import (RepositoryFormatKnit1,
> +                RepositoryFormatKnit3)
> +            from bzrlib.repofmt.pack_repo import (RepositoryFormatGraphKnit1,
> +                RepositoryFormatGraphKnit3)
> +            return (isinstance(source._format, RepositoryFormatKnit1) and
> +                    isinstance(target._format, RepositoryFormatKnit3) or
> +                    isinstance(source._format, (RepositoryFormatGraphKnit1)) and
> +                    isinstance(target._format, (RepositoryFormatGraphKnit3))
> +                    )

Any reason why Knit1->Knit3 is fine and GraphKnit1->GraphKnit3 is fine
but Knit1->GraphKnit3 (for example) is not?

So, to let you know where I'm up to overall:

* yet to complete reading pack_repo.py
* yet to read the new tests added to test_repository.py
* need to think about the change to knitrepo.py more
* otherwise I'm all happy. :-)

Well, happy when:

1. I complete the above
2. A name for the new format is agreed (pack sounds fine to me) and
   that name is propagated to pack_repo.py, bzrdir.py and repository.py.
3. NEWS gets updated.

I'm sure there are plenty of issues I'm yet to consider though so it's
good to see other's commenting and reviewing as well.

Ian C.