[MERGE] filtered-deltas - basis for log DIR

Fri Mar 13 05:53:24 GMT 2009

Ian Clatworthy wrote:
> I put up a patch some weeks ago that made log DIR as fast
> as I could for pack repositories....This patch delivers the same
> functionality in a far less intrusive way.

> === modified file 'NEWS'
> --- NEWS	2009-03-12 14:02:53 +0000
> +++ NEWS	2009-03-13 01:59:44 +0000
> @@ -43,10 +43,21 @@
>  
>    API CHANGES:
>  
> +    * New API ``Inventory.filter()`` added that filters an inventory by
> +      a set of file-ids so that only those fileids, their parents and
> +      their children are included.
> +      (Ian Clatworthy)

^^ looks like it would fit on the previous line.

> === modified file 'bzrlib/inventory.py'
> --- bzrlib/inventory.py	2009-03-12 10:04:53 +0000
> +++ bzrlib/inventory.py	2009-03-13 00:29:41 +0000
> @@ -1339,6 +1339,36 @@
>      def is_root(self, file_id):
>          return self.root is not None and file_id == self.root.file_id
>  
> +    def filter(self, specific_fileids):
> +        """Copy an inventory filtering against a set of file-ids.
> +
> +        Children of directories and parents are included.
> +        """
> +        interesting_parents = set()
> +        for fileid in specific_fileids:
> +            try:
> +                interesting_parents.update(self.get_idpath(fileid))
> +            except errors.NoSuchId:
> +                # This fileid is not in the inventory - that's ok
> +                pass
> +        entries = self.iter_entries()
> +        if self.root is None:
> +            return Inventory(root_id=None)
> +        other = Inventory(entries.next()[1].file_id)

This could be spelled as:

    other = Inventory(root_id=self.root.file_id)
    for path, entry in entries:
        if entry.file_id == self.root.file_id:
            continue

which I think is a bit cleaner.  But either is fine.

> +        other.root.revision = self.root.revision
> +        other.revision_id = self.revision_id
> +        directories_to_expand = set()
> +        for path, entry in entries:
> +            file_id = entry.file_id
> +            if (file_id in specific_fileids
> +                or entry.parent_id in directories_to_expand):
> +                if entry.kind == 'directory':
> +                    directories_to_expand.add(file_id)

I don't think this is correct.  How are you ensuring that grandparents
and great-grandparents, etc. are being retained?

bb:comment until this issue is resolved.

> +            elif file_id not in interesting_parents:
> +                continue
> +            other.add(entry.copy())

It seems strange for an optimization to be doing a copy.
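
To make the expected semantics concrete, here is a minimal sketch that uses
a plain dict instead of a real Inventory. Everything in it (id_path,
filter_ids, the TREE layout) is a hypothetical stand-in, not bzrlib API. It
retains every ancestor by walking the full id path from each requested id up
to the root, and retains everything underneath a requested id. Like the
patch, it always keeps the root.

```python
# Toy model, NOT bzrlib: file_id -> (parent_id, kind); the root has parent None.
TREE = {
    "root-id": (None, "directory"),
    "src-id": ("root-id", "directory"),
    "lib-id": ("src-id", "directory"),
    "a-id": ("lib-id", "file"),
    "b-id": ("lib-id", "file"),
    "main-id": ("src-id", "file"),
    "docs-id": ("root-id", "directory"),
    "readme-id": ("docs-id", "file"),
}


def id_path(entries, file_id):
    """Return the ids from the root down to file_id, inclusive."""
    path = []
    while file_id is not None:
        path.append(file_id)
        file_id = entries[file_id][0]
    path.reverse()
    return path


def filter_ids(entries, specific_fileids):
    """Return the set of ids kept when filtering by specific_fileids."""
    # The root is always kept, mirroring the patch.
    keep = set(fid for fid, (parent, _) in entries.items() if parent is None)
    # Ancestors: every id on the path from the root to a requested id.
    for file_id in specific_fileids:
        if file_id in entries:  # ids missing from the tree are ignored
            keep.update(id_path(entries, file_id))
    # Descendants: any entry with a requested id somewhere on its own path.
    for file_id in entries:
        if any(p in specific_fileids for p in id_path(entries, file_id)):
            keep.add(file_id)
    return keep
```

With this layout, filtering by the lib directory keeps root, src, lib and
both files under lib, while main, docs and readme are dropped.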

> === modified file 'bzrlib/repository.py'
> --- bzrlib/repository.py	2009-03-12 02:45:17 +0000
> +++ bzrlib/repository.py	2009-03-13 01:59:44 +0000
> @@ -1313,19 +1313,38 @@
>          rev_tmp.seek(0)
>          return rev_tmp.getvalue()
>  
> -    def get_deltas_for_revisions(self, revisions):
> +    def get_deltas_for_revisions(self, revisions, specific_fileids=None):
>          """Produce a generator of revision deltas.
>  
>          Note that the input is a sequence of REVISIONS, not revision_ids.
>          Trees will be held in memory until the generator exits.
>          Each delta is relative to the revision's lefthand predecessor.
> +
> +        :param specific_fileids: if not None, the result is filtered
> +          so that only those file-ids, their parents and their
> +          children are included.
>          """
> +        # Get the revision-ids of interest
>          required_trees = set()
>          for revision in revisions:
>              required_trees.add(revision.revision_id)
>              required_trees.update(revision.parent_ids[:1])
> -        trees = dict((t.get_revision_id(), t) for
> -                     t in self.revision_trees(required_trees))
> +
> +        # Get the matching filtered trees. Note that it's more
> +        # efficient to pass filtered trees to changes_from() rather
> +        # than doing the filtering afterwards. changes_from() could
> +        # arguably do the filtering itself but it's path-based, not
> +        # file-id based, so filtering before or afterwards is
> +        # currently easier.

Filtering in changes_from will be more efficient, and I think that's
ultimately the right approach.  But implementing
get_deltas_for_revisions directly on CHK repositories would make a lot
of sense.
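
As a rough illustration of what filtering inside the comparison means, here
is a toy sketch. It is not the real changes_from(): trees are plain dicts
mapping file-id to a content hash, and changes() and its interesting
parameter are hypothetical names. The point is only that restricting the
comparison to the interesting ids gives the same delta as filtering both
trees first, without building filtered copies.

```python
def changes(old, new, interesting=None):
    """Return (added, removed, modified) id sets between two toy trees.

    old and new map file_id -> content hash. If interesting is given,
    only those ids are examined.
    """
    ids = set(old) | set(new)
    if interesting is not None:
        ids &= set(interesting)
    added = set(i for i in ids if i not in old)
    removed = set(i for i in ids if i not in new)
    modified = set(i for i in ids
                   if i in old and i in new and old[i] != new[i])
    return added, removed, modified


def filter_tree(tree, interesting):
    """Copy a toy tree, keeping only the interesting ids."""
    return dict((i, h) for i, h in tree.items() if i in interesting)


OLD = {"a-id": "h1", "b-id": "h2", "c-id": "h3"}
NEW = {"a-id": "h1x", "b-id": "h2", "d-id": "h4"}
WANTED = {"a-id", "d-id"}

# Filter-then-compare versus compare-with-filter: same answer.
pre_filtered = changes(filter_tree(OLD, WANTED), filter_tree(NEW, WANTED))
in_place = changes(OLD, NEW, interesting=WANTED)
```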

Aaron