[merge][#288751] avoid creating text deltas spanning repository stacks

Aaron Bentley aaron at aaronbentley.com
Fri Nov 14 15:30:00 GMT 2008


bb:tweak

Fixing this is an immediate benefit.  I don't think this does any actual
harm, but there are a bunch of issues worth looking at and possibly
cleaning up.

Martin Pool wrote:
> This patch fixes a bug which causes some trouble with
> stacked repositories.  The design was that text deltas would
> not span repository stacks, for the sake of robustness, and
> so that we can upgrade the inventory representation in
> underlying repositories without causing trouble to the
> stacked repository.  However, this was not checked properly.

This looks fine, so far.  I think with these changes, "bzr pack" would
fix external references.  Should bzr recommend this if it encounters
deltas that span repos?
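
If it did, the check might be as simple as this sketch (hypothetical
wiring, not part of the patch; it leans on the patch's new
_external_references helper and bzrlib.trace):

from bzrlib import trace

def warn_if_spanning(index):
    # uses the patch's new _external_references to spot bases that are
    # not present in this index
    if index._external_references():
        trace.note('This repository contains deltas with external bases;'
                   ' run "bzr pack" to make it self-contained.')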

> This adds has_key(key) to some index(like) objects.  I
> realize we don't generally want people checking for one key
> at a time, but in some cases that's all the caller
> reasonably has, and it's better than making them repeat
> boilerplate to use get_parent_map.

I don't mind the existence of the API, but I'm not sure I agree with the
rationale.

Most of our APIs should be dealing with multiple texts, so their
implementations should be written for the bulk case in the first place.


> If this is merged well before 1.10 I think it would be worth putting
> into a 1.9.1 release.

Yes, Launchpad would probably switch to 1.9.1 if you did.

> 
> This won't automatically repack existing repositories (unless I'm
> really lucky :-)

But "bzr pack" should correct the issue?

> === modified file 'bzrlib/index.py'
> --- bzrlib/index.py	2008-10-29 21:39:27 +0000
> +++ bzrlib/index.py	2008-11-14 07:44:57 +0000
> @@ -97,6 +97,16 @@
>              if not element or _whitespace_re.search(element) is not None:
>                  raise errors.BadIndexKey(element)
>  
> +    def _external_references(self):
> +        """Return references that are not present in this index.
> +        """
> +        keys = set()
> +        refs = set()
> +        for node in self.iter_all_entries():
> +            keys.add(node[1])
> +            refs.update(node[3][1])
> +        return refs - keys
> +
>      def _get_nodes_by_key(self):
>          if self._nodes_by_key is None:
>              nodes_by_key = {}
> @@ -1167,6 +1177,14 @@
>              found_parents[key] = parents
>          return found_parents
>  
> +    def has_key(self, key):
> +        """Check if this index has one key.
> +
> +        If it's possible to check for multiple keys at once through 
> +        calling get_parent_map that should be faster.
> +        """
> +        return (key in self.get_parent_map([key]))

I'm surprised to see so many identical implementations of has_key.
Perhaps we could have one implementation and just assign it to the class?


def has_key_from_parent_map(self, key):
    return (key in self.get_parent_map([key]))

class CombinedGraphIndex(object):

    has_key = has_key_from_parent_map


Personally, I'd prefer a bulk operation like:

def missing_keys(self, keys):
    parent_map = self.get_parent_map(keys)
    return set(keys) - set(parent_map)

That would be more convenient, but still leave room for efficiency.
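And the single-key form falls out of the bulk one for free:

def has_key(self, key):
    # degenerate single-key case of missing_keys
    return not self.missing_keys([key])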

> === modified file 'bzrlib/knit.py'
> --- bzrlib/knit.py	2008-10-21 03:36:14 +0000
> +++ bzrlib/knit.py	2008-11-14 07:44:57 +0000
> @@ -770,8 +770,9 @@
>          present_parents = []
>          if parent_texts is None:
>              parent_texts = {}
> -        # Do a single query to ascertain parent presence.
> -        present_parent_map = self.get_parent_map(parents)
> +        # Do a single query to ascertain parent presence; we only compress
> +        # against parents in the same kvf.
> +        present_parent_map = self._index.get_parent_map(parents)
>          for parent in parents:
>              if parent in present_parent_map:
>                  present_parents.append(parent)
> @@ -1285,6 +1286,14 @@
>              missing.difference_update(set(new_result))
>          return result
>  
> +    def has_key(self, key):
> +        """Check if one key is present.
> +
> +        If you have multiple keys to check at once, it's better to use
> +        get_parent_map.
> +        """
> +        return key in self.get_parent_map([key])

Why not just implement this on VersionedFiles?  That way we wouldn't
need separate implementations for each VFS implementation.
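
Something like this on the base class would do (just a sketch; it
assumes nothing beyond the existing get_parent_map contract):

class VersionedFiles(object):

    def has_key(self, key):
        """Check if one key is present.

        If you have many keys to check, get_parent_map is cheaper
        per key.
        """
        return key in self.get_parent_map([key])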


>      def insert_record_stream(self, stream):
>          """Insert a record stream into this container.
>  
> @@ -1330,7 +1339,17 @@
>              # Raise an error when a record is missing.
>              if record.storage_kind == 'absent':
>                  raise RevisionNotPresent([record.key], self)
> -            if record.storage_kind in knit_types:
> +            elif ((record.storage_kind in knit_types) 
> +                  and (not parents
> +                       or self._index.has_key(parents[0])
> +                       or not self._fallback_vfs
> +                       or not self.has_key(parents[0]))):
> +                # we can insert the knit record literally if either it has no
> +                # compression parent OR we already have its basis in this kvf
> +                # OR the basis is not present even in the fallbacks.

Since has_key costs a lot more than bool(self._fallback_vfs), that
should come first.

Something like:

elif ((record.storage_kind in knit_types)
      and (not parents
           or not self._fallback_vfs
           or self._index.has_key(parents[0])
           or not self.has_key(parents[0]))):

> +                # In the last case it will either turn up later in the
> +                # stream and all will be well, or it won't turn up at all
> +                # and we'll raise an error at the end.

It would be nice to raise here, though.  Perhaps some time we can revise
the stream format so we get all its metadata, or at least a list of
keys, up-front.
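
Rough sketch of what an up-front key list would buy us, as a fragment
at the top of insert_record_stream (stream.keys() is a hypothetical
future API, and parents[0] is treated as the compression basis, as in
the code above):

stream_keys = set(stream.keys())  # assumed future API
for record in stream:
    if record.parents:
        basis = record.parents[0]
        if basis not in stream_keys and not self.has_key(basis):
            # fail fast instead of buffering until the end
            raise errors.RevisionNotPresent([basis], self)
    # ... then insert the record as before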

It also seems like this misses the case where the key is present in the
basis *and* in the stream.  Consider keys A, B and C, where C has basis
B, and B has basis A.  Say we get a stream with C(delta), B(delta),
A(fulltext), in that order.  This logic would convert C and B into
fulltexts.
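
Toy model of that condition (not bzrlib code; records are (key, kind,
basis) tuples, and local/fallback are plain sets of keys):

def insert_stream(records, local, fallback):
    converted = []
    for key, kind, basis in records:
        if kind == 'delta' and not (basis in local
                                    or not fallback
                                    or basis not in (local | fallback)):
            # the basis is only in the fallback, so the delta gets
            # expanded, even though its basis shows up later in this
            # very stream
            kind = 'fulltext'
            converted.append(key)
        local.add(key)
    return converted

stream = [('C', 'delta', 'B'), ('B', 'delta', 'A'), ('A', 'fulltext', None)]
insert_stream(stream, local=set(), fallback=set(['A', 'B']))
# => ['C', 'B']: both deltas are converted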


>                  if record.storage_kind not in native_types:
>                      try:
>                          adapter_key = (record.storage_kind, "knit-delta-gz")
> @@ -1358,15 +1377,19 @@
>                  index_entry = (record.key, options, access_memo, parents)
>                  buffered = False
>                  if 'fulltext' not in options:
> -                    basis_parent = parents[0]

Is this change because you feel parents[0] is clearer than basis_parent?

> +                    # Not a fulltext, so we need to make sure the parent
> +                    # versions will also be present.  
>                      # Note that pack backed knits don't need to buffer here
>                      # because they buffer all writes to the transaction level,
>                      # but we don't expose that difference at the index level. If
>                      # the query here has sufficient cost to show up in
>                      # profiling we should do that.
> -                    if basis_parent not in self.get_parent_map([basis_parent]):
> +                    # 
> +                    # They're required to be physically in this
> +                    # KnitVersionedFiles, not in a fallback.
> +                    if not self.has_key(parents[0]):

if self.missing_keys(parents[:1]) would also work nicely here.

>                          pending = buffered_index_entries.setdefault(
> -                            basis_parent, [])
> +                            parents[0], [])
>                          pending.append(index_entry)
>                          buffered = True
>                  if not buffered:
> @@ -1375,6 +1398,9 @@
>                  self.add_lines(record.key, parents,
>                      split_lines(record.get_bytes_as('fulltext')))
>              else:
> +                # Not a fulltext, and not suitable for direct insertion as a
> +                # delta, either because it's not the right format, or because
> +                # it depends on a base only present in the fallback kvfs.
>                  adapter_key = record.storage_kind, 'fulltext'
>                  adapter = get_adapter(adapter_key)
>                  lines = split_lines(adapter.get_bytes(
> @@ -1395,8 +1421,10 @@
>                      del buffered_index_entries[key]
>          # If there were any deltas which had a missing basis parent, error.
>          if buffered_index_entries:
> -            raise errors.RevisionNotPresent(buffered_index_entries.keys()[0],
> -                self)
> +            from pprint import pformat
> +            raise errors.BzrCheckError(
> +                "record_stream refers to compression parents not in %r:\n%s"
> +                % (self, pformat(sorted(buffered_index_entries.keys()))))
>  
>      def iter_lines_added_or_present_in_keys(self, keys, pb=None):
>          """Iterate over the lines in the versioned files from keys.

> === modified file 'bzrlib/repofmt/knitrepo.py'
> --- bzrlib/repofmt/knitrepo.py	2008-09-08 14:00:22 +0000
> +++ bzrlib/repofmt/knitrepo.py	2008-11-14 05:09:33 +0000
> @@ -124,9 +124,10 @@
>          self._commit_builder_class = _commit_builder_class
>          self._serializer = _serializer
>          self._reconcile_fixes_text_parents = True
> -        self._fetch_uses_deltas = True
>          self._fetch_order = 'topological'
>  
> +    _fetch_uses_deltas = True
> +

What motivated this change?  Why change only _fetch_uses_deltas, not the
other members?
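
(For what it's worth, the only visible difference I can see is that the
value becomes readable on the class itself -- toy illustration, names
made up:)

class SomeRepoFormat(object):
    _fetch_uses_deltas = True  # class attribute: readable with no instance

    def __init__(self):
        self._fetch_order = 'topological'  # still set per-instance

SomeRepoFormat._fetch_uses_deltas  # => True, without constructing a repo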


> === modified file 'bzrlib/tests/interrepository_implementations/test_fetch.py'
> --- bzrlib/tests/interrepository_implementations/test_fetch.py	2008-08-10 07:00:59 +0000
> +++ bzrlib/tests/interrepository_implementations/test_fetch.py	2008-11-14 07:44:57 +0000
> @@ -101,6 +101,11 @@
>          # generally do).
>          try:
>              to_repo.fetch(tree.branch.repository, 'rev-two')
> +        except errors.BzrCheckError, e:
> +            # Can also just raise a generic check errors; stream insertion
> +            # does this to include all the missing data
> +            self.assertRaises((errors.NoSuchRevision, errors.RevisionNotPresent),
> +                              to_repo.revision_tree, 'rev-two')

Why revision_tree, not get_revision?

Aaron