[merge][#288751] avoid creating text deltas spanning repository stacks
Aaron Bentley
aaron at aaronbentley.com
Fri Nov 14 15:30:00 GMT 2008
bb:tweak
Fixing this is an immediate benefit. I don't think this does any actual
harm, but there are a bunch of issues worth looking at and possibly cleaning up.
Martin Pool wrote:
> This patch fixes a bug which causes some trouble with
> stacked repositories. The design was that text deltas would
> not span repository stacks, for the sake of robustness, and
> so that we can upgrade the inventory representation in
> underlying repositories without causing trouble to the
> stacked repository. However, this was not checked properly.
This looks fine, so far. I think with these changes, "bzr pack" would
fix external references. Should bzr recommend this if it encounters
deltas that span repos?
> This adds has_key(key) to some index(like) objects. I
> realize we don't generally want people checking for one key
> at a time, but in some cases that's all the caller
> reasonably has, and it's better than making them repeat
> boilerplate to use get_parent_map.
I don't mind the existence of the API, but I'm not sure I agree with the
rationale.
Most of our APIs should be dealing with multiple texts, so their
implementations should be written in terms of bulk operations.
> If this is merged well before 1.10 I think it would be worth putting
> into a 1.9.1 release.
Yes, Launchpad would probably switch to 1.9.1 if you did.
>
> This won't automatically repack existing repositories (unless I'm
> really lucky :-)
But "bzr pack" should correct the issue?
> === modified file 'bzrlib/index.py'
> --- bzrlib/index.py 2008-10-29 21:39:27 +0000
> +++ bzrlib/index.py 2008-11-14 07:44:57 +0000
> @@ -97,6 +97,16 @@
> if not element or _whitespace_re.search(element) is not None:
> raise errors.BadIndexKey(element)
>
> + def _external_references(self):
> + """Return references that are not present in this index.
> + """
> + keys = set()
> + refs = set()
> + for node in self.iter_all_entries():
> + keys.add(node[1])
> + refs.update(node[3][1])
> + return refs - keys
> +
> def _get_nodes_by_key(self):
> if self._nodes_by_key is None:
> nodes_by_key = {}
> @@ -1167,6 +1177,14 @@
> found_parents[key] = parents
> return found_parents
>
> + def has_key(self, key):
> + """Check if this index has one key.
> +
> + If it's possible to check for multiple keys at once through
> + calling get_parent_map that should be faster.
> + """
> + return (key in self.get_parent_map([key]))
I'm surprised to see so many identical implementations of has_key.
Perhaps we could have one implementation and just assign it to the class?

    def has_key_from_parent_map(self, key):
        return (key in self.get_parent_map([key]))

    class CombinedIndex(object):
        has_key = has_key_from_parent_map

Personally, I'd prefer a bulk operation like:

    def missing_keys(self, keys):
        map = self.get_parent_map(keys)
        return set(keys) - set(map)

That would be more convenient, but still leave room for efficiency.
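To make that concrete, here is a toy stand-in (FakeIndex and its parent
map are hypothetical, not real bzrlib classes) showing how missing_keys
composes with get_parent_map in one bulk query:

```python
class FakeIndex:
    """Hypothetical stand-in for an index whose get_parent_map
    returns entries only for keys that are actually present."""

    def __init__(self, parent_map):
        self._parent_map = parent_map

    def get_parent_map(self, keys):
        # Bulk query: present keys map to their parents; absent
        # keys are simply omitted from the result.
        return dict((k, self._parent_map[k])
                    for k in keys if k in self._parent_map)

    def missing_keys(self, keys):
        # One bulk query instead of one has_key call per key.
        return set(keys) - set(self.get_parent_map(keys))


index = FakeIndex({('a',): (), ('b',): (('a',),)})
missing = index.missing_keys([('a',), ('b',), ('c',)])
```

A caller can then write "if not index.missing_keys(parents): ..." rather
than looping over has_key one key at a time.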
> === modified file 'bzrlib/knit.py'
> --- bzrlib/knit.py 2008-10-21 03:36:14 +0000
> +++ bzrlib/knit.py 2008-11-14 07:44:57 +0000
> @@ -770,8 +770,9 @@
> present_parents = []
> if parent_texts is None:
> parent_texts = {}
> - # Do a single query to ascertain parent presence.
> - present_parent_map = self.get_parent_map(parents)
> + # Do a single query to ascertain parent presence; we only compress
> + # against parents in the same kvf.
> + present_parent_map = self._index.get_parent_map(parents)
> for parent in parents:
> if parent in present_parent_map:
> present_parents.append(parent)
> @@ -1285,6 +1286,14 @@
> missing.difference_update(set(new_result))
> return result
>
> + def has_key(self, key):
> + """Check if one key is present.
> +
> + If you have multiple keys to check at once, it's better to use
> + get_parent_map.
> + """
> + return key in self.get_parent_map([key])
Why not just implement this on VersionedFiles? That way we wouldn't
need separate implementations in each VersionedFiles subclass.
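A base-class default might look like this (a sketch under the assumption
that every subclass provides get_parent_map; all names other than
has_key/get_parent_map are illustrative, not real bzrlib classes):

```python
class VersionedFilesBase:
    """Sketch of a shared default; subclasses override get_parent_map."""

    def get_parent_map(self, keys):
        raise NotImplementedError(self.get_parent_map)

    def has_key(self, key):
        """Check one key. For many keys, call get_parent_map directly."""
        return key in self.get_parent_map([key])


class DictBackedVF(VersionedFilesBase):
    """Toy subclass: only the bulk query needs implementing;
    has_key is inherited for free."""

    def __init__(self, parent_map):
        self._parent_map = parent_map

    def get_parent_map(self, keys):
        return dict((k, self._parent_map[k])
                    for k in keys if k in self._parent_map)


vf = DictBackedVF({('k',): ()})
```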
> def insert_record_stream(self, stream):
> """Insert a record stream into this container.
>
> @@ -1330,7 +1339,17 @@
> # Raise an error when a record is missing.
> if record.storage_kind == 'absent':
> raise RevisionNotPresent([record.key], self)
> - if record.storage_kind in knit_types:
> + elif ((record.storage_kind in knit_types)
> + and (not parents
> + or self._index.has_key(parents[0])
> + or not self._fallback_vfs
> + or not self.has_key(parents[0]))):
> + # we can insert the knit record literally if either it has no
> + # compression parent OR we already have its basis in this kvf
> + # OR the basis is not present even in the fallbacks.
Since has_key costs a lot more than bool(self._fallback_vfs), that
should come first:

    elif ((record.storage_kind in knit_types)
          and (not parents
               or not self._fallback_vfs
               or self._index.has_key(parents[0])
               or not self.has_key(parents[0]))):
> # In the
> + # last case it will either turn up later in the stream and all
> + # will be well, or it won't turn up at all and we'll raise an
> + # error at the end.
It would be nice to raise here, though. Perhaps some time we can revise
the stream format so we get all its metadata, or at least a list of
keys, up-front.
It also seems like this misses the case where the key is present in the
basis *and* in the stream. Consider keys A, B, and C, where C has basis
B, and B has basis A. Say we get a stream with C(delta), B(delta),
A(fulltext), in that order. This logic would convert C and B into
fulltexts.
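A small simulation of the decision logic as quoted (decide and the key
sets below are hypothetical stand-ins, not the real insert_record_stream)
shows C and B being expanded even though their bases arrive later in the
same stream:

```python
def decide(basis, local_keys, fallback_keys):
    """Mirror the quoted condition: 'delta' when the record can be
    inserted literally, 'fulltext' when it would be expanded."""
    if basis is None:
        return 'delta'      # no compression parent at all
    if basis in local_keys:
        return 'delta'      # basis already physically in this kvf
    if not fallback_keys:
        return 'delta'      # no fallbacks: buffer, error later if absent
    if basis not in fallback_keys:
        return 'delta'      # basis absent everywhere: buffer and hope
    return 'fulltext'       # basis only in a fallback: expand


local = set()
fallback = set([('A',), ('B',)])  # bases already live in the fallback
stream = [(('C',), ('B',)), (('B',), ('A',)), (('A',), None)]
outcomes = []
for key, basis in stream:
    outcomes.append((key, decide(basis, local, fallback)))
    local.add(key)
# C and B come out as 'fulltext', A as 'delta': the deltas are
# expanded even though their bases turn up later in the stream.
```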
> if record.storage_kind not in native_types:
> try:
> adapter_key = (record.storage_kind, "knit-delta-gz")
> @@ -1358,15 +1377,19 @@
> index_entry = (record.key, options, access_memo, parents)
> buffered = False
> if 'fulltext' not in options:
> - basis_parent = parents[0]
Is this change because you feel parents[0] is clearer than basis_parent?
> + # Not a fulltext, so we need to make sure the parent
> + # versions will also be present.
> # Note that pack backed knits don't need to buffer here
> # because they buffer all writes to the transaction level,
> # but we don't expose that difference at the index level. If
> # the query here has sufficient cost to show up in
> # profiling we should do that.
> - if basis_parent not in self.get_parent_map([basis_parent]):
> + #
> + # They're required to be physically in this
> + # KnitVersionedFiles, not in a fallback.
> + if not self.has_key(parents[0]):
if not self.missing_keys(parents) would also work nicely here.
> pending = buffered_index_entries.setdefault(
> - basis_parent, [])
> + parents[0], [])
> pending.append(index_entry)
> buffered = True
> if not buffered:
> @@ -1375,6 +1398,9 @@
> self.add_lines(record.key, parents,
> split_lines(record.get_bytes_as('fulltext')))
> else:
> + # Not a fulltext, and not suitable for direct insertion as a
> + # delta, either because it's not the right format, or because
> + # it depends on a base only present in the fallback kvfs.
> adapter_key = record.storage_kind, 'fulltext'
> adapter = get_adapter(adapter_key)
> lines = split_lines(adapter.get_bytes(
> @@ -1395,8 +1421,10 @@
> del buffered_index_entries[key]
> # If there were any deltas which had a missing basis parent, error.
> if buffered_index_entries:
> - raise errors.RevisionNotPresent(buffered_index_entries.keys()[0],
> - self)
> + from pprint import pformat
> + raise errors.BzrCheckError(
> + "record_stream refers to compression parents not in %r:\n%s"
> + % (self, pformat(sorted(buffered_index_entries.keys()))))
>
> def iter_lines_added_or_present_in_keys(self, keys, pb=None):
> """Iterate over the lines in the versioned files from keys.
> === modified file 'bzrlib/repofmt/knitrepo.py'
> --- bzrlib/repofmt/knitrepo.py 2008-09-08 14:00:22 +0000
> +++ bzrlib/repofmt/knitrepo.py 2008-11-14 05:09:33 +0000
> @@ -124,9 +124,10 @@
> self._commit_builder_class = _commit_builder_class
> self._serializer = _serializer
> self._reconcile_fixes_text_parents = True
> - self._fetch_uses_deltas = True
> self._fetch_order = 'topological'
>
> + _fetch_uses_deltas = True
> +
What motivated this change? Why change only _fetch_uses_deltas, not the
other members?
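If the motivation is to let format subclasses override the flag without
redefining __init__, that is the usual reason to hoist it to a class
attribute (toy example, not the real repository classes):

```python
class Repo:
    # Class attribute: a subclass can override it with one line.
    _fetch_uses_deltas = True

    def __init__(self):
        # Instance attribute: overriding it means redefining __init__.
        self._fetch_order = 'topological'


class NoDeltaRepo(Repo):
    # No __init__ boilerplate needed to change the class-level flag.
    _fetch_uses_deltas = False
```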
> === modified file 'bzrlib/tests/interrepository_implementations/test_fetch.py'
> --- bzrlib/tests/interrepository_implementations/test_fetch.py 2008-08-10 07:00:59 +0000
> +++ bzrlib/tests/interrepository_implementations/test_fetch.py 2008-11-14 07:44:57 +0000
> @@ -101,6 +101,11 @@
> # generally do).
> try:
> to_repo.fetch(tree.branch.repository, 'rev-two')
> + except errors.BzrCheckError, e:
> + # Can also just raise a generic check errors; stream insertion
> + # does this to include all the missing data
> + self.assertRaises((errors.NoSuchRevision, errors.RevisionNotPresent),
> + to_repo.revision_tree, 'rev-two')
Why revision_tree, not get_revision?
Aaron