[MERGE] Update to bzr.dev.

Martin Pool mbp at canonical.com
Fri Jun 13 08:10:54 BST 2008


I haven't finished everything by any means, but here are some partial comments.

Great ratio of addition to deletion, and the new code is pretty clear and clean.

I think generally indexing by tuples is pretty good, but I am a bit disturbed
by the new prevalence of code like this::

+            # Adapt to tuple interface:
+            if len(key) == 2:
+                prefix = key[:1]
+            else:
+                prefix = ()

All this slicing of tuples is a bit unclear and easy to get wrong, particularly
when the same code deals with tuples of different lengths, and because things
get so confusing if a string is passed where there should be a tuple.

I see Andrew also mentioned this so maybe we should do something about it. :-)

I'm not sure what to do.  Some of it is endemic to Python.  Perhaps we could
try to get everything to use the same tuple length for keys, so that we can at
least unpack them with "a, b, c = key" and avoid conditionals.
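For example (purely illustrative), if text keys here were always
(file_id, revision_id) the prefix handling could become a plain unpack::

    # Conditional slicing, as in the patch:
    if len(key) == 2:
        prefix = key[:1]
    else:
        prefix = ()

    # With a fixed key shape (assuming every key is a 2-tuple here) the
    # parts unpack directly and the conditional goes away:
    file_id, revision_id = key
    prefix = (file_id,)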

=== modified file 'NEWS'
--- NEWS	2008-06-11 04:20:56 +0000
+++ NEWS	2008-06-11 07:22:00 +0000
@@ -33,6 +33,10 @@

   INTERNALS:

+    * New ``versionedfile.KeyMapper`` interface to abstract out the access to
+      underyling .knit/.kndx etc files in repositories with partitioned
+      storage. (Robert Collins)
+

Should also mention

  Developer-use command weave-join is removed.

  Removed Repository.text_store, control_store, (others?)

  This adds a few new public attributes to Repository's public interface:
  .inventories, .revisions, .signatures and .texts.  These are each
  VersionedFiles.





 bzr 1.6beta2 2008-06-10
 -----------------------

=== modified file 'bzrlib/annotate.py'
--- bzrlib/annotate.py	2008-04-24 07:22:53 +0000
+++ bzrlib/annotate.py	2008-06-11 04:20:16 +0000
@@ -112,8 +112,8 @@

 def _annotations(repo, file_id, rev_id):
     """Return the list of (origin,text) for a revision of a file in a
repository."""
-    w = repo.weave_store.get_weave(file_id, repo.get_transaction())
-    return w.annotate(rev_id)
+    annotations = repo.texts.annotate((file_id, rev_id))
+    return [(key[-1], line) for key, line in annotations]

(trivial) I'd rather put parentheses around both tuples in the comprehension.
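That is, roughly::

    return [(key[-1], line) for (key, line) in annotations]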

 def _annotate_file(branch, rev_id, file_id):

=== modified file 'bzrlib/branch.py'
=== modified file 'bzrlib/builtins.py'
=== modified file 'bzrlib/bundle/serializer/v4.py'
--- bzrlib/bundle/serializer/v4.py	2008-05-08 04:33:38 +0000
+++ bzrlib/bundle/serializer/v4.py	2008-06-11 04:20:16 +0000
@@ -296,56 +301,27 @@
         self.bundle.add_info_record(serializer=serializer_format,
                                     supports_rich_root=supports_rich_root)

-    def iter_file_revisions(self):
-        """Iterate through all relevant revisions of all files.
-
-        This is the correct implementation, but is not compatible with bzr.dev,
-        because certain old revisions were not converted correctly, and have
-        the wrong "revision" marker in inventories.
-        """
-        transaction = self.repository.get_transaction()
-        altered = self.repository.fileids_altered_by_revision_ids(
-            self.revision_ids)
-        for file_id, file_revision_ids in altered.iteritems():
-            vf = self.repository.weave_store.get_weave(file_id, transaction)
-            yield vf, file_id, file_revision_ids
-
-    def iter_file_revisions_aggressive(self):
-        """Iterate through all relevant revisions of all files.
-
-        This uses the standard iter_file_revisions to determine what revisions
-        are referred to by inventories, but then uses the versionedfile to
-        determine what the build-dependencies of each required revision.
-
-        All build dependencies which are not ancestors of the base revision
-        are emitted.
-        """
-        for vf, file_id, file_revision_ids in self.iter_file_revisions():
-            new_revision_ids = set()
-            pending = list(file_revision_ids)
-            while len(pending) > 0:
-                revision_id = pending.pop()
-                if revision_id in new_revision_ids:
-                    continue
-                if revision_id in self.base_ancestry:
-                    continue
-                new_revision_ids.add(revision_id)
-                pending.extend(vf.get_parent_map([revision_id])[revision_id])
-            yield vf, file_id, new_revision_ids

It looks like the special algorithm used here is no longer active; is it really
not needed in sending this kind of data anymore?

-
     def write_files(self):
         """Write bundle records for all revisions of all files"""
-        for vf, file_id, revision_ids in self.iter_file_revisions():
-            self.add_mp_records('file', file_id, vf, revision_ids)
+        texts = self.repository.texts
+        text_keys = []
+        for file_id, revision_ids in \
+            self.repository.fileids_altered_by_revision_ids(
+                self.revision_ids).iteritems():
+            revision_ids = list(revision_ids)
+            for revision_id in revision_ids:
+                text_keys.append((file_id, revision_id))
+        self.add_mp_records_keys('file', texts, text_keys)

What is the function of the list coercion here?  If there is some side effect,
can you explain it?

It's not worth binding texts to a local afaics as it's only used once.
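For what it's worth, without the coercion or the local binding I'd expect it
to read roughly like this (an untested sketch)::

    def write_files(self):
        """Write bundle records for all revisions of all files"""
        altered = self.repository.fileids_altered_by_revision_ids(
            self.revision_ids)
        text_keys = []
        for file_id, revision_ids in altered.iteritems():
            for revision_id in revision_ids:
                text_keys.append((file_id, revision_id))
        self.add_mp_records_keys('file', self.repository.texts, text_keys)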


     def write_revisions(self):
         """Write bundle records for all revisions and signatures"""
-        inv_vf = self.repository.get_inventory_weave()
-        revision_order = list(multiparent.topo_iter(inv_vf, self.revision_ids))
+        inv_vf = self.repository.inventories
+        revision_order = [key[-1] for key in multiparent.topo_iter_keys(inv_vf,
+            self.revision_keys)]
         if self.target is not None and self.target in self.revision_ids:
             revision_order.remove(self.target)
             revision_order.append(self.target)
-        self.add_mp_records('inventory', None, inv_vf, revision_order)
+        self.add_mp_records_keys('inventory', inv_vf, [(revid,) for revid in revision_order])
         parent_map = self.repository.get_parent_map(revision_order)
         for revision_id in revision_order:
             parents = parent_map.get(revision_id, None)
@@ -374,17 +350,22 @@
                 base = parents[0]
         return base, target

-    def add_mp_records(self, repo_kind, file_id, vf, revision_ids):
+    def add_mp_records_keys(self, repo_kind, vf, keys):

Does this need to be public?

@@ -500,23 +481,19 @@
                 if self._info is not None:
                     raise AssertionError()
                 self._handle_info(metadata)
-            if (repo_kind, file_id) != ('file', current_file):
-                if len(pending_file_records) > 0:
-                    self._install_mp_records(current_versionedfile,
-                                             pending_file_records)
+            if (pending_file_records and
+                (repo_kind, file_id) != ('file', current_file)):
+                # Flush the data for a single file - prevents memory
+                # spiking due to buffering all files in memory.
+                self._install_mp_records_keys(self._repository.texts,
+                    pending_file_records)
                 current_file = None
-                current_versionedfile = None
-                pending_file_records = []
+                del pending_file_records[:]
             if len(pending_inventory_records) > 0 and repo_kind != 'inventory':
-                self._install_inventory_records(inventory_vf,
-                                                pending_inventory_records)
+                self._install_inventory_records(pending_inventory_records)
                 pending_inventory_records = []
             if repo_kind == 'inventory':
-                if inventory_vf is None:
-                    inventory_vf = self._repository.get_inventory_weave()
-                if revision_id not in inventory_vf:
-                    pending_inventory_records.append((revision_id, metadata,
-                                                      bytes))
+                pending_inventory_records.append(((revision_id,), metadata, bytes))

You seem to be removing the check of whether the inventory is already in the
repository; can you explain why?  (I'm not immediately sure why that check was
there before, though.)
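If the check is still wanted, I'd guess something along these lines would work
against the new interface (untested; using get_parent_map as a presence probe,
as elsewhere in the patch)::

    if repo_kind == 'inventory':
        inv_key = (revision_id,)
        present = self._repository.inventories.get_parent_map([inv_key])
        if inv_key not in present:
            pending_inventory_records.append((inv_key, metadata, bytes))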

     def _install_revision(self, revision_id, metadata, text):
         if self._repository.has_revision(revision_id):
             return
-        if self._info['serializer'] == self._repository._serializer.format_num:
-            self._repository._add_revision_text(revision_id, text)
-        else:
-            revision = self._source_serializer.read_revision_from_string(text)
-            self._repository.add_revision(revision.revision_id, revision)
+        revision = self._source_serializer.read_revision_from_string(text)
+        self._repository.add_revision(revision.revision_id, revision)

So this seems to be changing to always deserialize the revision and then store
it into the repository as an object rather than as a text.  That seems pretty
reasonable...

     def _install_signature(self, revision_id, metadata, text):
         transaction = self._repository.get_transaction()
-        if self._repository._revision_store.has_signature(revision_id,
-                                                          transaction):
+        if self._repository.has_signature_for_revision_id(revision_id):
             return
-        self._repository._revision_store.add_revision_signature_text(
-            revision_id, text, transaction)
+        self._repository.add_signature_text(revision_id, text)




=== modified file 'bzrlib/bzrdir.py'
--- bzrlib/bzrdir.py	2008-05-22 05:48:22 +0000
+++ bzrlib/bzrdir.py	2008-06-11 07:22:00 +0000
@@ -61,8 +63,7 @@
     )
 from bzrlib.smart.client import _SmartClient
 from bzrlib.smart import protocol
-from bzrlib.store.revision.text import TextRevisionStore
-from bzrlib.store.text import TextStore
+from bzrlib.repofmt.weaverepo import TextRevisionStore, RevisionTextStore
 from bzrlib.store.versioned import WeaveStore
 from bzrlib.transactions import WriteTransaction
 from bzrlib.transport import (

Ideally we would avoid loading the weaverepo implementation unless it was
actually going to be used, but I'm not sure it will be significant.  I thought
weaverepo was basically all old code, so I'm surprised we would be moving
things into it or adding references to it.
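If it matters, the import could presumably be deferred with the existing
lazy_import machinery (a sketch; I haven't checked whether bzrdir already
pulls weaverepo in indirectly)::

    from bzrlib.lazy_import import lazy_import
    lazy_import(globals(), """
    from bzrlib.repofmt import weaverepo
    """)
    # ...and then refer to weaverepo.TextRevisionStore etc. only at the call
    # sites that actually need the weave formats.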

=== modified file 'bzrlib/repofmt/weaverepo.py'
+        return [key[-1] for key in self.revisions.keys()]
+
+    def _activate_new_inventory(self):
+        """Put a replacement inventory.new into use as inventories."""
+        # Copy the content across
+        t = self._transport
+        t.copy('inventory.new.weave', 'inventory.weave')
+        # delete the temp inventory
+        t.delete('inventory.new.weave')
+        # Check we can parse the new weave properly as a sanity check
+        self.inventories.keys()
+
+    def _backup_inventory(self):
+        t = self._transport
+        t.copy('inventory.weave', 'inventory.backup.weave')
+
+    def _temp_inventories(self):
+        t = self._transport
+        return self._format._get_inventories(t, self, 'inventory.new')

These seem to be duplicated between AllInOneRepository and
WeaveMetaDirRepository; so is _inventory_add_lines.  Rather than duplicating
them, maybe we could make another base class to hold them.
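Roughly what I have in mind (the class name is hypothetical, just to show the
shape)::

    class CommonWeaveRepository(object):
        """Helpers shared by AllInOneRepository and WeaveMetaDirRepository."""

        def _activate_new_inventory(self):
            """Put a replacement inventory.new into use as inventories."""
            t = self._transport
            t.copy('inventory.new.weave', 'inventory.weave')
            t.delete('inventory.new.weave')
            # Check we can parse the new weave properly as a sanity check
            self.inventories.keys()

        def _backup_inventory(self):
            self._transport.copy('inventory.weave', 'inventory.backup.weave')

        def _temp_inventories(self):
            return self._format._get_inventories(self._transport, self,
                'inventory.new')

        # _inventory_add_lines would move here as well.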

Is Transport.copy guaranteed to overwrite an existing file (e.g. on Windows)?  I
don't think it is.
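If copy() indeed won't overwrite, maybe get()/put_file() would be more
portable (untested)::

    t = self._transport
    # put_file replaces an existing file, which copy may not do everywhere:
    t.put_file('inventory.weave', t.get('inventory.new.weave'))
    t.delete('inventory.new.weave')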

+class TextVersionedFiles(VersionedFiles):
+    """Just-a-bunch-of-files based VersionedFile stores."""
+
+    def __init__(self, transport, compressed, mapper, is_locked, can_write):
+        self._compressed = compressed
+        self._transport = transport
+        self._mapper = mapper
+        if self._compressed:
+            self._ext = '.gz'
+        else:
+            self._ext = ''
+        self._is_locked = is_locked
+        self._can_write = can_write
+
+    def add_lines(self, key, parents, lines):
+        """Add a revision to the store."""
+        if not self._is_locked():
+            raise errors.ObjectNotLocked(self)
+        if not self._can_write():
+            raise errors.ReadOnlyError(self)
+        if '/' in key[-1]:
+            raise ValueError('bad idea to put / in %r' % key)

Should be

   % (key,)
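i.e. with a tuple key the current form blows up rather than reporting the
key::

    key = ('file-id', 'rev-1')
    'bad idea to put / in %r' % key
    # -> TypeError: not all arguments converted during string formatting
    'bad idea to put / in %r' % (key,)
    # -> "bad idea to put / in ('file-id', 'rev-1')"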

=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py	2008-06-04 07:29:35 +0000
+++ bzrlib/repository.py	2008-06-11 07:22:00 +0000
@@ -408,18 +408,14 @@
         return self._get_delta(ie, basis_inv, path), True

     def _add_text_to_weave(self, file_id, new_lines, parents, nostore_sha):
-        versionedfile = self.repository.weave_store.get_weave_or_empty(
-            file_id, self.repository.get_transaction())
-        # Don't change this to add_lines - add_lines_with_ghosts is cheaper
-        # than add_lines, and allows committing when a parent is ghosted for
-        # some reason.
         # Note: as we read the content directly from the tree, we know its not
         # been turned into unicode or badly split - but a broken tree
         # implementation could give us bad output from readlines() so this is
         # not a guarantee of safety. What would be better is always checking
         # the content during test suite execution. RBC 20070912
-        return versionedfile.add_lines_with_ghosts(
-            self._new_revision_id, parents, new_lines,
+        parent_keys = tuple((file_id, parent) for parent in parents)
+        return self.repository.texts.add_lines(
+            (file_id, self._new_revision_id), parent_keys, new_lines,
             nostore_sha=nostore_sha, random_id=self.random_revid,
             check_content=False)[0:2]

@@ -454,6 +450,12 @@
     describe the disk data format and the way of accessing the (possibly
     remote) disk.

+    :ivar revisions: A bzrlib.versionedfile.VersionedFiles instance containing
+        the serialised revisions for the repository. This can be used to obtain
+        revision graph information or to access raw serialised revisions.
+        The result of trying to insert data into the repository via this store
+        is undefined: it should be considered read-only except for implementors
+        of repositories.
     :ivar _transport: Transport for file access to repository, typically
         pointing to .bzr/repository.
     """

This should also list inventories, signatures, texts.
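e.g. following the pattern of the revisions entry (wording is just a
suggestion)::

    :ivar inventories: A VersionedFiles instance containing the serialised
        inventories for the repository, keyed by (revision_id,).
    :ivar signatures: A VersionedFiles instance containing the serialised
        revision signatures, keyed by (revision_id,).
    :ivar texts: A VersionedFiles instance containing the file texts,
        keyed by (file_id, revision_id).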

@@ -610,12 +608,12 @@
         """Construct the current default format repository in a_bzrdir."""
         return RepositoryFormat.get_default_format().initialize(a_bzrdir)

-    def __init__(self, _format, a_bzrdir, control_files,
-                 _revision_store, control_store, text_store):
+    def __init__(self, _format, a_bzrdir, control_files):
         """instantiate a Repository.

         :param _format: The format of the repository on disk.
         :param a_bzrdir: The BzrDir of the repository.
+        :param revisions: The revisions store for the repository.

         In the future we will have a single api for all stores for
         getting file texts, inventories and revisions, then

There is no parameter 'revisions'.

@@ -766,10 +759,11 @@
                 last_revision.timezone)

         # now gather global repository information
+        # XXX: This is available for many repos regardless of listability.
         if self.bzrdir.root_transport.listable():
-            c, t = self._revision_store.total_size(self.get_transaction())
-            result['revisions'] = c
-            result['size'] = t
+            # XXX: do we want to __define len__() ?
+            result['revisions'] = len(self.revisions.keys())
+            # result['size'] = t
         return result


Can you expand on the first comment?

On the second: I would say that generally we _don't_ want to define __len__
any more on things that are not really collections.


     def find_branches(self, using=False):
@@ -1061,16 +1004,20 @@
         """True if this repository has a copy of the revision."""
         return revision_id in self.has_revisions((revision_id,))

+    @needs_read_lock
     def has_revisions(self, revision_ids):
         """Probe to find out the presence of multiple revisions.

         :param revision_ids: An iterable of revision_ids.
         :return: A set of the revision_ids that were present.
         """
-        raise NotImplementedError(self.has_revisions)
-
-        return self._revision_store.has_revision_id(revision_id,
-                                                    self.get_transaction())
+        parent_map = self.revisions.get_parent_map(
+            [(rev_id,) for rev_id in revision_ids])
+        result = set()
+        if _mod_revision.NULL_REVISION in revision_ids:
+            result.add(_mod_revision.NULL_REVISION)
+        result.update([key[0] for key in parent_map])
+        return result

This is a pretty low-level method; I'd tend to make the caller do the locking.
The concrete implementations did not have auto locking before.
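i.e. drop the decorator here and have callers take the lock themselves,
roughly::

    repo.lock_read()
    try:
        present = repo.has_revisions(revision_ids)
    finally:
        repo.unlock()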


     @needs_read_lock
     def get_revision(self, revision_id):
@@ -1510,10 +1459,21 @@

     def _iter_inventories(self, revision_ids):
         """single-document based inventory iteration."""
-        texts = self.get_inventory_weave().get_texts(revision_ids)
-        for text, revision_id in zip(texts, revision_ids):
+        for text, revision_id in self._iter_inventory_xmls(revision_ids):
             yield self.deserialise_inventory(revision_id, text)

+    def _iter_inventory_xmls(self, revision_ids):
+        keys = [(revision_id,) for revision_id in revision_ids]
+        stream = self.inventories.get_record_stream(keys, 'unordered', True)
+        texts = {}
+        for record in stream:
+            if record.storage_kind != 'absent':
+                texts[record.key] = record.get_bytes_as('fulltext')
+            else:
+                raise errors.NoSuchRevision(self, record.key)
+        for key in keys:
+            yield texts[key], key[-1]
+

Can we rename this to something that doesn't mention xml?  Making APIs that
assume they're xml is not so good when we may want to move away from that.

     def deserialise_inventory(self, revision_id, xml):
         """Transform the xml into an inventory object.

=== modified file 'bzrlib/versionedfile.py'

+class KeyMapper(object):
+    """KeyMappers map between keys and underlying paritioned storage."""
+
+    def map(self, key):
+        """Map key to an underlying storage identifier.
+
+        :param key: A key tuple e.g. ('file-id', 'revision-id').
+        :return: An underlying storage identifier, specific to the partitioning
+            mechanism.
+        """
+        raise NotImplementedError(self.map)
+
+    def unmap(self, partition_id):
+        """Map a partitioned storage id back to a key prefix.
+
+        :param partition_id: The underlying partition id.
+        :return: As much of a key (or prefix) as is derivable from the parition
+            id.
+        """
+        raise NotImplementedError(self.unmap)
+

(typo) "paritioned" should be "partitioned".
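For my own understanding, a trivial mapper would look something like this
(purely illustrative, not from the patch)::

    class ExamplePrefixMapper(KeyMapper):
        """Partition by everything except the last key element."""

        def map(self, key):
            # ('file-id', 'revision-id') -> 'file-id'
            return '/'.join(key[:-1])

        def unmap(self, partition_id):
            # Only the prefix is recoverable; the last element is gone.
            return tuple(partition_id.split('/'))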

=== modified file 'bzrlib/weave_commands.py'
=== modified file 'bzrlib/workingtree.py'
=== modified file 'bzrlib/workingtree_4.py'
--- bzrlib/workingtree_4.py	2008-06-06 16:40:46 +0000
+++ bzrlib/workingtree_4.py	2008-06-11 07:22:00 +0000
@@ -1423,8 +1423,9 @@
     def annotate_iter(self, file_id,
                       default_revision=_mod_revision.CURRENT_REVISION):
         """See Tree.annotate_iter"""
-        w = self._get_weave(file_id)
-        return w.annotate(self.inventory[file_id].revision)
+        text_key = (file_id, self.inventory[file_id].revision)
+        annotations = self._repository.texts.annotate(text_key)
+        return [(key[-1], line) for key, line in annotations]

     def _get_ancestors(self, default_revision):
         return set(self._repository.get_ancestry(self._revision_id,

I'm a bit surprised this works in either version when annotating the current
revision from the working tree...  I guess that is done through a different
path.  At any rate the new code is no different.

As above, I'd rather put parentheses around both tuples in the comprehension.

@@ -1582,25 +1583,18 @@
             return parent_details[1]
         return None

-    def _get_weave(self, file_id):
-        return self._repository.weave_store.get_weave(file_id,
-                self._repository.get_transaction())
-
     def get_file(self, file_id, path=None):
         return StringIO(self.get_file_text(file_id))

     def get_file_lines(self, file_id):
-        entry = self._get_entry(file_id=file_id)[1]
-        if entry is None:
-            raise errors.NoSuchId(tree=self, file_id=file_id)
-        return self._get_weave(file_id).get_lines(entry[1][4])
+        return osutils.split_lines(self.get_file_text(file_id))

     def get_file_size(self, file_id):
         """See Tree.get_file_size"""
         return self.inventory[file_id].text_size

     def get_file_text(self, file_id):
-        return ''.join(self.get_file_lines(file_id))
+        return list(self.iter_files_bytes([(file_id, None)]))[0][1]

     def get_reference_revision(self, file_id, path=None):
         return self.inventory[file_id].reference_revision


