Rev 2795: Change versionedfile.add_lines again to include the key of the added text in the return value. in http://people.ubuntu.com/~robertc/baz2.0/knits

Wed Sep 5 00:41:10 BST 2007

At http://people.ubuntu.com/~robertc/baz2.0/knits

------------------------------------------------------------
revno: 2795
revision-id: robertc at robertcollins.net-20070904234057-s1v9l9q1w5jt25co
parent: pqm at pqm.ubuntu.com-20070904035759-iv4xl6d7ez69txba
parent: robertc at robertcollins.net-20070816083953-sbfb70vw6tmh3vak
committer: Robert Collins <robertc at robertcollins.net>
branch nick: knits
timestamp: Wed 2007-09-05 09:40:57 +1000
message:
  Change versionedfile.add_lines again to include the key of the added text in the return value.
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/fetch.py                fetch.py-20050818234941-26fea6105696365d
  bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
  bzrlib/tests/test_knit.py      test_knit.py-20051212171302-95d4c00dd5f11f2b
  bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
  bzrlib/versionedfile.py        versionedfile.py-20060222045106-5039c71ee3b65490
  bzrlib/weave.py                knit.py-20050627021749-759c29984154256b
    ------------------------------------------------------------
    revno: 2698.2.5
    revision-id: robertc at robertcollins.net-20070816083953-sbfb70vw6tmh3vak
    parent: robertc at robertcollins.net-20070816081927-rhroje8susrd3a40
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: knits
    timestamp: Thu 2007-08-16 18:39:53 +1000
    message:
      Decouple parsing and iterating the lines in knit records from getting the data, making it suitable for use in pack repositories.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 2698.2.4
    revision-id: robertc at robertcollins.net-20070816081927-rhroje8susrd3a40
    parent: robertc at robertcollins.net-20070816081549-dpowek5gwvox1x56
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: knits
    timestamp: Thu 2007-08-16 18:19:27 +1000
    message:
      Remove full history scan during iter_lines_added_or_present in KnitVersionedFile.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 2698.2.3
    revision-id: robertc at robertcollins.net-20070816081549-dpowek5gwvox1x56
    parent: robertc at robertcollins.net-20070816081414-ps82io1cs4cij6vz
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: knits
    timestamp: Thu 2007-08-16 18:15:49 +1000
    message:
      Remove an unneeded pre-check in KnitVersionedFile.iter_lines_added_or_present_in_version.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 2698.2.2
    revision-id: robertc at robertcollins.net-20070816081414-ps82io1cs4cij6vz
    parent: robertc at robertcollins.net-20070816080814-49b7t5gghdrhcx4d
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: knits
    timestamp: Thu 2007-08-16 18:14:14 +1000
    message:
      Use the low-level sorting facility in KnitVersionedFile.iter_records
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 2698.2.1
    revision-id: robertc at robertcollins.net-20070816080814-49b7t5gghdrhcx4d
    parent: pqm at pqm.ubuntu.com-20070814221506-6rw0b0oolfdeqrdw
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: knits
    timestamp: Thu 2007-08-16 18:08:14 +1000
    message:
      Add new get_raw_records_unsorted method on knit access, to allow low level sorting and optimisation when the upper layer does not need results in a particular order.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/tests/test_knit.py      test_knit.py-20051212171302-95d4c00dd5f11f2b
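
As a rough illustration of the get_raw_records_unsorted idea from revno 2698.2.1 above: the caller hands over memos in any order, the access layer sorts them into its preferred read order, and each result is paired with its memo so the caller can map it back. This is a minimal standalone sketch, not bzrlib code. SketchAccess and the (offset, length) memo shape are assumptions made for illustration only.

```python
class SketchAccess:
    """Hypothetical stand-in for a knit access object (not bzrlib code)."""

    def __init__(self, data):
        # One bytes blob standing in for the underlying record storage.
        self._data = data

    def get_raw_records(self, memos):
        """Yield the bytes for each (offset, length) memo, in the order given."""
        for offset, length in memos:
            yield self._data[offset:offset + length]

    def get_raw_records_unsorted(self, memos):
        """Yield (memo, bytes) pairs in sorted-memo order, not request order."""
        sorted_memos = sorted(memos)
        # Pair each memo with its bytes so callers can map results back.
        return zip(sorted_memos, self.get_raw_records(sorted_memos))


access = SketchAccess(b"12345678901234567")
memos = [(0, 10), (10, 2), (12, 5)]
forward = list(access.get_raw_records_unsorted(memos))
backward = list(access.get_raw_records_unsorted(reversed(memos)))
```

Because the results carry their memos, a caller that does not care about ordering can keep a memo -> version_id mapping and look each result up as it arrives.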
=== modified file 'NEWS'

--- a/NEWS	2007-09-04 01:20:26 +0000
+++ b/NEWS	2007-09-04 23:40:57 +0000
@@ -165,8 +165,8 @@
      ``Branch.set_last_revision_info`` instead.  (Martin Pool)
 
    * The ``add_lines`` methods on ``VersionedFile`` implementations has changed
-     its return value to include the sha1 and length of the inserted text. This
-     allows the avoidance of double-sha1 calculations during commit.
+     its return value to include the sha1, length and key for the inserted
+     text. This allows the avoidance of double-sha1 calculations during commit.
      (Robert Collins)
 
    * ``Transport.should_cache`` has been removed.  It was not called in the
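
A small sketch of what the NEWS entry above means for callers: add_lines now returns four values (sha1, text length, key, opaque parent representation), so call sites that unpacked three values need one more placeholder. The add_lines function below is a hypothetical stand-in written for this note, not the bzrlib implementation. It assumes the sha1 is taken over the concatenated lines, and it uses a plain tuple where the real code returns an implementation-specific value.

```python
import hashlib


def add_lines(version_id, parents, lines):
    """Hypothetical stand-in returning the four-element shape from this change."""
    text = "".join(lines).encode("utf-8")
    sha1 = hashlib.sha1(text).hexdigest()
    # Placeholder for the opaque parent-text value; real implementations
    # return their own representation here.
    opaque = tuple(lines)
    return sha1, len(text), version_id, opaque


parent_texts = {}
# The third element is the key for the inserted text; the fourth is the
# opaque value to pass back in parent_texts on later calls.
sha1, length, key, parent_texts["r0"] = add_lines("r0", [], ["a\n", "b\n"])
```

Callers that only want the opaque value can discard the first three elements with underscores, as the updated call sites in this diff do.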

=== modified file 'bzrlib/fetch.py'
--- a/bzrlib/fetch.py	2007-09-03 02:58:58 +0000
+++ b/bzrlib/fetch.py	2007-09-04 23:40:57 +0000
@@ -349,9 +349,9 @@
             root_id = tree.inventory.root.file_id
             parents = inventory_weave.get_parents(revision_id)
             if root_id not in versionedfile:
-                versionedfile[root_id] = to_store.get_weave_or_empty(root_id, 
+                versionedfile[root_id] = to_store.get_weave_or_empty(root_id,
                     self.target.get_transaction())
-            _, _, parent_texts[root_id] = versionedfile[root_id].add_lines(
+            _, _, _, parent_texts[root_id] = versionedfile[root_id].add_lines(
                 revision_id, parents, [], parent_texts)
 
     def regenerate_inventory(self, revs):

=== modified file 'bzrlib/knit.py'
--- a/bzrlib/knit.py	2007-09-03 21:19:07 +0000
+++ b/bzrlib/knit.py	2007-09-04 23:40:57 +0000
@@ -945,7 +945,7 @@
 
         access_memo = self._data.add_record(version_id, digest, store_lines)
         self._index.add_version(version_id, options, access_memo, parents)
-        return digest, text_length, lines
+        return digest, text_length, version_id, lines
 
     def check(self, progress_bar=None):
         """See VersionedFile.check()."""
@@ -1061,38 +1061,22 @@
             version_ids = self.versions()
         else:
             version_ids = [osutils.safe_revision_id(v) for v in version_ids]
-        if pb is None:
-            pb = progress.DummyProgress()
         # we don't care about inclusions, the caller cares.
         # but we need to setup a list of records to visit.
         # we need version_id, position, length
         version_id_records = []
         requested_versions = set(version_ids)
-        # filter for available versions
+        methods = {}
+        # create set of records to read:
         for version_id in requested_versions:
-            if not self.has_version(version_id):
-                raise RevisionNotPresent(version_id, self.filename)
-        # get a in-component-order queue:
-        for version_id in self.versions():
-            if version_id in requested_versions:
-                index_memo = self._index.get_position(version_id)
-                version_id_records.append((version_id, index_memo))
-
+            index_memo = self._index.get_position(version_id)
+            method  = self._index.get_method(version_id)
+            version_id_records.append((version_id, index_memo))
+            methods[version_id] = method
         total = len(version_id_records)
-        for version_idx, (version_id, data, sha_value) in \
-            enumerate(self._data.read_records_iter(version_id_records)):
-            pb.update('Walking content.', version_idx, total)
-            method = self._index.get_method(version_id)
-
-            assert method in ('fulltext', 'line-delta')
-            if method == 'fulltext':
-                line_iterator = self.factory.get_fulltext_content(data)
-            else:
-                line_iterator = self.factory.get_linedelta_content(data)
-            for line in line_iterator:
-                yield line
-
-        pb.update('Walking content.', total, total)
+        return self._data.iter_lines_added_or_present_in_records(
+            self._data.read_records_iter(version_id_records),
+            methods, self.factory, pb, total)
         
     def iter_parents(self, version_ids):
         """Iterate through the parents for many version ids.
@@ -1794,7 +1778,25 @@
         return set((version_id, ) for version_id in version_ids)
 
 
-class _KnitAccess(object):
+class _Access(object):
+    """Base class with common logic for accessing knit record data."""
+
+    def get_raw_records_unsorted(self, memos_for_retrieval):
+        """Get the raw bytes for many records with 'best' IO.
+
+        :param memos_for_retrieval: An iterable containing the memo's to 
+            use when retrieving the bytes. The Pack access method looks up the
+            pack to use for a given record in its index_to_pack map.
+        :return: An iterator over (memos, bytes) for all the requested
+            records.
+        """
+        # sort the memos
+        sorted_memos = sorted(memos_for_retrieval)
+        # delegate to the concrete class to get the now sorted records.
+        return izip(sorted_memos, self.get_raw_records(sorted_memos))
+
+
+class _KnitAccess(_Access):
     """Access to knit records in a .knit file."""
 
     def __init__(self, transport, filename, _file_mode, _dir_mode,
@@ -1874,7 +1876,7 @@
             yield data
 
 
-class _PackAccess(object):
+class _PackAccess(_Access):
     """Access to knit records via a collection of packs."""
 
     def __init__(self, index_to_packs, writer=None):
@@ -1995,6 +1997,36 @@
     def _open_file(self):
         return self._access.open_file()
 
+    def iter_lines_added_or_present_in_records(self, record_iterator, methods, 
+        record_parser, pb=None, record_count=0):
+        """Read, parse and yield the contents of records as lines.
+
+        :param record_iterator: An iterable of version_id, data, sha_value for
+            the records to process.
+        :param methods: A dict of version_id -> method.
+        :param record_parser: A knit record parser which can parse each
+            record.
+        :param pb: A progress bar, or None.
+        :param record_count: A total for the progress bar if one is supplied.
+        :return: An iterator over all the lines in no particular order.
+        """
+        if pb is None:
+            pb = progress.DummyProgress()
+        for version_idx, (version_id, data, sha_value) in \
+            enumerate(record_iterator):
+            pb.update('Walking content.', version_idx, record_count)
+            method = methods[version_id]
+
+            assert method in ('fulltext', 'line-delta')
+            if method == 'fulltext':
+                line_iterator = record_parser.get_fulltext_content(data)
+            else:
+                line_iterator = record_parser.get_linedelta_content(data)
+            for line in line_iterator:
+                yield line
+
+        pb.update('Walking content.', record_count, record_count)
+
     def _record_to_data(self, version_id, digest, lines):
         """Convert version_id, digest, lines into a raw data block.
         
@@ -2136,6 +2168,7 @@
         The result will be returned in whatever is the fastest to read.
         Not by the order requested. Also, multiple requests for the same
         record will only yield 1 response.
+
         :param records: A list of (version_id, pos, len) entries
         :return: Yields (version_id, contents, digest) in the order
                  read, not the order requested
@@ -2157,20 +2190,25 @@
                     yield (record[0], content, digest)
                 else:
                     needed_records.add(record)
-            needed_records = sorted(needed_records, key=operator.itemgetter(1))
         else:
-            needed_records = sorted(set(records), key=operator.itemgetter(1))
+            needed_records = records
 
         if not needed_records:
             return
 
+        # let the access object optimise lookups, so setup a mapping back to
+        # version_ids.
+        needed_memos = {}
+        for version_id, index_memo in needed_records:
+            needed_memos[index_memo] = version_id
+
         # The transport optimizes the fetching as well 
         # (ie, reads continuous ranges.)
-        raw_data = self._access.get_raw_records(
-            [index_memo for version_id, index_memo in needed_records])
+        raw_results = self._access.get_raw_records_unsorted(
+            needed_memos.iterkeys())
 
-        for (version_id, index_memo), data in \
-                izip(iter(needed_records), raw_data):
+        for index_memo, data in raw_results:
+            version_id = needed_memos[index_memo]
             content, digest = self._parse_record(version_id, data)
             if self._do_cache:
                 self._cache[version_id] = data

=== modified file 'bzrlib/tests/test_knit.py'
--- a/bzrlib/tests/test_knit.py	2007-08-30 08:27:29 +0000
+++ b/bzrlib/tests/test_knit.py	2007-09-04 23:40:57 +0000
@@ -188,6 +188,16 @@
         access.create()
         self.assertAccessExists(access)
 
+    def test_get_raw_records_unsorted(self):
+        """get_raw_records_unsorted returns in best-read order."""
+        access = self.get_access()
+        memos = access.add_raw_records([10, 2, 5], '12345678901234567')
+        expected_result = zip(memos, ['1234567890', '12', '34567'])
+        self.assertEqual(expected_result,
+            list(access.get_raw_records_unsorted(memos)))
+        self.assertEqual(expected_result,
+            list(access.get_raw_records_unsorted(reversed(memos))))
+
     def test_open_file(self):
         """open_file never errors."""
         access = self.get_access()

=== modified file 'bzrlib/tests/test_versionedfile.py'
--- a/bzrlib/tests/test_versionedfile.py	2007-09-03 21:19:07 +0000
+++ b/bzrlib/tests/test_versionedfile.py	2007-09-04 23:40:57 +0000
@@ -84,13 +84,13 @@
     def test_adds_with_parent_texts(self):
         f = self.get_file()
         parent_texts = {}
-        _, _, parent_texts['r0'] = f.add_lines('r0', [], ['a\n', 'b\n'])
+        _, _, _, parent_texts['r0'] = f.add_lines('r0', [], ['a\n', 'b\n'])
         try:
-            _, _, parent_texts['r1'] = f.add_lines_with_ghosts('r1',
+            _, _, _, parent_texts['r1'] = f.add_lines_with_ghosts('r1',
                 ['r0', 'ghost'], ['b\n', 'c\n'], parent_texts=parent_texts)
         except NotImplementedError:
             # if the format doesn't support ghosts, just add normally.
-            _, _, parent_texts['r1'] = f.add_lines('r1',
+            _, _, _, parent_texts['r1'] = f.add_lines('r1',
                 ['r0'], ['b\n', 'c\n'], parent_texts=parent_texts)
         f.add_lines('r2', ['r1'], ['c\n', 'd\n'], parent_texts=parent_texts)
         self.assertNotEqual(None, parent_texts['r0'])
@@ -168,7 +168,8 @@
             vf.add_delta, 'a:', [], None, 'sha1', False, ((0, 0, 0, []),))
 
     def test_add_lines_return_value(self):
-        # add_lines should return the sha1 and the text size.
+        # add_lines should return:
+        # the sha1, text size, key, opaque-parent-representation
         vf = self.get_file()
         empty_text = ('a', [])
         sample_text_nl = ('b', ["foo\n", "bar\n"])
@@ -178,14 +179,18 @@
             # the first two elements are the same for all versioned files:
             # - the digest and the size of the text. For some versioned files
             #   additional data is returned in additional tuple elements.
+            # then comes the key. Currently all VF implementations use user
+            # supplied keys, so we just cross-reference back to version.
             result = vf.add_lines(version, [], lines)
-            self.assertEqual(3, len(result))
-            self.assertEqual((osutils.sha_strings(lines), sum(map(len, lines))),
-                result[0:2])
+            self.assertEqual(4, len(result))
+            self.assertEqual(
+                (osutils.sha_strings(lines), sum(map(len, lines)), version),
+                result[0:3])
         # parents should not affect the result:
         lines = sample_text_nl[1]
-        self.assertEqual((osutils.sha_strings(lines), sum(map(len, lines))),
-            vf.add_lines('d', ['b', 'c'], lines)[0:2])
+        self.assertEqual(
+            (osutils.sha_strings(lines), sum(map(len, lines)), 'd'),
+            vf.add_lines('d', ['b', 'c'], lines)[0:3])
 
     def test_get_reserved(self):
         vf = self.get_file()

=== modified file 'bzrlib/versionedfile.py'
--- a/bzrlib/versionedfile.py	2007-09-03 20:23:05 +0000
+++ b/bzrlib/versionedfile.py	2007-09-04 23:40:57 +0000
@@ -130,9 +130,10 @@
         :param left_matching_blocks: a hint about which areas are common
             between the text and its left-hand-parent.  The format is
             the SequenceMatcher.get_matching_blocks format.
-        :return: The text sha1, the number of bytes in the text, and an opaque
-                 representation of the inserted version which can be provided
-                 back to future add_lines calls in the parent_texts dictionary.
+        :return: The text sha1, the number of bytes in the text, the key to
+            obtain the lines back again and an opaque representation of the
+            inserted version which can be provided back to future add_lines
+            calls in the parent_texts dictionary.
         """
         version_id = osutils.safe_revision_id(version_id)
         parents = [osutils.safe_revision_id(v) for v in parents]
@@ -318,7 +319,7 @@
                     mpvf.get_diff(parent_ids[0]).num_lines()))
             else:
                 left_matching_blocks = None
-            _, _, version_text = self.add_lines(version, parent_ids, lines,
+            _, _, _, version_text = self.add_lines(version, parent_ids, lines,
                 vf_parents, left_matching_blocks=left_matching_blocks)
             vf_parents[version] = version_text
         for (version, parent_ids, expected_sha1, mpdiff), sha1 in\
@@ -687,7 +688,7 @@
             # deltas = self.source.get_deltas(order)
             for index, version in enumerate(order):
                 pb.update('Converting versioned data', index, len(order))
-                _, _, parent_text = target.add_lines(version,
+                _, _, _, parent_text = target.add_lines(version,
                                                self.source.get_parents(version),
                                                self.source.get_lines(version),
                                                parent_texts=parent_texts)

=== modified file 'bzrlib/weave.py'
--- a/bzrlib/weave.py	2007-09-03 02:58:58 +0000
+++ b/bzrlib/weave.py	2007-09-04 23:40:57 +0000
@@ -420,7 +420,7 @@
                    left_matching_blocks=None):
         """See VersionedFile.add_lines."""
         idx = self._add(version_id, lines, map(self._lookup, parents))
-        return sha_strings(lines), sum(map(len, lines)), idx
+        return sha_strings(lines), sum(map(len, lines)), version_id, idx
 
     def _add(self, version_id, lines, parents, sha1=None):
         """Add a single text on top of the weave.