Rev 4051: (jam) Batch up requests for fulltexts into 5MB (compressed) requests, in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Wed Feb 25 22:00:28 GMT 2009


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 4051
revision-id: pqm at pqm.ubuntu.com-20090225220024-b81h6glz8zi2ekfh
parent: pqm at pqm.ubuntu.com-20090225171156-l63eiz2bz51ialsg
parent: john at arbash-meinel.com-20090225211614-ah3tpucd27g92lxj
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Wed 2009-02-25 22:00:24 +0000
message:
  (jam) Batch up requests for fulltexts into 5MB (compressed) requests,
  	rather than 1 per file.
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
  bzrlib/tests/test_knit.py      test_knit.py-20051212171302-95d4c00dd5f11f2b
    ------------------------------------------------------------
    revno: 4039.3.8
    revision-id: john at arbash-meinel.com-20090225211614-ah3tpucd27g92lxj
    parent: john at arbash-meinel.com-20090225211322-qc94czk3s1g7nliq
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Wed 2009-02-25 15:16:14 -0600
    message:
      NEWS entry about batching up requests.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
    ------------------------------------------------------------
    revno: 4039.3.7
    revision-id: john at arbash-meinel.com-20090225211322-qc94czk3s1g7nliq
    parent: john at arbash-meinel.com-20090225202304-j52lrdrx8aw101uh
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Wed 2009-02-25 15:13:22 -0600
    message:
      Some direct tests for _group_keys_for_io
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/tests/test_knit.py      test_knit.py-20051212171302-95d4c00dd5f11f2b
    ------------------------------------------------------------
    revno: 4039.3.6
    revision-id: john at arbash-meinel.com-20090225202304-j52lrdrx8aw101uh
    parent: john at arbash-meinel.com-20090225201213-fwl3dlpkraaogv7l
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Wed 2009-02-25 14:23:04 -0600
    message:
      Turn _split_by_prefix into a classmethod, and add direct tests.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/tests/test_knit.py      test_knit.py-20051212171302-95d4c00dd5f11f2b
    ------------------------------------------------------------
    revno: 4039.3.5
    revision-id: john at arbash-meinel.com-20090225201213-fwl3dlpkraaogv7l
    parent: john at arbash-meinel.com-20090224230635-cd22mdoebjoggynp
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Wed 2009-02-25 14:12:13 -0600
    message:
      Add direct tests for _get_total_build_size.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/tests/test_knit.py      test_knit.py-20051212171302-95d4c00dd5f11f2b
    ------------------------------------------------------------
    revno: 4039.3.4
    revision-id: john at arbash-meinel.com-20090224230635-cd22mdoebjoggynp
    parent: john at arbash-meinel.com-20090224221303-bvyeg22ag3e3b377
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Tue 2009-02-24 17:06:35 -0600
    message:
      Properly determine the total number of bytes needed for a given key.
      We need to evaluate the whole build chain, in order to get the final build info.
      The earlier code only evaluated the final node of the chain, which gave
      unrealistically low sizes.
      With this, bzr.dev gets split into 2 requests, which is reasonable.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 4039.3.3
    revision-id: john at arbash-meinel.com-20090224221303-bvyeg22ag3e3b377
    parent: john at arbash-meinel.com-20090224215405-lelreyxl992nxxr9
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Tue 2009-02-24 16:13:03 -0600
    message:
      Add some debugging code.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 4039.3.2
    revision-id: john at arbash-meinel.com-20090224215405-lelreyxl992nxxr9
    parent: john at arbash-meinel.com-20090224202307-qxv0wgz3cikobn0c
    parent: john at arbash-meinel.com-20090128190831-xqmvp0hfiimyjsb5
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Tue 2009-02-24 15:54:05 -0600
    message:
      Batch get_record_stream(fulltexts) into 5MB requests.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
        ------------------------------------------------------------
        revno: 3966.3.1
        revision-id: john at arbash-meinel.com-20090128190831-xqmvp0hfiimyjsb5
        parent: pqm at pqm.ubuntu.com-20090128105451-ih99x6vzhtmfysjf
        committer: John Arbash Meinel <john at arbash-meinel.com>
        branch nick: knit_fill
        timestamp: Wed 2009-01-28 13:08:31 -0600
        message:
          Quick attempt at re-combining the knit requests after splitting them by file-id.
        modified:
          bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
    ------------------------------------------------------------
    revno: 4039.3.1
    revision-id: john at arbash-meinel.com-20090224202307-qxv0wgz3cikobn0c
    parent: pqm at pqm.ubuntu.com-20090224103149-b9a60tx1qy68jtcj
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: sort_knit_fetch
    timestamp: Tue 2009-02-24 14:23:07 -0600
    message:
      Group records to read by pack file and sort by offset.
      
      This assumes that sort(index_memo) gives the right ordering, but for all 
      formats so far, this holds true.
      
      This makes a *big* difference when doing 'bzr branch --stacked' or
      'bzr checkout --lightweight'.
      
      If you don't group the requests, then the next level splits them up
      into many readv() requests, because it splits them by pack/knit file.
    modified:
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
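
The grouping described in revno 4039.3.1 boils down to a single sort on the
(key, index_memo) pairs. A minimal standalone sketch of why that works, using
hypothetical index memos of the (file, offset, length) shape seen in the tests
below (illustrative tuples, not bzrlib's real memo objects):

    import operator

    # Hypothetical (key, index_memo) records; index_memo here is a
    # (pack-name, offset, length) tuple standing in for the opaque
    # memos the knit index returns.
    records = [
        (('f', 'c'), ('pack-1', 180, 15)),
        (('g', 'a'), ('pack-2', 121, 35)),
        (('f', 'a'), ('pack-1', 0, 100)),
        (('g', 'b'), ('pack-2', 156, 12)),
    ]

    # Sorting on the memo clusters reads from the same pack file and
    # orders them by ascending offset, so the readv() layer underneath
    # can coalesce adjacent ranges instead of seeking once per record.
    records.sort(key=operator.itemgetter(1))
    # -> pack-1 @ 0, pack-1 @ 180, pack-2 @ 121, pack-2 @ 156
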
=== modified file 'NEWS'
--- a/NEWS	2009-02-25 14:36:59 +0000
+++ b/NEWS	2009-02-25 22:00:24 +0000
@@ -46,6 +46,11 @@
       stream if the server is version 1.13 or greater, reducing roundtrips
       significantly. (Andrew Bennetts, Robert Collins)
 
+    * Lightweight checkouts and stacked branches should both be much
+      faster over remote connections. Building the working tree now
+      batches fetches into approximately 5MB requests, rather than
+      issuing a separate request for each file. (John Arbash Meinel)
+
     * Support for GSSAPI authentication when using HTTP or HTTPS. 
       (Jelmer Vernooij)
 

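The batching the NEWS entry describes is implemented by _group_keys_for_io in
the knit.py hunk below. A simplified standalone sketch of the same greedy
strategy, with size_of standing in for _get_total_build_size (the names here
are illustrative, not bzrlib API):

    _MIN_BUFFER_SIZE = 5 * 1024 * 1024  # mirrors _STREAM_MIN_BUFFER_SIZE below

    def group_requests(prefix_order, size_of, min_buffer=_MIN_BUFFER_SIZE):
        """Greedily batch per-file-id groups until ~min_buffer bytes."""
        groups = []
        cur, cur_size = [], 0
        for prefix in prefix_order:
            # Keys for one file-id always stay in the same batch.
            cur.append(prefix)
            cur_size += size_of(prefix)
            if cur_size > min_buffer:
                groups.append(cur)
                cur, cur_size = [], 0
        if cur:
            groups.append(cur)
        return groups

Note that the threshold acts as a floor, not a cap: a batch is flushed only
after its accumulated build size exceeds 5MB, trading slight overshoot for
fewer round trips.
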
=== modified file 'bzrlib/knit.py'
--- a/bzrlib/knit.py	2009-02-23 15:29:35 +0000
+++ b/bzrlib/knit.py	2009-02-25 21:13:22 +0000
@@ -123,6 +123,7 @@
 
 DATA_SUFFIX = '.knit'
 INDEX_SUFFIX = '.kndx'
+_STREAM_MIN_BUFFER_SIZE = 5*1024*1024
 
 
 class KnitAdapter(object):
@@ -806,6 +807,35 @@
     versioned_files.writer.end()
 
 
+def _get_total_build_size(self, keys, positions):
+    """Determine the total bytes to build these keys.
+
+    (helper function because _KnitGraphIndex and _KndxIndex work the same, but
+    don't inherit from a common base.)
+
+    :param keys: Keys that we want to build
+    :param positions: dict of {key: (info, index_memo, comp_parent)} (such
+        as returned by _get_components_positions)
+    :return: Number of bytes to build those keys
+    """
+    all_build_index_memos = {}
+    build_keys = keys
+    while build_keys:
+        next_keys = set()
+        for key in build_keys:
+            # This is mostly for the 'stacked' case
+            # Where we will be getting the data from a fallback
+            if key not in positions:
+                continue
+            _, index_memo, compression_parent = positions[key]
+            all_build_index_memos[key] = index_memo
+            if compression_parent not in all_build_index_memos:
+                next_keys.add(compression_parent)
+        build_keys = next_keys
+    return sum([index_memo[2] for index_memo
+                in all_build_index_memos.itervalues()])
+
+
 class KnitVersionedFiles(VersionedFiles):
     """Storage for many versioned files using knit compression.
 
@@ -1181,6 +1211,9 @@
                 # n = next
                 records = [(key, i_m) for key, (r, i_m, n)
                                        in position_map.iteritems()]
+                # Sort by the index memo, so that we request records from the
+                # same pack file together, and in forward-sorted order
+                records.sort(key=operator.itemgetter(1))
                 raw_record_map = {}
                 for key, data in self._read_records_iter_unchecked(records):
                     (record_details, index_memo, next) = position_map[key]
@@ -1189,7 +1222,8 @@
             except errors.RetryWithNewPacks, e:
                 self._access.reload_or_raise(e)
 
-    def _split_by_prefix(self, keys):
+    @classmethod
+    def _split_by_prefix(cls, keys):
         """For the given keys, split them up based on their prefix.
 
         To keep memory pressure somewhat under control, split the
@@ -1198,16 +1232,75 @@
         This should be revisited if _get_content_maps() can ever cross
         file-id boundaries.
 
+        The keys for a given file_id are kept in the same relative order.
+        Ordering between file_ids is not, though prefix_order will return the
+        order that the key was first seen.
+
         :param keys: An iterable of key tuples
-        :return: A dict of {prefix: [key_list]}
+        :return: (split_map, prefix_order)
+            split_map       A dictionary mapping prefix => keys
+            prefix_order    The order that we saw the various prefixes
         """
         split_by_prefix = {}
+        prefix_order = []
         for key in keys:
             if len(key) == 1:
-                split_by_prefix.setdefault('', []).append(key)
-            else:
-                split_by_prefix.setdefault(key[0], []).append(key)
-        return split_by_prefix
+                prefix = ''
+            else:
+                prefix = key[0]
+
+            if prefix in split_by_prefix:
+                split_by_prefix[prefix].append(key)
+            else:
+                split_by_prefix[prefix] = [key]
+                prefix_order.append(prefix)
+        return split_by_prefix, prefix_order
+
+    def _group_keys_for_io(self, keys, non_local_keys, positions,
+                           _min_buffer_size=_STREAM_MIN_BUFFER_SIZE):
+        """For the given keys, group them into 'best-sized' requests.
+
+        The idea is to avoid making 1 request per file, but to never try to
+        unpack an entire 1.5GB source tree in a single pass. Also when
+        possible, we should try to group requests to the same pack file
+        together.
+
+        :return: list of (keys, non_local) tuples that indicate what keys
+            should be fetched next.
+        """
+        # TODO: Ideally we would group on 2 factors. We want to extract texts
+        #       from the same pack file together, and we want to extract all
+        #       the texts for a given build-chain together. Ultimately it
+        #       probably needs a better global view.
+        total_keys = len(keys)
+        prefix_split_keys, prefix_order = self._split_by_prefix(keys)
+        prefix_split_non_local_keys, _ = self._split_by_prefix(non_local_keys)
+        cur_keys = []
+        cur_non_local = set()
+        cur_size = 0
+        result = []
+        sizes = []
+        for prefix in prefix_order:
+            keys = prefix_split_keys[prefix]
+            non_local = prefix_split_non_local_keys.get(prefix, [])
+
+            this_size = self._index._get_total_build_size(keys, positions)
+            cur_size += this_size
+            cur_keys.extend(keys)
+            cur_non_local.update(non_local)
+            if cur_size > _min_buffer_size:
+                result.append((cur_keys, cur_non_local))
+                sizes.append(cur_size)
+                cur_keys = []
+                cur_non_local = set()
+                cur_size = 0
+        if cur_keys:
+            result.append((cur_keys, cur_non_local))
+            sizes.append(cur_size)
+        trace.mutter('Collapsed %d keys into %d requests w/ %d file_ids'
+                     ' w/ sizes: %s', total_keys, len(result),
+                     len(prefix_split_keys), sizes)
+        return result
 
     def get_record_stream(self, keys, ordering, include_delta_closure):
         """Get a stream of records for keys.
@@ -1334,13 +1427,11 @@
             # XXX: get_content_maps performs its own index queries; allow state
             # to be passed in.
             non_local_keys = needed_from_fallback - absent_keys
-            prefix_split_keys = self._split_by_prefix(present_keys)
-            prefix_split_non_local_keys = self._split_by_prefix(non_local_keys)
-            for prefix, keys in prefix_split_keys.iteritems():
-                non_local = prefix_split_non_local_keys.get(prefix, [])
-                non_local = set(non_local)
-                generator = _VFContentMapGenerator(self, keys, non_local,
-                    global_map)
+            for keys, non_local_keys in self._group_keys_for_io(present_keys,
+                                                                non_local_keys,
+                                                                positions):
+                generator = _VFContentMapGenerator(self, keys, non_local_keys,
+                                                   global_map)
                 for record in generator.get_record_stream():
                     yield record
         else:
@@ -2569,6 +2660,8 @@
             return index_memo[0][:-1], index_memo[1]
         return keys.sort(key=get_sort_key)
 
+    _get_total_build_size = _get_total_build_size
+
     def _split_key(self, key):
         """Split key into a prefix and suffix."""
         return key[:-1], key[-1]
@@ -2890,6 +2983,8 @@
             return positions[key][1]
         return keys.sort(key=get_index_memo)
 
+    _get_total_build_size = _get_total_build_size
+
 
 class _KnitKeyAccess(object):
     """Access to records in .knit files."""

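To make the chain-walking in _get_total_build_size (first hunk above)
concrete, here is a standalone rendering of the same logic in Python 3
spelling (.values() rather than .itervalues()), exercised against the
positions dict from the new test below:

    def total_build_size(keys, positions):
        # Collect the index memo of every record in each build chain
        # exactly once, then sum the record lengths (index_memo[2]).
        memos = {}
        pending = set(keys)
        while pending:
            next_keys = set()
            for key in pending:
                if key not in positions:  # e.g. satisfied by a fallback
                    continue
                _, index_memo, parent = positions[key]
                memos[key] = index_memo
                if parent is not None and parent not in memos:
                    next_keys.add(parent)
            pending = next_keys
        return sum(memo[2] for memo in memos.values())

    positions = {
        ('a',): (('fulltext', False), (('a',), 0, 100), None),
        ('b',): (('line-delta', False), (('b',), 100, 21), ('a',)),
        ('c',): (('line-delta', False), (('c',), 121, 35), ('b',)),
        ('d',): (('line-delta', False), (('d',), 156, 12), ('b',)),
    }
    print(total_build_size([('c',)], positions))          # 156 = 100 + 21 + 35
    print(total_build_size([('b',), ('c',)], positions))  # still 156; shared ancestors counted once
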
=== modified file 'bzrlib/tests/test_knit.py'
--- a/bzrlib/tests/test_knit.py	2009-02-23 15:42:47 +0000
+++ b/bzrlib/tests/test_knit.py	2009-02-25 21:13:22 +0000
@@ -1093,6 +1093,26 @@
             call[1][1].getvalue())
         self.assertEqual({'create_parent_dir': True}, call[2])
 
+    def assertTotalBuildSize(self, size, keys, positions):
+        self.assertEqual(size,
+                         knit._get_total_build_size(None, keys, positions))
+
+    def test__get_total_build_size(self):
+        positions = {
+            ('a',): (('fulltext', False), (('a',), 0, 100), None),
+            ('b',): (('line-delta', False), (('b',), 100, 21), ('a',)),
+            ('c',): (('line-delta', False), (('c',), 121, 35), ('b',)),
+            ('d',): (('line-delta', False), (('d',), 156, 12), ('b',)),
+            }
+        self.assertTotalBuildSize(100, [('a',)], positions)
+        self.assertTotalBuildSize(121, [('b',)], positions)
+        # c needs both a & b
+        self.assertTotalBuildSize(156, [('c',)], positions)
+        # we shouldn't count 'b' twice
+        self.assertTotalBuildSize(156, [('b',), ('c',)], positions)
+        self.assertTotalBuildSize(133, [('d',)], positions)
+        self.assertTotalBuildSize(168, [('c',), ('d',)], positions)
+
     def test_get_position(self):
         transport = MockTransport([
             _KndxIndex.HEADER,
@@ -1867,6 +1887,91 @@
         self.assertEqual([], self.caught_entries)
 
 
+class TestKnitVersionedFiles(KnitTests):
+
+    def assertGroupKeysForIo(self, exp_groups, keys, non_local_keys,
+                             positions, _min_buffer_size=None):
+        kvf = self.make_test_knit()
+        if _min_buffer_size is None:
+            _min_buffer_size = knit._STREAM_MIN_BUFFER_SIZE
+        self.assertEqual(exp_groups, kvf._group_keys_for_io(keys,
+                                        non_local_keys, positions,
+                                        _min_buffer_size=_min_buffer_size))
+
+    def assertSplitByPrefix(self, expected_map, expected_prefix_order,
+                            keys):
+        split, prefix_order = KnitVersionedFiles._split_by_prefix(keys)
+        self.assertEqual(expected_map, split)
+        self.assertEqual(expected_prefix_order, prefix_order)
+
+    def test__group_keys_for_io(self):
+        ft_detail = ('fulltext', False)
+        ld_detail = ('line-delta', False)
+        f_a = ('f', 'a')
+        f_b = ('f', 'b')
+        f_c = ('f', 'c')
+        g_a = ('g', 'a')
+        g_b = ('g', 'b')
+        g_c = ('g', 'c')
+        positions = {
+            f_a: (ft_detail, (f_a, 0, 100), None),
+            f_b: (ld_detail, (f_b, 100, 21), f_a),
+            f_c: (ld_detail, (f_c, 180, 15), f_b),
+            g_a: (ft_detail, (g_a, 121, 35), None),
+            g_b: (ld_detail, (g_b, 156, 12), g_a),
+            g_c: (ld_detail, (g_c, 195, 13), g_a),
+            }
+        self.assertGroupKeysForIo([([f_a], set())],
+                                  [f_a], [], positions)
+        self.assertGroupKeysForIo([([f_a], set([f_a]))],
+                                  [f_a], [f_a], positions)
+        self.assertGroupKeysForIo([([f_a, f_b], set([]))],
+                                  [f_a, f_b], [], positions)
+        self.assertGroupKeysForIo([([f_a, f_b], set([f_b]))],
+                                  [f_a, f_b], [f_b], positions)
+        self.assertGroupKeysForIo([([f_a, f_b, g_a, g_b], set())],
+                                  [f_a, g_a, f_b, g_b], [], positions)
+        self.assertGroupKeysForIo([([f_a, f_b, g_a, g_b], set())],
+                                  [f_a, g_a, f_b, g_b], [], positions,
+                                  _min_buffer_size=150)
+        self.assertGroupKeysForIo([([f_a, f_b], set()), ([g_a, g_b], set())],
+                                  [f_a, g_a, f_b, g_b], [], positions,
+                                  _min_buffer_size=100)
+        self.assertGroupKeysForIo([([f_c], set()), ([g_b], set())],
+                                  [f_c, g_b], [], positions,
+                                  _min_buffer_size=125)
+        self.assertGroupKeysForIo([([g_b, f_c], set())],
+                                  [g_b, f_c], [], positions,
+                                  _min_buffer_size=125)
+
+    def test__split_by_prefix(self):
+        self.assertSplitByPrefix({'f': [('f', 'a'), ('f', 'b')],
+                                  'g': [('g', 'b'), ('g', 'a')],
+                                 }, ['f', 'g'],
+                                 [('f', 'a'), ('g', 'b'),
+                                  ('g', 'a'), ('f', 'b')])
+
+        self.assertSplitByPrefix({'f': [('f', 'a'), ('f', 'b')],
+                                  'g': [('g', 'b'), ('g', 'a')],
+                                 }, ['f', 'g'],
+                                 [('f', 'a'), ('f', 'b'),
+                                  ('g', 'b'), ('g', 'a')])
+
+        self.assertSplitByPrefix({'f': [('f', 'a'), ('f', 'b')],
+                                  'g': [('g', 'b'), ('g', 'a')],
+                                 }, ['f', 'g'],
+                                 [('f', 'a'), ('f', 'b'),
+                                  ('g', 'b'), ('g', 'a')])
+
+        self.assertSplitByPrefix({'f': [('f', 'a'), ('f', 'b')],
+                                  'g': [('g', 'b'), ('g', 'a')],
+                                  '': [('a',), ('b',)]
+                                 }, ['f', 'g', ''],
+                                 [('f', 'a'), ('g', 'b'),
+                                  ('a',), ('b',),
+                                  ('g', 'a'), ('f', 'b')])
+
+
 class TestStacking(KnitTests):
 
     def get_basis_and_test_knit(self):

More information about the bazaar-commits mailing list