Rev 3688: (jam) Streamline BTreeBuilder.add_node et al to make btree creation faster in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Wed Sep 3 23:31:06 BST 2008


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 3688
revision-id: pqm at pqm.ubuntu.com-20080903223056-b108iytb38xkznci
parent: pqm at pqm.ubuntu.com-20080903220002-hit46ycge6gqsr8l
parent: john at arbash-meinel.com-20080903213644-icayfa0cn3hq3skv
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Wed 2008-09-03 23:30:56 +0100
message:
  (jam) Streamline BTreeBuilder.add_node et al to make btree creation
  faster.
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
  bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
  bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
    ------------------------------------------------------------
    revno: 3644.2.13
    revision-id: john at arbash-meinel.com-20080903213644-icayfa0cn3hq3skv
    parent: john at arbash-meinel.com-20080828201920-fcrvnykg1jd4rwnx
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Wed 2008-09-03 16:36:44 -0500
    message:
      NEWS entry about the performance improvements.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
    ------------------------------------------------------------
    revno: 3644.2.12
    revision-id: john at arbash-meinel.com-20080828201920-fcrvnykg1jd4rwnx
    parent: john at arbash-meinel.com-20080828201644-itfv9mmesr50lncd
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Thu 2008-08-28 15:19:20 -0500
    message:
      Remove an incorrect comment.
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
    ------------------------------------------------------------
    revno: 3644.2.11
    revision-id: john at arbash-meinel.com-20080828201644-itfv9mmesr50lncd
    parent: john at arbash-meinel.com-20080828201331-dqffxf54l2heokll
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Thu 2008-08-28 15:16:44 -0500
    message:
      Document the new form of _nodes and remove an unnecessary cast.
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
    ------------------------------------------------------------
    revno: 3644.2.10
    revision-id: john at arbash-meinel.com-20080828201331-dqffxf54l2heokll
    parent: john at arbash-meinel.com-20080828200552-sw5lzds9mmi3qnnb
    parent: pqm at pqm.ubuntu.com-20080828171745-xdrmccm17muk77y0
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Thu 2008-08-28 15:13:31 -0500
    message:
      Merge bzr.dev 3658
    added:
      bzrlib/transport/ftp/          ftp-20080611185801-3vm145h8dmnfgh25-1
      bzrlib/transport/ftp/_gssapi.py _gssapi.py-20080611190840-7ejrtp884bk5eu72-2
    renamed:
      bzrlib/transport/ftp.py => bzrlib/transport/ftp/__init__.py ftp.py-20051116161804-58dc9506548c2a53
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/_patiencediff_c.c       _patiencediff_c.c-20070721205602-q3imkipwlgagp3cy-1
      bzrlib/branch.py               branch.py-20050309040759-e4baf4e0d046576e
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
      bzrlib/builtins.py             builtins.py-20050830033751-fc01482b9ca23183
      bzrlib/chunk_writer.py         chunk_writer.py-20080630234519-6ggn4id17nipovny-1
      bzrlib/config.py               config.py-20051011043216-070c74f4e9e338e8
      bzrlib/log.py                  log.py-20050505065812-c40ce11702fe5fb1
      bzrlib/mail_client.py          mail_client.py-20070809192806-vuxt3t19srtpjpdn-1
      bzrlib/merge.py                merge.py-20050513021216-953b65a438527106
      bzrlib/plugin.py               plugin.py-20050622060424-829b654519533d69
      bzrlib/tests/blackbox/test_merge.py test_merge.py-20060323225809-9bc0459c19917f41
      bzrlib/tests/blackbox/test_missing.py test_missing.py-20051211212735-a2cf4c1840bb84c4
      bzrlib/tests/blackbox/test_non_ascii.py test_non_ascii.py-20060105214030-68010be784a5d854
      bzrlib/tests/blackbox/test_push.py test_push.py-20060329002750-929af230d5d22663
      bzrlib/tests/blackbox/test_send.py test_bundle.py-20060616222707-c21c8b7ea5ef57b1
      bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
      bzrlib/tests/test_chunk_writer.py test_chunk_writer.py-20080630234519-6ggn4id17nipovny-2
      bzrlib/tests/test_diff.py      testdiff.py-20050727164403-d1a3496ebb12e339
      bzrlib/tests/test_log.py       testlog.py-20050728115707-1a514809d7d49309
      bzrlib/tests/test_merge.py     testmerge.py-20050905070950-c1b5aa49ff911024
      bzrlib/tests/test_pack_repository.py test_pack_repository-20080801043947-eaw0e6h2gu75kwmy-1
      bzrlib/transport/__init__.py   transport.py-20050711165921-4978aa7ce1285ad5
      bzrlib/transport/http/_pycurl.py pycurlhttp.py-20060110060940-4e2a705911af77a6
      bzrlib/workingtree.py          workingtree.py-20050511021032-29b6ec0a681e02e3
      doc/developers/ppa.txt         ppa.txt-20080722055539-606u7t2z32t3ae4w-1
      doc/en/mini-tutorial/index.txt index.txt-20070813141352-2u64ooqzo0or4hss-2
      doc/es/mini-tutorial/index.txt index.txt-20080504182136-wmoc35u2t6kom8ca-1
    ------------------------------------------------------------
    revno: 3644.2.9
    revision-id: john at arbash-meinel.com-20080828200552-sw5lzds9mmi3qnnb
    parent: john at arbash-meinel.com-20080825190216-vdkyinz5p5e5s8kq
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Thu 2008-08-28 15:05:52 -0500
    message:
      Refactor some code.
      Move the key, reference, and value checking into a helper function.
      This function also finds absent references, but that should add
      minimal overhead either way.
      Also use the _update_nodes_by_key functionality for both
      indexes, since _nodes_by_key has the same signature in each.
      Move _spill_mem_keys_to_disk into a separate function; not strictly
      necessary, but it makes add_node() easier to understand.
      (A usage sketch of the validated add_node arguments follows this entry.)
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
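
    For reference, a minimal usage sketch of the arguments the new helper
    validates (hedged: the keys, values and references below are purely
    illustrative; the GraphIndexBuilder constructor arguments and the
    add_node signature are taken from the diff further down):

        from bzrlib.index import GraphIndexBuilder

        # An index whose keys have one element and carry one reference list.
        builder = GraphIndexBuilder(reference_lists=1, key_elements=1)
        # key: a tuple of whitespace-free byte strings; value: bytes without
        # \0 or \n; references: one list of keys per reference list.
        builder.add_node(('rev-2',), 'value for rev-2', ([('rev-1',)],))
        # ('rev-1',) was referenced before being added, so the builder records
        # it as an absent reference until it is added for real.
        builder.add_node(('rev-1',), 'value for rev-1', ([],))
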
    ------------------------------------------------------------
    revno: 3644.2.8
    revision-id: john at arbash-meinel.com-20080825190216-vdkyinz5p5e5s8kq
    parent: john at arbash-meinel.com-20080825183623-n5h3h1d5ky8yr7d6
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Mon 2008-08-25 14:02:16 -0500
    message:
      Two quick tweaks.
      Change _iter_mem_nodes to yield in sorted order.
      That way we can sort purely on the 'key', which
      we know is the sort key anyway. This shaves off
      a *lot* of time spent in 'sorted()'.
      Also, if a reference is already in our output nodes,
      we know it has already been checked as a valid
      key, so we don't need to check it again.
      This shaves another 600ms or so for a bzr.dev tree.
      (An illustrative sorting sketch follows this entry.)
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
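
    To make the sorting tweak concrete, here is a small standalone sketch
    (hedged: toy data, not the bzrlib code; relative timings will vary, the
    point is that sorting bare keys and then looking nodes up avoids building
    and sorting the full yielded tuples):

        import random
        import timeit

        # Data shaped roughly like the builder's in-memory nodes:
        # {key_tuple: (references, value)}
        nodes = {}
        for i in random.sample(range(1000000), 100000):
            nodes[('key-%08d' % i,)] = ((), 'value-%d' % i)
        builder = object()   # stands in for the 'self' in yielded tuples

        def sort_full_tuples():
            # Old style: materialise every (builder, key, value, refs) tuple,
            # then sort the lot.
            return sorted((builder, key, value, refs)
                          for key, (refs, value) in nodes.items())

        def sort_keys_only():
            # New style: sort just the keys, then emit nodes in that order.
            result = []
            for key in sorted(nodes):
                refs, value = nodes[key]
                result.append((builder, key, value, refs))
            return result

        print(timeit.timeit(sort_full_tuples, number=3))
        print(timeit.timeit(sort_keys_only, number=3))
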
    ------------------------------------------------------------
    revno: 3644.2.7
    revision-id: john at arbash-meinel.com-20080825183623-n5h3h1d5ky8yr7d6
    parent: john at arbash-meinel.com-20080825164422-8opo9lb960uev2ce
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Mon 2008-08-25 13:36:23 -0500
    message:
      Remove accidental merge
    modified:
      bzrlib/chunk_writer.py         chunk_writer.py-20080630234519-6ggn4id17nipovny-1
    ------------------------------------------------------------
    revno: 3644.2.6
    revision-id: john at arbash-meinel.com-20080825164422-8opo9lb960uev2ce
    parent: john at arbash-meinel.com-20080825164250-glc5z58nhwzpgdo2
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Mon 2008-08-25 11:44:22 -0500
    message:
      Restore the exact old tests, only assert that
      _nodes_by_key is None, rather than an empty dict.
    modified:
      bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
    ------------------------------------------------------------
    revno: 3644.2.5
    revision-id: john at arbash-meinel.com-20080825164250-glc5z58nhwzpgdo2
    parent: john at arbash-meinel.com-20080825162901-1qweamxqdvv59ie9
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Mon 2008-08-25 11:42:50 -0500
    message:
      Restore the exact old tests, only assert that
      _nodes_by_key is None, rather than an empty dict.
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
      bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
    ------------------------------------------------------------
    revno: 3644.2.4
    revision-id: john at arbash-meinel.com-20080825162901-1qweamxqdvv59ie9
    parent: john at arbash-meinel.com-20080825162409-0766y19zjs45m87i
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Mon 2008-08-25 11:29:01 -0500
    message:
      Change GraphIndex to also have a _get_nodes_by_key
    modified:
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
    ------------------------------------------------------------
    revno: 3644.2.3
    revision-id: john at arbash-meinel.com-20080825162409-0766y19zjs45m87i
    parent: john at arbash-meinel.com-20080825034342-owq0858uk1wp2q0l
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Mon 2008-08-25 11:24:09 -0500
    message:
      Do a bit more work to get all the tests to pass.
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
      bzrlib/chunk_writer.py         chunk_writer.py-20080630234519-6ggn4id17nipovny-1
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
      bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
    ------------------------------------------------------------
    revno: 3644.2.2
    revision-id: john at arbash-meinel.com-20080825034342-owq0858uk1wp2q0l
    parent: john at arbash-meinel.com-20080825034139-68nxqiqrmqi1l5f0
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Sun 2008-08-24 22:43:42 -0500
    message:
      The new btree index doesn't have 'absent' keys in its _nodes.
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
    ------------------------------------------------------------
    revno: 3644.2.1
    revision-id: john at arbash-meinel.com-20080825034139-68nxqiqrmqi1l5f0
    parent: pqm at pqm.ubuntu.com-20080822042630-on3dxyek4ezk0miu
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: index_builder_cleanup
    timestamp: Sun 2008-08-24 22:41:39 -0500
    message:
      Change the IndexBuilders to not generate the nodes_by_key unless needed.
      (A sketch of this lazy pattern follows this entry.)
    modified:
      bzrlib/btree_index.py          index.py-20080624222253-p0x5f92uyh5hw734-7
      bzrlib/index.py                index.py-20070712131115-lolkarso50vjr64s-1
      bzrlib/tests/test_btree_index.py test_index.py-20080624222253-p0x5f92uyh5hw734-13
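
A minimal standalone sketch of the lazy nodes-by-key pattern introduced in
this branch (hedged: LazyNodesByKey is a toy class written for illustration,
not the bzrlib implementation; its attribute names merely mirror the
builder's for readability):

    class LazyNodesByKey(object):
        """Toy builder that defers building the nodes-by-key mapping."""

        def __init__(self):
            self._nodes = {}           # {key_tuple: value}
            self._nodes_by_key = None  # built on first prefix-style lookup

        def add_node(self, key, value):
            self._nodes[key] = value
            if self._nodes_by_key is not None:
                # Keep the cache consistent once it has been built.
                self._update_nodes_by_key(key, value)

        def _update_nodes_by_key(self, key, value):
            # For a key of (foo, bar, baz) create
            # _nodes_by_key[foo][bar][baz] = (key, value)
            key_dict = self._nodes_by_key
            for subkey in key[:-1]:
                key_dict = key_dict.setdefault(subkey, {})
            key_dict[key[-1]] = (key, value)

        def _get_nodes_by_key(self):
            if self._nodes_by_key is None:
                self._nodes_by_key = {}
                for key, value in self._nodes.items():
                    self._update_nodes_by_key(key, value)
            return self._nodes_by_key

    builder = LazyNodesByKey()
    builder.add_node(('foo', 'bar', 'baz'), 'v1')
    assert builder._nodes_by_key is None       # plain add_node stays cheap
    builder._get_nodes_by_key()                # first lookup pays the cost once
    builder.add_node(('foo', 'bar', 'qux'), 'v2')
    assert builder._nodes_by_key is not None   # and the cache stays current
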
=== modified file 'NEWS'
--- a/NEWS	2008-09-03 20:58:40 +0000
+++ b/NEWS	2008-09-03 22:30:56 +0000
@@ -145,6 +145,11 @@
       is unknown in both source and target.
       (Robert Collins, Aaron Bentley)
 
+    * ``GraphIndexBuilder.add_node`` and ``BTreeBuilder`` have been
+      streamlined a bit. This should make creating large indexes faster.
+      (In benchmarking, it now takes less time to create a BTree index than
+      it takes to read the GraphIndex one.) (John Arbash Meinel)
+
     * Mail clients for `bzr send` are now listed in a registry.  This
       allows plugins to add new clients by registering them with
       ``bzrlib.mail_client.mail_client_registry``.  All of the built-in

=== modified file 'bzrlib/btree_index.py'
--- a/bzrlib/btree_index.py	2008-08-28 01:59:58 +0000
+++ b/bzrlib/btree_index.py	2008-08-28 20:19:20 +0000
@@ -37,7 +37,6 @@
     trace,
     )
 from bzrlib.index import _OPTION_NODE_REFS, _OPTION_KEY_ELEMENTS, _OPTION_LEN
-from bzrlib.osutils import basename, dirname
 from bzrlib.transport import get_transport
 
 
@@ -78,8 +77,10 @@
             del byte_lines[-1]
             skipped_bytes = padding
         self.spool.writelines(byte_lines)
-        if (self.spool.tell() + skipped_bytes) % _PAGE_SIZE != 0:
-            raise AssertionError("incorrect node length")
+        remainder = (self.spool.tell() + skipped_bytes) % _PAGE_SIZE
+        if remainder != 0:
+            raise AssertionError("incorrect node length: %d, %d"
+                                 % (self.spool.tell(), remainder))
         self.nodes += 1
         self.writer = None
 
@@ -135,6 +136,10 @@
             key_elements=key_elements)
         self._spill_at = spill_at
         self._backing_indices = []
+        # A map of {key: (node_refs, value)}
+        self._nodes = {}
+        # Indicate it hasn't been built yet
+        self._nodes_by_key = None
 
     def add_node(self, key, value, references=()):
         """Add a node to the index.
@@ -150,10 +155,32 @@
         :param value: The value to associate with the key. It may be any
             bytes as long as it does not contain \0 or \n.
         """
-        index.GraphIndexBuilder.add_node(self, key, value, references=references)
+        # we don't care about absent_references
+        node_refs, _ = self._check_key_ref_value(key, references, value)
+        if key in self._nodes:
+            raise errors.BadIndexDuplicateKey(key, self)
+        self._nodes[key] = (node_refs, value)
+        self._keys.add(key)
+        if self._nodes_by_key is not None and self._key_length > 1:
+            self._update_nodes_by_key(key, value, node_refs)
         if len(self._keys) < self._spill_at:
             return
-        iterators_to_combine = [iter(sorted(self._iter_mem_nodes()))]
+        self._spill_mem_keys_to_disk()
+
+    def _spill_mem_keys_to_disk(self):
+        """Write the in memory keys down to disk to cap memory consumption.
+
+        If we already have some keys written to disk, we will combine them so
+        as to preserve the sorted order.  The algorithm for combining uses
+        powers of two.  So on the first spill, write all mem nodes into a
+        single index. On the second spill, combine the mem nodes with the nodes
+        on disk to create a 2x sized disk index and get rid of the first index.
+        On the third spill, create a single new disk index, which will contain
+        the mem nodes, and preserve the existing 2x sized index.  On the fourth,
+        combine mem with the first and second indexes, creating a new one of
+        size 4x. On the fifth create a single new one, etc.
+        """
+        iterators_to_combine = [self._iter_mem_nodes()]
         pos = -1
         for pos, backing in enumerate(self._backing_indices):
             if backing is None:
@@ -163,9 +190,11 @@
         backing_pos = pos + 1
         new_backing_file, size = \
             self._write_nodes(self._iter_smallest(iterators_to_combine))
-        new_backing = BTreeGraphIndex(
-            get_transport(dirname(new_backing_file.name)),
-            basename(new_backing_file.name), size)
+        dir_path, base_name = osutils.split(new_backing_file.name)
+        # Note: The transport here isn't strictly needed, because we will use
+        #       direct access to the new_backing._file object
+        new_backing = BTreeGraphIndex(get_transport(dir_path),
+                                      base_name, size)
         # GC will clean up the file
         new_backing._file = new_backing_file
         if len(self._backing_indices) == backing_pos:
@@ -175,7 +204,7 @@
             self._backing_indices[pos] = None
         self._keys = set()
         self._nodes = {}
-        self._nodes_by_key = {}
+        self._nodes_by_key = None
 
     def add_nodes(self, nodes):
         """Add nodes to the index.
@@ -191,14 +220,15 @@
 
     def _iter_mem_nodes(self):
         """Iterate over the nodes held in memory."""
+        nodes = self._nodes
         if self.reference_lists:
-            for key, (absent, references, value) in self._nodes.iteritems():
-                if not absent:
-                    yield self, key, value, references
+            for key in sorted(nodes):
+                references, value = nodes[key]
+                yield self, key, value, references
         else:
-            for key, (absent, references, value) in self._nodes.iteritems():
-                if not absent:
-                    yield self, key, value
+            for key in sorted(nodes):
+                references, value = nodes[key]
+                yield self, key, value
 
     def _iter_smallest(self, iterators_to_combine):
         if len(iterators_to_combine) == 1:
@@ -358,7 +388,10 @@
             copied_len = osutils.pumpfile(row.spool, result)
             if copied_len != (row.nodes - 1) * _PAGE_SIZE:
                 if type(row) != _LeafBuilderRow:
-                    raise AssertionError("Not enough data copied")
+                    raise AssertionError("Incorrect amount of data copied"
+                        " expected: %d, got: %d"
+                        % ((row.nodes - 1) * _PAGE_SIZE,
+                           copied_len))
         result.flush()
         size = result.tell()
         result.seek(0)
@@ -384,7 +417,7 @@
                 "iter_all_entries scales with size of history.")
         # Doing serial rather than ordered would be faster; but this shouldn't
         # be getting called routinely anyway.
-        iterators = [iter(sorted(self._iter_mem_nodes()))]
+        iterators = [self._iter_mem_nodes()]
         for backing in self._backing_indices:
             if backing is not None:
                 iterators.append(backing.iter_all_entries())
@@ -404,13 +437,11 @@
         if self.reference_lists:
             for key in keys.intersection(self._keys):
                 node = self._nodes[key]
-                if not node[0]:
-                    yield self, key, node[2], node[1]
+                yield self, key, node[1], node[0]
         else:
             for key in keys.intersection(self._keys):
                 node = self._nodes[key]
-                if not node[0]:
-                    yield self, key, node[2]
+                yield self, key, node[1]
         keys.difference_update(self._keys)
         for backing in self._backing_indices:
             if backing is None:
@@ -459,12 +490,10 @@
                     node = self._nodes[key]
                 except KeyError:
                     continue
-                if node[0]:
-                    continue
                 if self.reference_lists:
-                    yield self, key, node[2], node[1]
+                    yield self, key, node[1], node[0]
                 else:
-                    yield self, key, node[2]
+                    yield self, key, node[1]
             return
         for key in keys:
             # sanity check
@@ -473,7 +502,7 @@
             if len(key) != self._key_length:
                 raise errors.BadIndexKey(key)
             # find what it refers to:
-            key_dict = self._nodes_by_key
+            key_dict = self._get_nodes_by_key()
             elements = list(key)
             # find the subdict to return
             try:
@@ -499,6 +528,24 @@
             else:
                 yield (self, ) + key_dict
 
+    def _get_nodes_by_key(self):
+        if self._nodes_by_key is None:
+            nodes_by_key = {}
+            if self.reference_lists:
+                for key, (references, value) in self._nodes.iteritems():
+                    key_dict = nodes_by_key
+                    for subkey in key[:-1]:
+                        key_dict = key_dict.setdefault(subkey, {})
+                    key_dict[key[-1]] = key, value, references
+            else:
+                for key, (references, value) in self._nodes.iteritems():
+                    key_dict = nodes_by_key
+                    for subkey in key[:-1]:
+                        key_dict = key_dict.setdefault(subkey, {})
+                    key_dict[key[-1]] = key, value
+            self._nodes_by_key = nodes_by_key
+        return self._nodes_by_key
+
     def key_count(self):
         """Return an estimate of the number of keys in this index.
 

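A toy simulation of the powers-of-two spill schedule described in the
_spill_mem_keys_to_disk docstring above (hedged: it tracks key counts only,
creates no real indices, and simulate_spills is a made-up helper name):

    def simulate_spills(num_spills, spill_at=100000):
        backing = []  # backing[i] is the key count of disk index i, or None
        for spill in range(1, num_spills + 1):
            sizes_to_combine = [spill_at]  # the in-memory keys being spilled
            pos = -1
            for pos, size in enumerate(backing):
                if size is None:
                    pos -= 1
                    break
                sizes_to_combine.append(size)
            backing_pos = pos + 1
            if len(backing) == backing_pos:
                backing.append(None)
            # The new index at backing_pos absorbs everything smaller than it.
            backing[backing_pos] = sum(sizes_to_combine)
            for i in range(backing_pos):
                backing[i] = None
            print('spill %d: backing key counts %s' % (spill, backing))

    simulate_spills(6)
    # spill 1: [100000]
    # spill 2: [None, 200000]
    # spill 3: [100000, 200000]
    # spill 4: [None, None, 400000]
    # spill 5: [100000, None, 400000]
    # spill 6: [None, 200000, 400000]
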
=== modified file 'bzrlib/index.py'
--- a/bzrlib/index.py	2008-09-02 17:52:00 +0000
+++ b/bzrlib/index.py	2008-09-03 22:30:56 +0000
@@ -80,8 +80,9 @@
         """
         self.reference_lists = reference_lists
         self._keys = set()
+        # A dict of {key: (absent, ref_lists, value)}
         self._nodes = {}
-        self._nodes_by_key = {}
+        self._nodes_by_key = None
         self._key_length = key_elements
 
     def _check_key(self, key):
@@ -94,16 +95,61 @@
             if not element or _whitespace_re.search(element) is not None:
                 raise errors.BadIndexKey(element)
 
-    def add_node(self, key, value, references=()):
-        """Add a node to the index.
-
-        :param key: The key. keys are non-empty tuples containing
-            as many whitespace-free utf8 bytestrings as the key length
-            defined for this index.
-        :param references: An iterable of iterables of keys. Each is a
-            reference to another key.
-        :param value: The value to associate with the key. It may be any
-            bytes as long as it does not contain \0 or \n.
+    def _get_nodes_by_key(self):
+        if self._nodes_by_key is None:
+            nodes_by_key = {}
+            if self.reference_lists:
+                for key, (absent, references, value) in self._nodes.iteritems():
+                    if absent:
+                        continue
+                    key_dict = nodes_by_key
+                    for subkey in key[:-1]:
+                        key_dict = key_dict.setdefault(subkey, {})
+                    key_dict[key[-1]] = key, value, references
+            else:
+                for key, (absent, references, value) in self._nodes.iteritems():
+                    if absent:
+                        continue
+                    key_dict = nodes_by_key
+                    for subkey in key[:-1]:
+                        key_dict = key_dict.setdefault(subkey, {})
+                    key_dict[key[-1]] = key, value
+            self._nodes_by_key = nodes_by_key
+        return self._nodes_by_key
+
+    def _update_nodes_by_key(self, key, value, node_refs):
+        """Update the _nodes_by_key dict with a new key.
+
+        For a key of (foo, bar, baz) create
+        _nodes_by_key[foo][bar][baz] = key_value
+        """
+        if self._nodes_by_key is None:
+            return
+        key_dict = self._nodes_by_key
+        if self.reference_lists:
+            key_value = key, value, node_refs
+        else:
+            key_value = key, value
+        for subkey in key[:-1]:
+            key_dict = key_dict.setdefault(subkey, {})
+        key_dict[key[-1]] = key_value
+
+    def _check_key_ref_value(self, key, references, value):
+        """Check that 'key' and 'references' are all valid.
+
+        :param key: A key tuple. Must conform to the key interface (be a tuple,
+            be of the right length, not have any whitespace or nulls in any key
+            element.)
+        :param references: An iterable of reference lists. Something like
+            [[(ref, key)], [(ref, key), (other, key)]]
+        :param value: The value associate with this key. Must not contain
+            newlines or null characters.
+        :return: (node_refs, absent_references)
+            node_refs   basically a packed form of 'references' where all
+                        iterables are tuples
+            absent_references   reference keys that are not in self._nodes.
+                                This may contain duplicates if the same key is
+                                referenced in multiple lists.
         """
         self._check_key(key)
         if _newline_null_re.search(value) is not None:
@@ -111,29 +157,40 @@
         if len(references) != self.reference_lists:
             raise errors.BadIndexValue(references)
         node_refs = []
+        absent_references = []
         for reference_list in references:
             for reference in reference_list:
-                self._check_key(reference)
+                # If reference *is* in self._nodes, then we know it has already
+                # been checked.
                 if reference not in self._nodes:
-                    self._nodes[reference] = ('a', (), '')
+                    self._check_key(reference)
+                    absent_references.append(reference)
             node_refs.append(tuple(reference_list))
-        if key in self._nodes and self._nodes[key][0] == '':
+        return tuple(node_refs), absent_references
+
+    def add_node(self, key, value, references=()):
+        """Add a node to the index.
+
+        :param key: The key. keys are non-empty tuples containing
+            as many whitespace-free utf8 bytestrings as the key length
+            defined for this index.
+        :param references: An iterable of iterables of keys. Each is a
+            reference to another key.
+        :param value: The value to associate with the key. It may be any
+            bytes as long as it does not contain \0 or \n.
+        """
+        (node_refs,
+         absent_references) = self._check_key_ref_value(key, references, value)
+        if key in self._nodes and self._nodes[key][0] != 'a':
             raise errors.BadIndexDuplicateKey(key, self)
-        self._nodes[key] = ('', tuple(node_refs), value)
+        for reference in absent_references:
+            # There may be duplicates, but I don't think it is worth worrying
+            # about
+            self._nodes[reference] = ('a', (), '')
+        self._nodes[key] = ('', node_refs, value)
         self._keys.add(key)
-        if self._key_length > 1:
-            key_dict = self._nodes_by_key
-            if self.reference_lists:
-                key_value = key, value, tuple(node_refs)
-            else:
-                key_value = key, value
-            # possibly should do this on-demand, but it seems likely it is 
-            # always wanted
-            # For a key of (foo, bar, baz) create
-            # _nodes_by_key[foo][bar][baz] = key_value
-            for subkey in key[:-1]:
-                key_dict = key_dict.setdefault(subkey, {})
-            key_dict[key[-1]] = key_value
+        if self._nodes_by_key is not None and self._key_length > 1:
+            self._update_nodes_by_key(key, value, node_refs)
 
     def finish(self):
         lines = [_SIGNATURE]
@@ -142,7 +199,7 @@
         lines.append(_OPTION_LEN + str(len(self._keys)) + '\n')
         prefix_length = sum(len(x) for x in lines)
         # references are byte offsets. To avoid having to do nasty
-        # polynomial work to resolve offsets (references to later in the 
+        # polynomial work to resolve offsets (references to later in the
         # file cannot be determined until all the inbetween references have
         # been calculated too) we pad the offsets with 0's to make them be
         # of consistent length. Using binary offsets would break the trivial
@@ -325,14 +382,14 @@
                 node_value = value
             self._nodes[key] = node_value
             if self._key_length > 1:
-                subkey = list(reversed(key[:-1]))
+                # TODO: We may want to do this lazily, but if we are calling
+                #       _buffer_all, we are likely to be doing
+                #       iter_entries_prefix
                 key_dict = self._nodes_by_key
                 if self.node_ref_lists:
                     key_value = key, node_value[0], node_value[1]
                 else:
                     key_value = key, node_value
-                # possibly should do this on-demand, but it seems likely it is 
-                # always wanted
                 # For a key of (foo, bar, baz) create
                 # _nodes_by_key[foo][bar][baz] = key_value
                 for subkey in key[:-1]:
@@ -1275,6 +1332,7 @@
                 else:
                     yield self, key, node[2]
             return
+        nodes_by_key = self._get_nodes_by_key()
         for key in keys:
             # sanity check
             if key[0] is None:
@@ -1282,7 +1340,7 @@
             if len(key) != self._key_length:
                 raise errors.BadIndexKey(key)
             # find what it refers to:
-            key_dict = self._nodes_by_key
+            key_dict = nodes_by_key
             elements = list(key)
             # find the subdict to return
             try:

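To make the nested _nodes_by_key mapping above concrete, here is a small
standalone sketch of how a prefix lookup walks it (hedged: toy code, not the
iter_entries_prefix implementation; iter_prefix is a made-up helper):

    # Build the mapping the way _update_nodes_by_key does:
    # nodes_by_key[k0][k1][k2] = (key, value)
    nodes_by_key = {}
    for key, value in [(('foo', 'bar', 'baz'), 'v1'),
                       (('foo', 'bar', 'qux'), 'v2'),
                       (('foo', 'other', 'baz'), 'v3')]:
        key_dict = nodes_by_key
        for subkey in key[:-1]:
            key_dict = key_dict.setdefault(subkey, {})
        key_dict[key[-1]] = (key, value)

    def iter_prefix(nodes_by_key, prefix):
        """Yield (key, value) for every stored key starting with prefix."""
        key_dict = nodes_by_key
        for element in prefix:
            key_dict = key_dict[element]  # KeyError means no such prefix
        pending = [key_dict]
        while pending:
            entry = pending.pop()
            if isinstance(entry, dict):
                pending.extend(entry.values())
            else:
                yield entry

    print(sorted(iter_prefix(nodes_by_key, ('foo', 'bar'))))
    # [(('foo', 'bar', 'baz'), 'v1'), (('foo', 'bar', 'qux'), 'v2')]
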
=== modified file 'bzrlib/tests/test_btree_index.py'
--- a/bzrlib/tests/test_btree_index.py	2008-08-26 00:56:10 +0000
+++ b/bzrlib/tests/test_btree_index.py	2008-08-28 20:13:31 +0000
@@ -354,23 +354,23 @@
         # predictably.
         self.assertEqual(1, len(builder._nodes))
         self.assertEqual(1, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         builder.add_node(*nodes[1])
         self.assertEqual(0, len(builder._nodes))
         self.assertEqual(0, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         self.assertEqual(1, len(builder._backing_indices))
         self.assertEqual(2, builder._backing_indices[0].key_count())
         # now back to memory
         builder.add_node(*nodes[2])
         self.assertEqual(1, len(builder._nodes))
         self.assertEqual(1, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         # And spills to a second backing index combing all
         builder.add_node(*nodes[3])
         self.assertEqual(0, len(builder._nodes))
         self.assertEqual(0, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         self.assertEqual(2, len(builder._backing_indices))
         self.assertEqual(None, builder._backing_indices[0])
         self.assertEqual(4, builder._backing_indices[1].key_count())
@@ -379,7 +379,7 @@
         builder.add_node(*nodes[5])
         self.assertEqual(0, len(builder._nodes))
         self.assertEqual(0, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         self.assertEqual(2, len(builder._backing_indices))
         self.assertEqual(2, builder._backing_indices[0].key_count())
         self.assertEqual(4, builder._backing_indices[1].key_count())
@@ -443,24 +443,28 @@
         # Test the parts of the index that take up memory are doing so
         # predictably.
         self.assertEqual(1, len(builder._keys))
-        self.assertEqual(2, len(builder._nodes))
-        self.assertNotEqual({}, builder._nodes_by_key)
+        self.assertEqual(1, len(builder._nodes))
+        self.assertIs(None, builder._nodes_by_key)
         builder.add_node(*nodes[1])
         self.assertEqual(0, len(builder._keys))
         self.assertEqual(0, len(builder._nodes))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         self.assertEqual(1, len(builder._backing_indices))
         self.assertEqual(2, builder._backing_indices[0].key_count())
         # now back to memory
+        old = dict(builder._get_nodes_by_key()) #Build up the nodes by key dict
         builder.add_node(*nodes[2])
-        self.assertEqual(2, len(builder._nodes))
+        self.assertEqual(1, len(builder._nodes))
         self.assertEqual(1, len(builder._keys))
+        self.assertIsNot(None, builder._nodes_by_key)
         self.assertNotEqual({}, builder._nodes_by_key)
+        # We should have a new entry
+        self.assertNotEqual(old, builder._nodes_by_key)
         # And spills to a second backing index combing all
         builder.add_node(*nodes[3])
         self.assertEqual(0, len(builder._nodes))
         self.assertEqual(0, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         self.assertEqual(2, len(builder._backing_indices))
         self.assertEqual(None, builder._backing_indices[0])
         self.assertEqual(4, builder._backing_indices[1].key_count())
@@ -469,7 +473,7 @@
         builder.add_node(*nodes[5])
         self.assertEqual(0, len(builder._nodes))
         self.assertEqual(0, len(builder._keys))
-        self.assertEqual({}, builder._nodes_by_key)
+        self.assertIs(None, builder._nodes_by_key)
         self.assertEqual(2, len(builder._backing_indices))
         self.assertEqual(2, builder._backing_indices[0].key_count())
         self.assertEqual(4, builder._backing_indices[1].key_count())



