Rev 3912: (jam) Add a cache for deserializing inventory entries from XML. in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Fri Dec 19 17:15:05 GMT 2008


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 3912
revision-id: pqm at pqm.ubuntu.com-20081219171459-521qbou7ho7g297f
parent: pqm at pqm.ubuntu.com-20081219065652-z3g78j4hrvdnf8bj
parent: john at arbash-meinel.com-20081213031940-goymz22b10o9zu32
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Fri 2008-12-19 17:14:59 +0000
message:
  (jam) Add a cache for deserializing inventory entries from XML.
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/fifo_cache.py           fifo_cache.py-20081209212307-31ffjwvteyvmydnf-1
  bzrlib/inventory.py            inventory.py-20050309040759-6648b84ca2005b37
  bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
  bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
  bzrlib/tests/test_fifo_cache.py test_fifo_cache.py-20081209212307-31ffjwvteyvmydnf-2
  bzrlib/xml4.py                 xml4.py-20050916091259-db5ab55e7e6ca324
  bzrlib/xml5.py                 xml5.py-20080328030717-t9guwinq8hom0ar3-1
  bzrlib/xml7.py                 xml7.py-20061029182747-d5tiiny21bvrd2jj-1
  bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
  bzrlib/xml_serializer.py       xml.py-20050309040759-57d51586fdec365d
    ------------------------------------------------------------
    revno: 3882.6.23
    revision-id: john at arbash-meinel.com-20081213031940-goymz22b10o9zu32
    parent: john at arbash-meinel.com-20081212200628-xmm9i33jq3d6tsh3
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Fri 2008-12-12 21:19:40 -0600
    message:
      Change the XMLSerializer.read_inventory_from_string api.
      
      This allows us to pass in the entry cache, rather than using a global.
      This gives a lifetime to the cache, and eliminates some of the
      concerns about expecting a different IE from different serializers, etc.
      
      The cache is also cleared when the repo is unlocked.
    modified:
      bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
      bzrlib/xml5.py                 xml5.py-20080328030717-t9guwinq8hom0ar3-1
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
      bzrlib/xml_serializer.py       xml.py-20050309040759-57d51586fdec365d
    ------------------------------------------------------------
    revno: 3882.6.22
    revision-id: john at arbash-meinel.com-20081212200628-xmm9i33jq3d6tsh3
    parent: john at arbash-meinel.com-20081211223447-wqyc0ynjs2w59r3s
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Fri 2008-12-12 14:06:28 -0600
    message:
      Start moving things around so that the entry cache is passed in.
      
      This has a negligible effect on performance, and means that we can have the
      cache lifetime associated with a repository, rather than 'always on'.
    modified:
      bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
      bzrlib/xml4.py                 xml4.py-20050916091259-db5ab55e7e6ca324
      bzrlib/xml5.py                 xml5.py-20080328030717-t9guwinq8hom0ar3-1
      bzrlib/xml7.py                 xml7.py-20061029182747-d5tiiny21bvrd2jj-1
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.21
    revision-id: john at arbash-meinel.com-20081211223447-wqyc0ynjs2w59r3s
    parent: john at arbash-meinel.com-20081211222229-c9l7lebfgrzb5pme
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 16:34:47 -0600
    message:
      Don't cache the InventoryEntry we will return, callers mutate those objects.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.20
    revision-id: john at arbash-meinel.com-20081211222229-c9l7lebfgrzb5pme
    parent: john at arbash-meinel.com-20081211213610-p2ovasy20mc5a2j1
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 16:22:29 -0600
    message:
      Clear out the InventoryEntry caches as part of the test suite.
    modified:
      bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
    ------------------------------------------------------------
    revno: 3882.6.19
    revision-id: john at arbash-meinel.com-20081211213610-p2ovasy20mc5a2j1
    parent: john at arbash-meinel.com-20081211205928-rmjy6scj28u7l8l3
    parent: pqm at pqm.ubuntu.com-20081211202300-6dz1vo3phfsc23pj
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 15:36:10 -0600
    message:
      Merge bzr.dev 3895, resolve minor conflict in bzrlib/repository.py
    added:
      bzrlib/_chunks_to_lines_py.py  _chunks_to_lines_py.-20081211024848-6uc3mtuje8j14l60-1
      bzrlib/_chunks_to_lines_pyx.pyx _chunks_to_lines_pyx-20081211021736-op7n8vrxgrd8snfi-1
      bzrlib/tests/blackbox/test_shelve.py test_ls_shelf.py-20081202053526-thlo8yt0pi1cgor1-1
      bzrlib/tests/test__chunks_to_lines.py test__chunks_to_line-20081211024848-6uc3mtuje8j14l60-2
    modified:
      .bzrignore                     bzrignore-20050311232317-81f7b71efa2db11a
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/builtins.py             builtins.py-20050830033751-fc01482b9ca23183
      bzrlib/graph.py                graph_walker.py-20070525030359-y852guab65d4wtn0-1
      bzrlib/knit.py                 knit.py-20051212171256-f056ac8f0fbe1bd9
      bzrlib/merge.py                merge.py-20050513021216-953b65a438527106
      bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
      bzrlib/remote.py               remote.py-20060720103555-yeeg2x51vn0rbtdp-1
      bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
      bzrlib/shelf.py                prepare_shelf.py-20081005181341-n74qe6gu1e65ad4v-1
      bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
      bzrlib/tests/blackbox/__init__.py __init__.py-20051128053524-eba30d8255e08dc3
      bzrlib/tests/per_repository/test_add_fallback_repository.py test_add_fallback_re-20080215040003-8w9n4ck9uqdxj18m-1
      bzrlib/tests/test_graph.py     test_graph_walker.py-20070525030405-enq4r60hhi9xrujc-1
      bzrlib/tests/test_osutils.py   test_osutils.py-20051201224856-e48ee24c12182989
      bzrlib/tests/test_remote.py    test_remote.py-20060720103555-yeeg2x51vn0rbtdp-2
      bzrlib/tests/test_shelf.py     test_prepare_shelf.p-20081005181341-n74qe6gu1e65ad4v-2
      bzrlib/tests/test_versionedfile.py test_versionedfile.py-20060222045249-db45c9ed14a1c2e5
      bzrlib/transform.py            transform.py-20060105172343-dd99e54394d91687
      bzrlib/versionedfile.py        versionedfile.py-20060222045106-5039c71ee3b65490
      bzrlib/weave.py                knit.py-20050627021749-759c29984154256b
      setup.py                       setup.py-20050314065409-02f8a0a6e3f9bc70
    ------------------------------------------------------------
    revno: 3882.6.18
    revision-id: john at arbash-meinel.com-20081211205928-rmjy6scj28u7l8l3
    parent: john at arbash-meinel.com-20081211203350-iur6tsuvq9gtswe9
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 14:59:28 -0600
    message:
      Bring in optimizations to Inventory._make_delta.
    modified:
      bzrlib/inventory.py            inventory.py-20050309040759-6648b84ca2005b37
    ------------------------------------------------------------
    revno: 3882.6.17
    revision-id: john at arbash-meinel.com-20081211203350-iur6tsuvq9gtswe9
    parent: john at arbash-meinel.com-20081211202214-ebsb9s3xua0lhs5q
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 14:33:50 -0600
    message:
      Added NEWS entry for the XML inventory entry cache.
      
      Also add a forgotten entry for the FIFOCache code.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
    ------------------------------------------------------------
    revno: 3882.6.16
    revision-id: john at arbash-meinel.com-20081211202214-ebsb9s3xua0lhs5q
    parent: john at arbash-meinel.com-20081211200919-0cymakwpvgc5xjtn
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 14:22:14 -0600
    message:
      Update InventoryEntry.__eq__ in case we enable caching without .copy()
    modified:
      bzrlib/inventory.py            inventory.py-20050309040759-6648b84ca2005b37
    ------------------------------------------------------------
    revno: 3882.6.15
    revision-id: john at arbash-meinel.com-20081211200919-0cymakwpvgc5xjtn
    parent: john at arbash-meinel.com-20081210230600-d84wjysaeu1caoea
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Thu 2008-12-11 14:09:19 -0600
    message:
      Change the _entry_cache into class specific.
      
      This means that each Serializer will have its own entry_cache.
      But that will ensure any differences in results from different
      serializers will not be propagated. And it doesn't matter for
      most use cases, as we are really only dealing with one serializer
      at a time.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.14
    revision-id: john at arbash-meinel.com-20081210230600-d84wjysaeu1caoea
    parent: john at arbash-meinel.com-20081210230521-t1t4d6yfh8kt6ft8
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 17:06:00 -0600
    message:
      Release inventory xml strings that we've already processed.
      
      This doesn't decrease our peak memory consumption, but at least it decreases it
      a little bit over time.
    modified:
      bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
    ------------------------------------------------------------
    revno: 3882.6.13
    revision-id: john at arbash-meinel.com-20081210230521-t1t4d6yfh8kt6ft8
    parent: john at arbash-meinel.com-20081210223943-aqedq91tf7e6tecs
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 17:05:21 -0600
    message:
      We don't need to inline get_cached until we've had the miss.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.12
    revision-id: john at arbash-meinel.com-20081210223943-aqedq91tf7e6tecs
    parent: john at arbash-meinel.com-20081210222704-465gxu7k0wehug6o
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 16:39:43 -0600
    message:
      Use resize logic to ensure our inventory entry cache is at an optimal size.
      
      We don't want to cache all entries across all inventories, but we are okay caching a
      bit more than would fit in one total inventory.
    modified:
      bzrlib/xml5.py                 xml5.py-20080328030717-t9guwinq8hom0ar3-1
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.11
    revision-id: john at arbash-meinel.com-20081210222704-465gxu7k0wehug6o
    parent: john at arbash-meinel.com-20081210222654-7sem2ud7397cd8vg
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 16:27:04 -0600
    message:
      comment update
    modified:
      bzrlib/tests/test_fifo_cache.py test_fifo_cache.py-20081209212307-31ffjwvteyvmydnf-2
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.10
    revision-id: john at arbash-meinel.com-20081210222654-7sem2ud7397cd8vg
    parent: john at arbash-meinel.com-20081210213712-82i1n0uem1crrnoh
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 16:26:54 -0600
    message:
      Add resize() functionality to the FIFO Cache.
    modified:
      bzrlib/fifo_cache.py           fifo_cache.py-20081209212307-31ffjwvteyvmydnf-1
    ------------------------------------------------------------
    revno: 3882.6.9
    revision-id: john at arbash-meinel.com-20081210213712-82i1n0uem1crrnoh
    parent: john at arbash-meinel.com-20081210185059-hfucbbmumvmes1ql
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 15:37:12 -0600
    message:
      Add some more direct timings using time.clock() instead of lsprof.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.8
    revision-id: john at arbash-meinel.com-20081210185059-hfucbbmumvmes1ql
    parent: john at arbash-meinel.com-20081210182935-dejc81qksqka717d
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 12:50:59 -0600
    message:
      Add detailed timings on the last 100 mysql revisions.
      
      Revert to the 'safe' copy-everything code.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.7
    revision-id: john at arbash-meinel.com-20081210182935-dejc81qksqka717d
    parent: john at arbash-meinel.com-20081210174726-7e0jy7j5kmq20alx
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 12:29:35 -0600
    message:
      Do Inventory.add() optimizations, and determine 'best' results.
    modified:
      bzrlib/xml5.py                 xml5.py-20080328030717-t9guwinq8hom0ar3-1
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.6
    revision-id: john at arbash-meinel.com-20081210174726-7e0jy7j5kmq20alx
    parent: john at arbash-meinel.com-20081210173820-sjxjc8ktpuhnl1g4
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 11:47:26 -0600
    message:
      Add some actual timings, supporting why we use a FIFOCache.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.5
    revision-id: john at arbash-meinel.com-20081210173820-sjxjc8ktpuhnl1g4
    parent: john at arbash-meinel.com-20081210172213-h2b0auuil3qaz28u
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 11:38:20 -0600
    message:
      Use a FIFOCache instead of an LRUCache, and factor out elt.get
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.4
    revision-id: john at arbash-meinel.com-20081210172213-h2b0auuil3qaz28u
    parent: john at arbash-meinel.com-20081208183041-9r88rfoms0y8cr5b
    parent: pqm at pqm.ubuntu.com-20081210082822-li6ku9s3k63kjrpr
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Wed 2008-12-10 11:22:13 -0600
    message:
      Merge bzr.dev 3890, bring in the FIFOCache.
    added:
      bzrlib/fifo_cache.py           fifo_cache.py-20081209212307-31ffjwvteyvmydnf-1
      bzrlib/tests/per_repository/test_add_inventory_by_delta.py test_add_inventory_d-20081013002626-rut81igtlqb4590z-1
      bzrlib/tests/test_fifo_cache.py test_fifo_cache.py-20081209212307-31ffjwvteyvmydnf-2
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/__init__.py             __init__.py-20050309040759-33e65acf91bbcd5d
      bzrlib/builtins.py             builtins.py-20050830033751-fc01482b9ca23183
      bzrlib/commit.py               commit.py-20050511101309-79ec1a0168e0e825
      bzrlib/inventory.py            inventory.py-20050309040759-6648b84ca2005b37
      bzrlib/lru_cache.py            lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
      bzrlib/registry.py             lazy_factory.py-20060809213415-2gfvqadtvdn0phtg-1
      bzrlib/remote.py               remote.py-20060720103555-yeeg2x51vn0rbtdp-1
      bzrlib/repository.py           rev_storage.py-20051111201905-119e9401e46257e3
      bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
      bzrlib/tests/blackbox/test_ls.py test_ls.py-20060712232047-0jraqpecwngee12y-1
      bzrlib/tests/blackbox/test_pull.py test_pull.py-20051201144907-64959364f629947f
      bzrlib/tests/blackbox/test_revision_info.py test_revision_info.py-20050917162600-21dab3877aa348d7
      bzrlib/tests/interrepository_implementations/__init__.py __init__.py-20060220054744-baf49a1f88f17b1a
      bzrlib/tests/per_repository/__init__.py __init__.py-20060131092037-9564957a7d4a841b
      bzrlib/tests/per_repository/test_commit_builder.py test_commit_builder.py-20060606110838-76e3ra5slucqus81-1
      bzrlib/tests/per_repository/test_repository.py test_repository.py-20060131092128-ad07f494f5c9d26c
      bzrlib/transport/__init__.py   transport.py-20050711165921-4978aa7ce1285ad5
      doc/developers/HACKING.txt     HACKING-20050805200004-2a5dc975d870f78c
    ------------------------------------------------------------
    revno: 3882.6.3
    revision-id: john at arbash-meinel.com-20081208183041-9r88rfoms0y8cr5b
    parent: john at arbash-meinel.com-20081208183004-gaxfel9xzzwrc093
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Mon 2008-12-08 12:30:41 -0600
    message:
      If we are going to thrash the inventory entry cache, increase its size.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
    ------------------------------------------------------------
    revno: 3882.6.2
    revision-id: john at arbash-meinel.com-20081208183004-gaxfel9xzzwrc093
    parent: john at arbash-meinel.com-20081208182757-2rls8q1ri36ub6e9
    parent: john at arbash-meinel.com-20081208182300-u1qmnxafwt2rr5dz
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Mon 2008-12-08 12:30:04 -0600
    message:
      Merge the lru cache changes.
    modified:
      NEWS                           NEWS-20050323055033-4e00b5db738777ff
      bzrlib/lru_cache.py            lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
      bzrlib/tests/test_lru_cache.py test_lru_cache.py-20070119165535-hph6rk4h9rzy4180-1
    ------------------------------------------------------------
    revno: 3882.6.1
    revision-id: john at arbash-meinel.com-20081208182757-2rls8q1ri36ub6e9
    parent: pqm at pqm.ubuntu.com-20081205181554-ofrdnafloc43bxkh
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: xml_cache
    timestamp: Mon 2008-12-08 12:27:57 -0600
    message:
      Add an InventoryEntry cache to the xml deserializer.
    modified:
      bzrlib/xml8.py                 xml5.py-20050907032657-aac8f960815b66b1
=== modified file 'NEWS'
--- a/NEWS	2008-12-19 06:14:59 +0000
+++ b/NEWS	2008-12-19 17:14:59 +0000
@@ -103,6 +103,15 @@
 
   INTERNALS:
 
+    * Added an ``InventoryEntry`` cache when deserializing inventories.
+      Can cut the time to iterate over multiple RevisionsTrees in half.
+      (John Arbash Meinel)
+
+    * Added ``bzrlib.fifo_cache.FIFOCache`` which is designed to have
+      minimal overhead versus using a plain dict for cache hits, at the
+      cost of not preserving the 'active' set as well as an ``LRUCache``.
+      (John Arbash Meinel)
+
     * ``KnitVersionedFiles.get_record_stream()`` will now chose a
       more optimal ordering when the keys are requested 'unordered'.
       Previously the order was fully random, now the records should be

=== modified file 'bzrlib/fifo_cache.py'
--- a/bzrlib/fifo_cache.py	2008-12-10 00:07:33 +0000
+++ b/bzrlib/fifo_cache.py	2008-12-10 22:26:54 +0000
@@ -73,6 +73,10 @@
         if len(self) > self._max_cache:
             self.cleanup()
 
+    def cache_size(self):
+        """Get the number of entries we will cache."""
+        return self._max_cache
+
     def cleanup(self):
         """Clear the cache until it shrinks to the requested size.
 
@@ -108,6 +112,21 @@
         key = self._queue.popleft()
         self._remove(key)
 
+    def resize(self, max_cache, after_cleanup_count=None):
+        """Increase/decrease the number of cached entries.
+
+        :param max_cache: The maximum number of entries to cache.
+        :param after_cleanup_count: After cleanup, we should have at most this
+            many entries. This defaults to 80% of max_cache.
+        """
+        self._max_cache = max_cache
+        if after_cleanup_count is None:
+            self._after_cleanup_count = max_cache * 8 / 10
+        else:
+            self._after_cleanup_count = min(max_cache, after_cleanup_count)
+        if len(self) > self._max_cache:
+            self.cleanup()
+
     # raise NotImplementedError on dict functions that would mutate the cache
     # which have not been properly implemented yet.
     def copy(self):
@@ -211,6 +230,10 @@
             # Time to cleanup
             self.cleanup()
 
+    def cache_size(self):
+        """Get the number of bytes we will cache."""
+        return self._max_size
+
     def cleanup(self):
         """Clear the cache until it shrinks to the requested size.
 
@@ -226,3 +249,20 @@
         val = FIFOCache._remove(self, key)
         self._value_size -= self._compute_size(val)
         return val
+
+    def resize(self, max_size, after_cleanup_size=None):
+        """Increase/decrease the amount of cached data.
+
+        :param max_size: The maximum number of bytes to cache.
+        :param after_cleanup_size: After cleanup, we should have at most this
+            many bytes cached. This defaults to 80% of max_size.
+        """
+        FIFOCache.resize(self, max_size)
+        self._max_size = max_size
+        if after_cleanup_size is None:
+            self._after_cleanup_size = max_size * 8 / 10
+        else:
+            self._after_cleanup_size = min(max_size, after_cleanup_size)
+        if self._value_size > self._max_size:
+            self.cleanup()
+

=== modified file 'bzrlib/inventory.py'
--- a/bzrlib/inventory.py	2008-12-05 22:18:09 +0000
+++ b/bzrlib/inventory.py	2008-12-11 20:59:28 +0000
@@ -341,6 +341,9 @@
                    self.revision))
 
     def __eq__(self, other):
+        if other is self:
+            # For the case when objects are cached
+            return True
         if not isinstance(other, InventoryEntry):
             return NotImplemented
 
@@ -1235,20 +1238,33 @@
 
     def _make_delta(self, old):
         """Make an inventory delta from two inventories."""
-        old_ids = set(old)
-        new_ids = set(self)
+        old_getter = getattr(old, '_byid', old)
+        new_getter = self._byid
+        old_ids = set(old_getter)
+        new_ids = set(new_getter)
         adds = new_ids - old_ids
         deletes = old_ids - new_ids
-        common = old_ids.intersection(new_ids)
+        if not adds and not deletes:
+            common = new_ids
+        else:
+            common = old_ids.intersection(new_ids)
         delta = []
         for file_id in deletes:
             delta.append((old.id2path(file_id), None, file_id, None))
         for file_id in adds:
             delta.append((None, self.id2path(file_id), file_id, self[file_id]))
         for file_id in common:
-            if old[file_id] != self[file_id]:
+            new_ie = new_getter[file_id]
+            old_ie = old_getter[file_id]
+            # If xml_serializer returns the cached InventoryEntries (rather
+            # than always doing .copy()), inlining the 'is' check saves 2.7M
+            # calls to __eq__.  Under lsprof this saves 20s => 6s.
+            # It is a minor improvement without lsprof.
+            if old_ie is new_ie or old_ie == new_ie:
+                continue
+            else:
                 delta.append((old.id2path(file_id), self.id2path(file_id),
-                    file_id, self[file_id]))
+                              file_id, new_ie))
         return delta
 
     def remove_recursive_id(self, file_id):

=== modified file 'bzrlib/repository.py'
--- a/bzrlib/repository.py	2008-12-11 21:57:21 +0000
+++ b/bzrlib/repository.py	2008-12-19 17:14:59 +0000
@@ -25,6 +25,7 @@
     check,
     debug,
     errors,
+    fifo_cache,
     generate_ids,
     gpg,
     graph,
@@ -842,6 +843,8 @@
         # Should fetch trigger a reconcile after the fetch? Only needed for
         # some repository formats that can suffer internal inconsistencies.
         self._fetch_reconcile = False
+        # An InventoryEntry cache, used during deserialization
+        self._inventory_entry_cache = fifo_cache.FIFOCache(10*1024)
 
     def __repr__(self):
         return '%s(%r)' % (self.__class__.__name__,
@@ -1145,6 +1148,8 @@
                 raise errors.BzrError(
                     'Must end write groups before releasing write locks.')
         self.control_files.unlock()
+        if self.control_files._lock_count == 0:
+            self._inventory_entry_cache.clear()
         for repo in self._fallback_repositories:
             repo.unlock()
 
@@ -1695,7 +1700,8 @@
         :param revision_id: The expected revision id of the inventory.
         :param xml: A serialised inventory.
         """
-        result = self._serializer.read_inventory_from_string(xml, revision_id)
+        result = self._serializer.read_inventory_from_string(xml, revision_id,
+                    entry_cache=self._inventory_entry_cache)
         if result.revision_id != revision_id:
             raise AssertionError('revision id mismatch %s != %s' % (
                 result.revision_id, revision_id))

=== modified file 'bzrlib/tests/__init__.py'
--- a/bzrlib/tests/__init__.py	2008-12-11 03:08:03 +0000
+++ b/bzrlib/tests/__init__.py	2008-12-12 20:06:28 +0000
@@ -789,7 +789,7 @@
 
     def _clear_debug_flags(self):
         """Prevent externally set debug flags affecting tests.
-        
+
         Tests that want to use debug flags can just set them in the
         debug_flags set during setup/teardown.
         """

=== modified file 'bzrlib/tests/test_fifo_cache.py'
--- a/bzrlib/tests/test_fifo_cache.py	2008-12-09 22:26:40 +0000
+++ b/bzrlib/tests/test_fifo_cache.py	2008-12-10 22:27:04 +0000
@@ -41,6 +41,12 @@
         self.assertEqual([2], list(c.itervalues()))
         self.assertEqual({1: 2}, c)
 
+    def test_cache_size(self):
+        c = fifo_cache.FIFOCache()
+        self.assertEqual(100, c.cache_size())
+        c.resize(20, 5)
+        self.assertEqual(20, c.cache_size())
+
     def test_missing(self):
         c = fifo_cache.FIFOCache()
         self.assertRaises(KeyError, c.__getitem__, 1)
@@ -114,6 +120,45 @@
         c = fifo_cache.FIFOCache()
         self.assertRaises(NotImplementedError, c.popitem)
 
+    def test_resize_smaller(self):
+        c = fifo_cache.FIFOCache()
+        c[1] = 2
+        c[2] = 3
+        c[3] = 4
+        c[4] = 5
+        c[5] = 6
+        # No cleanup, because it is the exact size
+        c.resize(5)
+        self.assertEqual({1: 2, 2: 3, 3: 4, 4: 5, 5: 6}, c)
+        self.assertEqual(5, c.cache_size())
+        # Adding one more will trigger a cleanup, though
+        c[6] = 7
+        self.assertEqual({3: 4, 4: 5, 5: 6, 6: 7}, c)
+        c.resize(3, 2)
+        self.assertEqual({5: 6, 6: 7}, c)
+
+    def test_resize_larger(self):
+        c = fifo_cache.FIFOCache(5, 4)
+        c[1] = 2
+        c[2] = 3
+        c[3] = 4
+        c[4] = 5
+        c[5] = 6
+        # No cleanup, because it is the exact size
+        c.resize(10)
+        self.assertEqual({1: 2, 2: 3, 3: 4, 4: 5, 5: 6}, c)
+        self.assertEqual(10, c.cache_size())
+        c[6] = 7
+        c[7] = 8
+        c[8] = 9
+        c[9] = 10
+        c[10] = 11
+        self.assertEqual({1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9,
+                          9: 10, 10: 11}, c)
+        c[11] = 12
+        self.assertEqual({4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11,
+                          11: 12}, c)
+
     def test_setdefault(self):
         c = fifo_cache.FIFOCache(5, 4)
         c['one'] = 1
@@ -208,6 +253,7 @@
         self.assertEqual(['2'], c.values())
         self.assertEqual(['2'], list(c.itervalues()))
         self.assertEqual({1: '2'}, c)
+        self.assertEqual(1024*1024, c.cache_size())
 
     def test_missing(self):
         c = fifo_cache.FIFOSizeCache()
@@ -248,3 +294,32 @@
         c[1] = 'abcdefgh'
         self.assertEqual({}, c)
         self.assertEqual(0, c._value_size)
+
+    def test_resize_smaller(self):
+        c = fifo_cache.FIFOSizeCache(20, 16)
+        c[1] = 'a'
+        c[2] = 'bc'
+        c[3] = 'def'
+        c[4] = 'ghij'
+        # No cleanup, because it is the exact size
+        c.resize(10, 8)
+        self.assertEqual({1: 'a', 2: 'bc', 3: 'def', 4: 'ghij'}, c)
+        self.assertEqual(10, c.cache_size())
+        # Adding one more will trigger a cleanup, though
+        c[5] = 'k'
+        self.assertEqual({3: 'def', 4: 'ghij', 5: 'k'}, c)
+        c.resize(5, 4)
+        self.assertEqual({5: 'k'}, c)
+
+    def test_resize_larger(self):
+        c = fifo_cache.FIFOSizeCache(10, 8)
+        c[1] = 'a'
+        c[2] = 'bc'
+        c[3] = 'def'
+        c[4] = 'ghij'
+        c.resize(12, 10)
+        self.assertEqual({1: 'a', 2: 'bc', 3: 'def', 4: 'ghij'}, c)
+        c[5] = 'kl'
+        self.assertEqual({1: 'a', 2: 'bc', 3: 'def', 4: 'ghij', 5: 'kl'}, c)
+        c[6] = 'mn'
+        self.assertEqual({4: 'ghij', 5: 'kl', 6: 'mn'}, c)

=== modified file 'bzrlib/xml4.py'
--- a/bzrlib/xml4.py	2008-04-24 07:22:53 +0000
+++ b/bzrlib/xml4.py	2008-12-12 20:06:28 +0000
@@ -57,7 +57,7 @@
         return e
 
 
-    def _unpack_inventory(self, elt, revision_id=None):
+    def _unpack_inventory(self, elt, revision_id=None, entry_cache=None):
         """Construct from XML Element
 
         :param revision_id: Ignored parameter used by xml5.
@@ -65,14 +65,14 @@
         root_id = elt.get('file_id') or ROOT_ID
         inv = Inventory(root_id)
         for e in elt:
-            ie = self._unpack_entry(e)
+            ie = self._unpack_entry(e, entry_cache=entry_cache)
             if ie.parent_id == ROOT_ID:
                 ie.parent_id = root_id
             inv.add(ie)
         return inv
 
 
-    def _unpack_entry(self, elt):
+    def _unpack_entry(self, elt, entry_cache=None):
         ## original format inventories don't have a parent_id for
         ## nodes in the root directory, but it's cleaner to use one
         ## internally.

=== modified file 'bzrlib/xml5.py'
--- a/bzrlib/xml5.py	2008-04-24 07:22:53 +0000
+++ b/bzrlib/xml5.py	2008-12-13 03:19:40 +0000
@@ -16,6 +16,7 @@
 
 from bzrlib import (
     cache_utf8,
+    errors,
     inventory,
     xml6,
     xml8,
@@ -29,7 +30,7 @@
     format_num = '5'
     root_id = inventory.ROOT_ID
 
-    def _unpack_inventory(self, elt, revision_id):
+    def _unpack_inventory(self, elt, revision_id, entry_cache=None):
         """Construct from XML Element
         """
         root_id = elt.get('file_id') or inventory.ROOT_ID
@@ -44,13 +45,36 @@
         if data_revision_id is not None:
             revision_id = cache_utf8.encode(data_revision_id)
         inv = inventory.Inventory(root_id, revision_id=revision_id)
+        # Optimizations tested
+        #   baseline w/entry cache  2.85s
+        #   using inv._byid         2.55s
+        #   avoiding attributes     2.46s
+        #   adding assertions       2.50s
+        #   last_parent cache       2.52s (worse, removed)
+        unpack_entry = self._unpack_entry
+        byid = inv._byid
         for e in elt:
-            ie = self._unpack_entry(e)
-            if ie.parent_id is None:
-                ie.parent_id = root_id
-            inv.add(ie)
+            ie = unpack_entry(e, entry_cache=entry_cache)
+            parent_id = ie.parent_id
+            if parent_id is None:
+                ie.parent_id = parent_id = root_id
+            try:
+                parent = byid[parent_id]
+            except KeyError:
+                raise errors.BzrError("parent_id {%s} not in inventory"
+                                      % (parent_id,))
+            if ie.file_id in byid:
+                raise errors.DuplicateFileId(ie.file_id,
+                                             byid[ie.file_id])
+            if ie.name in parent.children:
+                raise errors.BzrError("%s is already versioned"
+                    % (osutils.pathjoin(inv.id2path(parent_id),
+                       ie.name).encode('utf-8'),))
+            parent.children[ie.name] = ie
+            byid[ie.file_id] = ie
         if revision_id is not None:
             inv.root.revision = revision_id
+        self._check_cache_size(len(inv), entry_cache)
         return inv
 
     def _check_revisions(self, inv):

=== modified file 'bzrlib/xml7.py'
--- a/bzrlib/xml7.py	2007-02-26 02:56:24 +0000
+++ b/bzrlib/xml7.py	2008-12-12 20:06:28 +0000
@@ -28,7 +28,7 @@
     supported_kinds = set(['file', 'directory', 'symlink', 'tree-reference'])
     format_num = '7'
 
-    def _unpack_entry(self, elt):
+    def _unpack_entry(self, elt, entry_cache=None):
         kind = elt.tag
         if not kind in self.supported_kinds:
             raise AssertionError('unsupported entry kind %s' % kind)

=== modified file 'bzrlib/xml8.py'
--- a/bzrlib/xml8.py	2008-04-24 07:22:53 +0000
+++ b/bzrlib/xml8.py	2008-12-13 03:19:40 +0000
@@ -22,6 +22,7 @@
     errors,
     inventory,
     revision as _mod_revision,
+    trace,
     )
 from bzrlib.xml_serializer import SubElement, Element, Serializer
 from bzrlib.inventory import ROOT_ID, Inventory, InventoryEntry
@@ -166,6 +167,34 @@
         if inv.root.revision is None:
             raise AssertionError()
 
+    def _check_cache_size(self, inv_size, entry_cache):
+        """Check that the entry_cache is large enough.
+
+        We want the cache to be ~2x the size of an inventory. The reason is
+        because we use a FIFO cache, and how Inventory records are likely to
+        change. In general, you have a small number of records which change
+        often, and a lot of records which do not change at all. So when the
+        cache gets full, you actually flush out a lot of the records you are
+        interested in, which means you need to recreate all of those records.
+        An LRU Cache would be better, but the overhead negates the cache
+        coherency benefit.
+
+        One way to look at it, only the size of the cache > len(inv) is your
+        'working' set. And in general, it shouldn't be a problem to hold 2
+        inventories in memory anyway.
+
+        :param inv_size: The number of entries in an inventory.
+        """
+        if entry_cache is None:
+            return
+        # 1.5 times might also be reasonable.
+        recommended_min_cache_size = inv_size * 1.5
+        if entry_cache.cache_size() < recommended_min_cache_size:
+            recommended_cache_size = inv_size * 2
+            trace.mutter('Resizing the inventory entry cache from %d to %d',
+                         entry_cache.cache_size(), recommended_cache_size)
+            entry_cache.resize(recommended_cache_size)
+
     def write_inventory_to_lines(self, inv):
         """Return a list of lines with the encoded inventory."""
         return self.write_inventory(inv, None)
@@ -337,7 +366,7 @@
             prop_elt.tail = '\n'
         top_elt.tail = '\n'
 
-    def _unpack_inventory(self, elt, revision_id=None):
+    def _unpack_inventory(self, elt, revision_id=None, entry_cache=None):
         """Construct from XML Element"""
         if elt.tag != 'inventory':
             raise errors.UnexpectedInventoryFormat('Root tag is %r' % elt.tag)
@@ -350,46 +379,100 @@
             revision_id = cache_utf8.encode(revision_id)
         inv = inventory.Inventory(root_id=None, revision_id=revision_id)
         for e in elt:
-            ie = self._unpack_entry(e)
+            ie = self._unpack_entry(e, entry_cache=entry_cache)
             inv.add(ie)
+        self._check_cache_size(len(inv), entry_cache)
         return inv
 
-    def _unpack_entry(self, elt):
+    def _unpack_entry(self, elt, entry_cache=None):
+        elt_get = elt.get
+        file_id = elt_get('file_id')
+        revision = elt_get('revision')
+        # Check and see if we have already unpacked this exact entry
+        # Some timings for "repo.revision_trees(last_100_revs)"
+        #               bzr     mysql
+        #   unmodified  4.1s    40.8s
+        #   using lru   3.5s
+        #   using fifo  2.83s   29.1s
+        #   lru._cache  2.8s
+        #   dict        2.75s   26.8s
+        #   inv.add     2.5s    26.0s
+        #   no_copy     2.00s   20.5s
+        #   no_c,dict   1.95s   18.0s
+        # Note that a cache of 10k nodes is more than sufficient to hold all of
+        # the inventory for the last 100 revs for bzr, but not for mysql (20k
+        # is enough for mysql, which saves the same 2s as using a dict)
+
+        # Breakdown of mysql using time.clock()
+        #   4.1s    2 calls to element.get for file_id, revision_id
+        #   4.5s    cache_hit lookup
+        #   7.1s    InventoryFile.copy()
+        #   2.4s    InventoryDirectory.copy()
+        #   0.4s    decoding unique entries
+        #   1.6s    decoding entries after FIFO fills up
+        #   0.8s    Adding nodes to FIFO (including flushes)
+        #   0.1s    cache miss lookups
+        # Using an LRU cache
+        #   4.1s    2 calls to element.get for file_id, revision_id
+        #   9.9s    cache_hit lookup
+        #   10.8s   InventoryEntry.copy()
+        #   0.3s    cache miss lookus
+        #   1.2s    decoding entries
+        #   1.0s    adding nodes to LRU
+        if entry_cache is not None and revision is not None:
+            key = (file_id, revision)
+            try:
+                # We copy it, because some operatations may mutate it
+                cached_ie = entry_cache[key]
+            except KeyError:
+                pass
+            else:
+                # Only copying directory entries drops us 2.85s => 2.35s
+                # if cached_ie.kind == 'directory':
+                #     return cached_ie.copy()
+                # return cached_ie
+                return cached_ie.copy()
+
         kind = elt.tag
         if not InventoryEntry.versionable_kind(kind):
             raise AssertionError('unsupported entry kind %s' % kind)
 
         get_cached = _get_utf8_or_ascii
 
-        parent_id = elt.get('parent_id')
+        file_id = get_cached(file_id)
+        if revision is not None:
+            revision = get_cached(revision)
+        parent_id = elt_get('parent_id')
         if parent_id is not None:
             parent_id = get_cached(parent_id)
-        file_id = get_cached(elt.get('file_id'))
 
         if kind == 'directory':
             ie = inventory.InventoryDirectory(file_id,
-                                              elt.get('name'),
+                                              elt_get('name'),
                                               parent_id)
         elif kind == 'file':
             ie = inventory.InventoryFile(file_id,
-                                         elt.get('name'),
+                                         elt_get('name'),
                                          parent_id)
-            ie.text_sha1 = elt.get('text_sha1')
-            if elt.get('executable') == 'yes':
+            ie.text_sha1 = elt_get('text_sha1')
+            if elt_get('executable') == 'yes':
                 ie.executable = True
-            v = elt.get('text_size')
+            v = elt_get('text_size')
             ie.text_size = v and int(v)
         elif kind == 'symlink':
             ie = inventory.InventoryLink(file_id,
-                                         elt.get('name'),
+                                         elt_get('name'),
                                          parent_id)
-            ie.symlink_target = elt.get('symlink_target')
+            ie.symlink_target = elt_get('symlink_target')
         else:
             raise errors.UnsupportedInventoryKind(kind)
-        revision = elt.get('revision')
-        if revision is not None:
-            revision = get_cached(revision)
         ie.revision = revision
+        if revision is not None and entry_cache is not None:
+            # We cache a copy() because callers like to mutate objects, and
+            # that would cause the item in cache to mutate as well.
+            # This has a small effect on many-inventory performance, because
+            # the majority fraction is spent in cache hits, not misses.
+            entry_cache[key] = ie.copy()
 
         return ie
 

=== modified file 'bzrlib/xml_serializer.py'
--- a/bzrlib/xml_serializer.py	2008-06-12 15:51:15 +0000
+++ b/bzrlib/xml_serializer.py	2008-12-13 03:19:40 +0000
@@ -59,7 +59,8 @@
     def write_inventory_to_string(self, inv):
         raise NotImplementedError(self.write_inventory_to_string)
 
-    def read_inventory_from_string(self, xml_string, revision_id=None):
+    def read_inventory_from_string(self, xml_string, revision_id=None,
+                                   entry_cache=None):
         """Read xml_string into an inventory object.
 
         :param xml_string: The xml to read.
@@ -70,9 +71,13 @@
             serialised without a revision identifier can be given the right
             revision id (but not for working tree inventories where users can
             edit the data without triggering checksum errors or anything).
+        :param entry_cache: An optional cache of InventoryEntry objects. If
+            supplied we will look up entries via (file_id, revision_id) which
+            should map to a valid InventoryEntry (File/Directory/etc) object.
         """
         try:
-            return self._unpack_inventory(fromstring(xml_string), revision_id)
+            return self._unpack_inventory(fromstring(xml_string), revision_id,
+                                          entry_cache=entry_cache)
         except ParseError, e:
             raise errors.UnexpectedInventoryFormat(e)
 




More information about the bazaar-commits mailing list