Rev 5836: (jameinel) Bug #380202, in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Fri May 6 16:36:19 UTC 2011


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 5836 [merge]
revision-id: pqm at pqm.ubuntu.com-20110506163615-6xc9mzcb7t2824m2
parent: pqm at pqm.ubuntu.com-20110506091002-ge55v96dgz9hfz67
parent: john at arbash-meinel.com-20110506151733-vi5zpzl5ij6gwt17
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Fri 2011-05-06 16:36:15 +0000
message:
  (jameinel) Bug #380202,
   don't bother rewriting the dirstate if <10 items have changed. (John A
   Meinel)
modified:
  bzrlib/_dirstate_helpers_pyx.pyx dirstate_helpers.pyx-20070503201057-u425eni465q4idwn-3
  bzrlib/dirstate.py             dirstate.py-20060728012006-d6mvoihjb3je9peu-1
  bzrlib/tests/per_workingtree/test_workingtree.py test_workingtree.py-20060203003124-817757d3e31444fb
  bzrlib/tests/test__dirstate_helpers.py test_dirstate_helper-20070504035751-jsbn00xodv0y1eve-2
  bzrlib/tests/test_dirstate.py  test_dirstate.py-20060728012006-d6mvoihjb3je9peu-2
  bzrlib/workingtree_4.py        workingtree_4.py-20070208044105-5fgpc5j3ljlh5q6c-1
  doc/developers/dirstate.txt    dirstate.txt-20070618020404-cdhv0ecgrukomemg-2
  doc/en/release-notes/bzr-2.4.txt bzr2.4.txt-20110114053217-k7ym9jfz243fddjm-1
=== modified file 'bzrlib/_dirstate_helpers_pyx.pyx'
--- a/bzrlib/_dirstate_helpers_pyx.pyx	2011-04-19 13:59:07 +0000
+++ b/bzrlib/_dirstate_helpers_pyx.pyx	2011-04-22 14:12:22 +0000
@@ -941,6 +941,12 @@
             # re-writing a dirstate for just this
             worth_saving = 0
     elif minikind == c'l':
+        if saved_minikind == c'l':
+            # If the object hasn't changed kind, it isn't worth saving the
+            # dirstate just for a symlink. The default is 'fast symlinks' which
+            # save the target in the inode entry, rather than separately. So to
+            # stat, we've already read everything off disk.
+            worth_saving = 0
         link_or_sha1 = self._read_link(abspath, saved_link_or_sha1)
         if self._cutoff_time is None:
             self._sha_cutoff_time()
@@ -952,7 +958,9 @@
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
     if worth_saving:
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        # Note: _mark_modified will only set IN_MEMORY_HASH_MODIFIED here;
+        # save() still decides whether this change is worth writing out.
+        self._mark_modified([entry])
     return link_or_sha1
 
 

=== modified file 'bzrlib/dirstate.py'
--- a/bzrlib/dirstate.py	2011-04-21 20:32:16 +0000
+++ b/bzrlib/dirstate.py	2011-05-06 15:15:44 +0000
@@ -366,6 +366,7 @@
     NOT_IN_MEMORY = 0
     IN_MEMORY_UNMODIFIED = 1
     IN_MEMORY_MODIFIED = 2
+    IN_MEMORY_HASH_MODIFIED = 3 # Only hash-cache updates
 
     # A pack_stat (the x's) that is just noise and will never match the output
     # of base64 encode.
@@ -375,11 +376,15 @@
     HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
     HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'
 
-    def __init__(self, path, sha1_provider):
+    def __init__(self, path, sha1_provider, worth_saving_limit=0):
         """Create a DirState object.
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed.
+            -1 means never save hash changes, 0 means always save hash changes.
         """
         # _header_state and _dirblock_state represent the current state
         # of the dirstate metadata and the per-row data respectively.
@@ -422,11 +427,46 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # The set of known hash changes
+        self._known_hash_changes = set()
+        # How many hash changed entries can we have without saving
+        self._worth_saving_limit = worth_saving_limit
 
     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)
 
+    def _mark_modified(self, hash_changed_entries=None, header_modified=False):
+        """Mark this dirstate as modified.
+
+        :param hash_changed_entries: if non-None, mark just these entries as
+            having their hash modified.
+        :param header_modified: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified hash entries: %s", hash_changed_entries)
+        if hash_changed_entries:
+            self._known_hash_changes.update([e[0] for e in hash_changed_entries])
+            if self._dirblock_state in (DirState.NOT_IN_MEMORY,
+                                        DirState.IN_MEMORY_UNMODIFIED):
+                # If the dirstate is already marked as IN_MEMORY_MODIFIED, then
+                # that takes precedence.
+                self._dirblock_state = DirState.IN_MEMORY_HASH_MODIFIED
+        else:
+            # TODO: Since we now have an IN_MEMORY_HASH_MODIFIED state, we
+            #       should fail noisily if someone tries to set
+            #       IN_MEMORY_MODIFIED but we don't have a write-lock!
+            # We don't know exactly what changed so disable smart saving
+            self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_modified:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._known_hash_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.
 
@@ -558,7 +598,7 @@
         if kind == 'directory':
            # insert a new dirblock
            self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._add_to_id_index(self._id_index, entry_key)
 
@@ -1030,8 +1070,7 @@
 
         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_modified=True)
 
     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1567,8 +1606,7 @@
             # the active tree.
             raise errors.InconsistentDeltaDelta(delta, "error from _get_entry.")
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_modified=True)
         self._id_index = None
         return
 
@@ -1747,7 +1785,7 @@
                 and stat_value.st_ctime < self._cutoff_time):
                 entry[1][0] = ('f', sha1, stat_value.st_size, entry[1][0][3],
                                packed_stat)
-                self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+                self._mark_modified([entry])
 
     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1811,14 +1849,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)
 
     def _get_ghosts_line(self, ghost_ids):
@@ -1829,6 +1866,10 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)
 
+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into an entry record.
 
@@ -2222,18 +2263,22 @@
         """The number of parent entries in each record row."""
         return len(self._parents) - len(self._ghosts)
 
-    @staticmethod
-    def on_file(path, sha1_provider=None):
+    @classmethod
+    def on_file(cls, path, sha1_provider=None, worth_saving_limit=0):
         """Construct a DirState on the file at path "path".
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
             If None, a DefaultSHA1Provider is used.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         :return: An unlocked DirState object, associated with the given path.
         """
         if sha1_provider is None:
             sha1_provider = DefaultSHA1Provider()
-        result = DirState(path, sha1_provider)
+        result = cls(path, sha1_provider,
+                     worth_saving_limit=worth_saving_limit)
         return result
 
     def _read_dirblocks_if_needed(self):
@@ -2331,9 +2376,11 @@
             trace.mutter('Not saving DirState because '
                     '_changes_aborted is set.')
             return
-        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
-            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
-
+        # TODO: Since we now distinguish IN_MEMORY_MODIFIED from
+        #       IN_MEMORY_HASH_MODIFIED, we should only fail quietly if we fail
+        #       to save an IN_MEMORY_HASH_MODIFIED, and fail *noisily* if we
+        #       fail to save IN_MEMORY_MODIFIED
+        if self._worth_saving():
             grabbed_write_lock = False
             if self._lock_state != 'w':
                 grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock()
@@ -2347,12 +2394,12 @@
                     # We couldn't grab a write lock, so we switch back to a read one
                     return
             try:
+                lines = self.get_lines()
                 self._state_file.seek(0)
-                self._state_file.writelines(self.get_lines())
+                self._state_file.writelines(lines)
                 self._state_file.truncate()
                 self._state_file.flush()
-                self._header_state = DirState.IN_MEMORY_UNMODIFIED
-                self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+                self._mark_unmodified()
             finally:
                 if grabbed_write_lock:
                     self._lock_token = self._lock_token.restore_read_lock()
@@ -2361,6 +2408,26 @@
                     #       not changed contents. Since restore_read_lock may
                     #       not be an atomic operation.
 
+    def _worth_saving(self):
+        """Is it worth saving the dirstate or not?"""
+        if (self._header_state == DirState.IN_MEMORY_MODIFIED
+            or self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
+            return True
+        if self._dirblock_state == DirState.IN_MEMORY_HASH_MODIFIED:
+            if self._worth_saving_limit == -1:
+                # We never save hash changes when the limit is -1
+                return False
+            # If we're using smart saving and only a small number of
+            # entries have changed their hash, don't bother saving. John has
+            # suggested using a heuristic here based on the size of the
+            # changed files and/or tree. For now, we go with a configurable
+            # number of changes, keeping the overhead of the calculation
+            # as low as possible. (This also keeps all existing
+            # tests passing as the default is 0, i.e. always save.)
+            if len(self._known_hash_changes) >= self._worth_saving_limit:
+                return True
+        return False
+
     def _set_data(self, parent_ids, dirblocks):
         """Set the full dirstate data in memory.
 
@@ -2374,8 +2441,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_modified=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2401,7 +2467,14 @@
         self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
+        # XXX: This was added by Ian, we need to make sure there
+        #      are tests for it, because it isn't in bzr.dev TRUNK
+        #      It looks like the only place it is called is in setting the root
+        #      id of the tree. So probably we never had an _id_index when we
+        #      don't even have a root yet.
+        if self._id_index is not None:
+            self._add_to_id_index(self._id_index, entry[0])
 
     def set_parent_trees(self, trees, ghosts):
         """Set the parent trees for the dirstate.
@@ -2542,8 +2615,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_modified=True)
         self._id_index = id_index
 
     def _sort_entries(self, entry_list):
@@ -2686,7 +2758,7 @@
                         current_old[0][1].decode('utf8'))
                 self._make_absent(current_old)
                 current_old = advance(old_iterator)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         self._id_index = None
         self._packed_stat_index = None
         if tracing:
@@ -2758,7 +2830,7 @@
             if update_tree_details[0][0] == 'a': # absent
                 raise AssertionError('bad row %r' % (update_tree_details,))
             update_tree_details[0] = DirState.NULL_PARENT_DETAILS
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         return last_reference
 
     def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2933,7 +3005,7 @@
             if not present:
                 self._dirblocks.insert(block_index, (subdir_key[0], []))
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
 
     def _maybe_remove_row(self, block, index, id_index):
         """Remove index if it is absent or relocated across the row.
@@ -3242,6 +3314,8 @@
         else:
             worth_saving = False
     elif minikind == 'l':
+        if saved_minikind == 'l':
+            worth_saving = False
         link_or_sha1 = state._read_link(abspath, saved_link_or_sha1)
         if state._cutoff_time is None:
             state._sha_cutoff_time()
@@ -3253,7 +3327,7 @@
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
     if worth_saving:
-        state._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        state._mark_modified([entry])
     return link_or_sha1
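The save-gating behaviour this patch introduces can be sketched as a small
standalone model (illustrative names only — `SaveGate`, `mark_modified`,
and `worth_saving` are stand-ins for the real `DirState._mark_modified`
and `DirState._worth_saving` shown in the hunks above):

```python
# Minimal model of the dirstate save gating: hash-only changes only
# trigger a write once enough entries have accumulated, while a real
# record change always does. States mirror DirState's flags.
NOT_IN_MEMORY = 0
IN_MEMORY_UNMODIFIED = 1
IN_MEMORY_MODIFIED = 2
IN_MEMORY_HASH_MODIFIED = 3  # only hash-cache updates


class SaveGate(object):
    def __init__(self, worth_saving_limit=0):
        self._dirblock_state = IN_MEMORY_UNMODIFIED
        self._known_hash_changes = set()
        self._worth_saving_limit = worth_saving_limit

    def mark_modified(self, hash_changed_keys=None):
        if hash_changed_keys:
            # Hash-only changes never downgrade a full modification.
            self._known_hash_changes.update(hash_changed_keys)
            if self._dirblock_state in (NOT_IN_MEMORY,
                                        IN_MEMORY_UNMODIFIED):
                self._dirblock_state = IN_MEMORY_HASH_MODIFIED
        else:
            # We don't know exactly what changed, so always save.
            self._dirblock_state = IN_MEMORY_MODIFIED

    def worth_saving(self):
        if self._dirblock_state == IN_MEMORY_MODIFIED:
            return True
        if self._dirblock_state == IN_MEMORY_HASH_MODIFIED:
            if self._worth_saving_limit == -1:
                return False  # never save hash-only changes
            return len(self._known_hash_changes) >= self._worth_saving_limit
        return False


gate = SaveGate(worth_saving_limit=2)
gate.mark_modified(hash_changed_keys=['c'])
assert not gate.worth_saving()   # one hash change is below the limit
gate.mark_modified(hash_changed_keys=['d'])
assert gate.worth_saving()       # two hash changes reach the limit
```

With `worth_saving_limit=-1` hash-only changes are never saved, matching
the docstring added to `DirState.__init__`.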
 
 

=== modified file 'bzrlib/tests/per_workingtree/test_workingtree.py'
--- a/bzrlib/tests/per_workingtree/test_workingtree.py	2011-04-21 20:32:16 +0000
+++ b/bzrlib/tests/per_workingtree/test_workingtree.py	2011-05-06 15:15:44 +0000
@@ -1122,3 +1122,38 @@
         # above the control dir but we might need to relax that?
         self.assertEqual(wt.control_url.find(wt.user_url), 0)
         self.assertEqual(wt.control_url, wt.control_transport.base)
+
+
+class TestWorthSavingLimit(TestCaseWithWorkingTree):
+
+    def make_wt_with_worth_saving_limit(self):
+        wt = self.make_branch_and_tree('wt')
+        if getattr(wt, '_worth_saving_limit', None) is None:
+            raise tests.TestNotApplicable('no _worth_saving_limit for'
+                                          ' this tree type')
+        wt.lock_write()
+        self.addCleanup(wt.unlock)
+        return wt
+
+    def test_not_set(self):
+        # Default should be 10
+        wt = self.make_wt_with_worth_saving_limit()
+        self.assertEqual(10, wt._worth_saving_limit())
+        ds = wt.current_dirstate()
+        self.assertEqual(10, ds._worth_saving_limit)
+
+    def test_set_in_branch(self):
+        wt = self.make_wt_with_worth_saving_limit()
+        config = wt.branch.get_config()
+        config.set_user_option('bzr.workingtree.worth_saving_limit', '20')
+        self.assertEqual(20, wt._worth_saving_limit())
+        ds = wt.current_dirstate()
+        self.assertEqual(10, ds._worth_saving_limit)
+
+    def test_invalid(self):
+        wt = self.make_wt_with_worth_saving_limit()
+        config = wt.branch.get_config()
+        config.set_user_option('bzr.workingtree.worth_saving_limit', 'a')
+        # If the config entry is invalid, default to 10
+        # TODO: This writes a warning to the user, trap it somehow
+        self.assertEqual(10, wt._worth_saving_limit())

=== modified file 'bzrlib/tests/test__dirstate_helpers.py'
--- a/bzrlib/tests/test__dirstate_helpers.py	2011-04-29 10:14:38 +0000
+++ b/bzrlib/tests/test__dirstate_helpers.py	2011-05-06 15:15:44 +0000
@@ -822,25 +822,34 @@
 
     def test_observed_sha1_cachable(self):
         state, entry = self.get_state_with_a()
+        state.save()
         atime = time.time() - 10
         self.build_tree(['a'])
-        statvalue = os.lstat('a')
-        statvalue = test_dirstate._FakeStat(statvalue.st_size, atime, atime,
-            statvalue.st_dev, statvalue.st_ino, statvalue.st_mode)
+        statvalue = test_dirstate._FakeStat.from_stat(os.lstat('a'))
+        statvalue.st_mtime = statvalue.st_ctime = atime
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
         state._observed_sha1(entry, "foo", statvalue)
         self.assertEqual('foo', entry[1][0][1])
         packed_stat = dirstate.pack_stat(statvalue)
         self.assertEqual(packed_stat, entry[1][0][4])
+        self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
+                         state._dirblock_state)
 
     def test_observed_sha1_not_cachable(self):
         state, entry = self.get_state_with_a()
+        state.save()
         oldval = entry[1][0][1]
         oldstat = entry[1][0][4]
         self.build_tree(['a'])
         statvalue = os.lstat('a')
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
         state._observed_sha1(entry, "foo", statvalue)
         self.assertEqual(oldval, entry[1][0][1])
         self.assertEqual(oldstat, entry[1][0][4])
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
 
     def test_update_entry(self):
         state, _ = self.get_state_with_a()
@@ -959,35 +968,38 @@
         # Dirblock is not updated (the link is too new)
         self.assertEqual([('l', '', 6, False, dirstate.DirState.NULLSTAT)],
                          entry[1])
-        self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
+        # The file entry turned into a symlink, that is considered
+        # HASH modified worthy.
+        self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
                          state._dirblock_state)
 
         # Because the stat_value looks new, we should re-read the target
+        del state._log[:]
         link_or_sha1 = self.update_entry(state, entry, abspath='a',
                                           stat_value=stat_value)
         self.assertEqual('target', link_or_sha1)
-        self.assertEqual([('read_link', 'a', ''),
-                          ('read_link', 'a', ''),
-                         ], state._log)
+        self.assertEqual([('read_link', 'a', '')], state._log)
         self.assertEqual([('l', '', 6, False, dirstate.DirState.NULLSTAT)],
                          entry[1])
+        state.save()
         state.adjust_time(+20) # Skip into the future, all files look old
+        del state._log[:]
         link_or_sha1 = self.update_entry(state, entry, abspath='a',
                                           stat_value=stat_value)
+        # The symlink stayed a symlink. So while it is new enough to cache, we
+        # don't bother setting the flag, because it is not really worth saving
+        # (when we stat the symlink, we'll have paged in the target.)
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
         self.assertEqual('target', link_or_sha1)
         # We need to re-read the link because only now can we cache it
-        self.assertEqual([('read_link', 'a', ''),
-                          ('read_link', 'a', ''),
-                          ('read_link', 'a', ''),
-                         ], state._log)
+        self.assertEqual([('read_link', 'a', '')], state._log)
         self.assertEqual([('l', 'target', 6, False, packed_stat)],
                          entry[1])
 
+        del state._log[:]
         # Another call won't re-read the link
-        self.assertEqual([('read_link', 'a', ''),
-                          ('read_link', 'a', ''),
-                          ('read_link', 'a', ''),
-                         ], state._log)
+        self.assertEqual([], state._log)
         link_or_sha1 = self.update_entry(state, entry, abspath='a',
                                           stat_value=stat_value)
         self.assertEqual('target', link_or_sha1)

=== modified file 'bzrlib/tests/test_dirstate.py'
--- a/bzrlib/tests/test_dirstate.py	2011-04-19 14:22:27 +0000
+++ b/bzrlib/tests/test_dirstate.py	2011-04-22 14:12:22 +0000
@@ -592,7 +592,7 @@
 
             # The dirblock has been updated
             self.assertEqual(st.st_size, entry[1][0][2])
-            self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
+            self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
                              state._dirblock_state)
 
             del entry
@@ -625,7 +625,7 @@
             sha1sum = dirstate.update_entry(state, entry, 'a-file', st)
             self.assertEqual('ecc5374e9ed82ad3ea3b4d452ea995a5fd3e70e3',
                              sha1sum)
-            self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
+            self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
                              state._dirblock_state)
 
             # Now, before we try to save, grab another dirstate, and take out a
@@ -1339,6 +1339,53 @@
             tree1.unlock()
 
 
+class TestDirStateHashUpdates(TestCaseWithDirState):
+
+    def do_update_entry(self, state, path):
+        entry = state._get_entry(0, path_utf8=path)
+        stat = os.lstat(path)
+        return dirstate.update_entry(state, entry, os.path.abspath(path), stat)
+
+    def test_worth_saving_limit_avoids_writing(self):
+        tree = self.make_branch_and_tree('.')
+        self.build_tree(['c', 'd'])
+        tree.lock_write()
+        tree.add(['c', 'd'], ['c-id', 'd-id'])
+        tree.commit('add c and d')
+        state = InstrumentedDirState.on_file(tree.current_dirstate()._filename,
+                                             worth_saving_limit=2)
+        tree.unlock()
+        state.lock_write()
+        self.addCleanup(state.unlock)
+        state._read_dirblocks_if_needed()
+        state.adjust_time(+20) # Allow things to be cached
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
+        f = open(state._filename, 'rb')
+        try:
+            content = f.read()
+        finally:
+            f.close()
+        self.do_update_entry(state, 'c')
+        self.assertEqual(1, len(state._known_hash_changes))
+        self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
+                         state._dirblock_state)
+        state.save()
+        # It should not have set the state to IN_MEMORY_UNMODIFIED because the
+        # hash values haven't been written out.
+        self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
+                         state._dirblock_state)
+        self.assertFileEqual(content, state._filename)
+        self.assertEqual(dirstate.DirState.IN_MEMORY_HASH_MODIFIED,
+                         state._dirblock_state)
+        self.do_update_entry(state, 'd')
+        self.assertEqual(2, len(state._known_hash_changes))
+        state.save()
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
+        self.assertEqual(0, len(state._known_hash_changes))
+
+
 class TestGetLines(TestCaseWithDirState):
 
     def test_get_line_with_2_rows(self):
@@ -1737,8 +1784,9 @@
 class InstrumentedDirState(dirstate.DirState):
     """A DirState with instrumented sha1 functionality."""
 
-    def __init__(self, path, sha1_provider):
-        super(InstrumentedDirState, self).__init__(path, sha1_provider)
+    def __init__(self, path, sha1_provider, worth_saving_limit=0):
+        super(InstrumentedDirState, self).__init__(path, sha1_provider,
+            worth_saving_limit=worth_saving_limit)
         self._time_offset = 0
         self._log = []
         # member is dynamically set in DirState.__init__ to turn on trace

=== modified file 'bzrlib/workingtree_4.py'
--- a/bzrlib/workingtree_4.py	2011-05-03 23:54:46 +0000
+++ b/bzrlib/workingtree_4.py	2011-05-06 15:15:44 +0000
@@ -76,6 +76,8 @@
 
 class DirStateWorkingTree(InventoryWorkingTree):
 
+    _DEFAULT_WORTH_SAVING_LIMIT = 10
+
     def __init__(self, basedir,
                  branch,
                  _control_files=None,
@@ -229,7 +231,7 @@
         local_path = self.bzrdir.get_workingtree_transport(None
             ).local_abspath('dirstate')
         self._dirstate = dirstate.DirState.on_file(local_path,
-            self._sha1_provider())
+            self._sha1_provider(), self._worth_saving_limit())
         return self._dirstate
 
     def _sha1_provider(self):
@@ -244,6 +246,26 @@
         else:
             return None
 
+    def _worth_saving_limit(self):
+        """How many hash changes are ok before we must save the dirstate.
+
+        :return: an integer. -1 means never save.
+        """
+        config = self.branch.get_config()
+        val = config.get_user_option('bzr.workingtree.worth_saving_limit')
+        if val is None:
+            val = self._DEFAULT_WORTH_SAVING_LIMIT
+        else:
+            try:
+                val = int(val)
+            except ValueError, e:
+                trace.warning('Invalid config value for'
+                              ' "bzr.workingtree.worth_saving_limit":'
+                              ' value %r is not an integer.'
+                              % (val,))
+                val = self._DEFAULT_WORTH_SAVING_LIMIT
+        return val
+
     def filter_unversioned_files(self, paths):
         """Filter out paths that are versioned.
 
@@ -860,7 +882,7 @@
                 rollback_rename()
                 raise
             result.append((from_rel, to_rel))
-            state._dirblock_state = dirstate.DirState.IN_MEMORY_MODIFIED
+            state._mark_modified()
             self._make_dirty(reset_inventory=False)
 
         return result
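The config fallback in `_worth_saving_limit` above can be modelled in
isolation (`parse_worth_saving_limit` is an illustrative helper, not
bzrlib API): unset or non-integer values fall back to the default of 10.

```python
# Sketch of the option parsing in _worth_saving_limit: accept any
# integer string, fall back to the default when the option is unset
# or invalid (the real code also warns via trace.warning()).
DEFAULT_WORTH_SAVING_LIMIT = 10


def parse_worth_saving_limit(raw):
    if raw is None:
        return DEFAULT_WORTH_SAVING_LIMIT
    try:
        return int(raw)
    except ValueError:
        return DEFAULT_WORTH_SAVING_LIMIT


assert parse_worth_saving_limit(None) == 10   # option not set
assert parse_worth_saving_limit('20') == 20   # valid override
assert parse_worth_saving_limit('a') == 10    # invalid, use default
```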

=== modified file 'doc/developers/dirstate.txt'
--- a/doc/developers/dirstate.txt	2009-09-09 15:51:18 +0000
+++ b/doc/developers/dirstate.txt	2011-04-22 14:12:22 +0000
@@ -4,3 +4,57 @@
 Don't really need the hashes of the current versions - just knowing
 whether they've changed or not will generally be enough - and just the
 mtime and ctime of a point in time may be enough?
+
+
+``_dirblock_state``
+-------------------
+
+There are currently four states that ``_dirblock_state`` can take:
+
+ 1. NOT_IN_MEMORY
+    The actual content blocks have not been read at all.
+ 2. IN_MEMORY_UNMODIFIED
+    The content blocks have been read and are available for use. They have
+    not been changed at all versus what was written on disk when we read
+    them.
+ 3. IN_MEMORY_HASH_MODIFIED
+    We have updated the in-memory state, but only to record the
+    sha1/symlink target value and the stat value that marks this
+    information as 'fresh'.
+ 4. IN_MEMORY_MODIFIED
+    We have updated an actual record. (Parent lists, added a new file,
+    deleted something, etc.) In this state, we must always write out the
+    dirstate, or some user action will be lost.
+
+
+IN_MEMORY_HASH_MODIFIED
+~~~~~~~~~~~~~~~~~~~~~~~
+
+This state is a bit special, so deserves its own topic.  If we are
+IN_MEMORY_HASH_MODIFIED, we only write out the dirstate if enough records
+have been updated. The idea is that if we would save future I/O by writing
+an updated dirstate, then we should do so. The threshold for this is set
+by "worth_saving_limit". The default is that at least 10 entries must be
+updated in order to consider the dirstate file worth updating.
+
+Going one step further, updates to newly added files, symlinks, and
+directory entries are treated specially. We know that we will always stat
+all entries in the tree so that we can observe *if* they have changed. In
+the case of directories, everything we know about them comes from that
+stat value; there is no extra content to read. So an updated directory
+entry doesn't move us to IN_MEMORY_HASH_MODIFIED. However, if there are
+other modifications worth saving, we will go ahead and save the directory
+entry update at the same time.
+
+Similarly, symlink targets are commonly stored in the inode entry
+directly. So once we have stat'ed the symlink, we already have its target
+information in memory. The one caveat is that if an object we thought was
+a file has become a directory or symlink, we do treat the change as worth
+saving.
+
+In the case of newly added files, we never have to read their content to
+know that they are different from the basis tree. So saving the updated
+information also won't save a future read.
+
+
+.. vim: ft=rst tw=74 et
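The symlink fast path the new doc section describes corresponds to the
`minikind == 'l'` hunks above, and can be sketched as a tiny predicate
(`symlink_update_worth_saving` is an illustrative name, not bzrlib API):

```python
# Sketch of the symlink fast path: stat'ing a "fast symlink" already
# pages in its target, so re-caching the target alone isn't worth a
# dirstate write; a change of kind, however, is.
def symlink_update_worth_saving(saved_minikind, new_minikind):
    if new_minikind == 'l' and saved_minikind == 'l':
        return False  # still a symlink: nothing new worth persisting
    return True  # kind changed (e.g. file -> symlink): worth saving


assert not symlink_update_worth_saving('l', 'l')
assert symlink_update_worth_saving('f', 'l')
```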

=== modified file 'doc/en/release-notes/bzr-2.4.txt'
--- a/doc/en/release-notes/bzr-2.4.txt	2011-05-06 07:25:24 +0000
+++ b/doc/en/release-notes/bzr-2.4.txt	2011-05-06 15:17:33 +0000
@@ -29,6 +29,13 @@
 * Slightly reduced memory consumption when fetching into a 2a repository
   by reusing existing caching a little better.  (Andrew Bennetts)
 
+* Speed up ``bzr status`` slightly when only a few files are modified.
+  We now track how many entries have updated hash information, and only
+  rewrite the dirstate file if enough of them have changed. The default
+  threshold is 10, and can be overridden by setting the branch option
+  "``bzr.workingtree.worth_saving_limit``".
+  (Ian Clatworthy, John Arbash Meinel, #380202)
+
 Bug Fixes
 *********
 
@@ -144,6 +151,11 @@
 * ``bzr log`` now works on revisions which are not in the current branch.
   (Matt Giuca, #241998)
 
+* Don't rewrite the dirstate file when only uninteresting changes have
+  occurred. This can significantly improve ``bzr status`` times when
+  there are only small changes to a large tree.
+  (Ian Clatworthy, John Arbash Meinel, #380202)
+
 * Lazy hooks are now reset between test runs. (Jelmer Vernooij, #745566)
 
 * ``bzrlib.merge.Merge`` now calls ``iter_changes`` without



