Rev 2499: (John Arbash Meinel, r=robert) Tweak _iter_changes and update_entry to make bzr status and bzr diff much faster in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Thu May 31 22:08:37 BST 2007


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 2499
revision-id: pqm at pqm.ubuntu.com-20070531210833-8ptk86ocu822hjd5
parent: pqm at pqm.ubuntu.com-20070530152720-qdjd3zs950nindfs
parent: john at arbash-meinel.com-20070531202904-34h7ygudo7qq9ha1
committer: Canonical.com Patch Queue Manager<pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Thu 2007-05-31 22:08:33 +0100
message:
  (John Arbash Meinel, r=robert) Tweak _iter_changes and update_entry to make bzr status and bzr diff much faster
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  bzrlib/dirstate.py             dirstate.py-20060728012006-d6mvoihjb3je9peu-1
  bzrlib/osutils.py              osutils.py-20050309040759-eeaff12fbf77ac86
  bzrlib/tests/test_dirstate.py  test_dirstate.py-20060728012006-d6mvoihjb3je9peu-2
  bzrlib/workingtree_4.py        workingtree_4.py-20070208044105-5fgpc5j3ljlh5q6c-1
    ------------------------------------------------------------
    revno: 2485.3.14
    merged: john at arbash-meinel.com-20070531202904-34h7ygudo7qq9ha1
    parent: john at arbash-meinel.com-20070531141758-i5f7gqddjm497ij8
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-31 15:29:04 -0500
    message:
      Update the code so that symlinks aren't cached at incorrect times
      and fix the tests so that they don't assume files and symlinks
      get cached even when the timestamp doesn't declare them 'safe'.
    ------------------------------------------------------------
    revno: 2485.3.13
    merged: john at arbash-meinel.com-20070531141758-i5f7gqddjm497ij8
    parent: john at arbash-meinel.com-20070517182306-dhc0cf38l5fhv2dj
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-31 09:17:58 -0500
    message:
      NEWS
    ------------------------------------------------------------
    revno: 2485.3.12
    merged: john at arbash-meinel.com-20070517182306-dhc0cf38l5fhv2dj
    parent: john at arbash-meinel.com-20070517182243-3844dyynbautwcwu
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-17 19:23:06 +0100
    message:
      Delay joining the path until we are actually going to be using it.
    ------------------------------------------------------------
    revno: 2485.3.11
    merged: john at arbash-meinel.com-20070517182243-3844dyynbautwcwu
    parent: john at arbash-meinel.com-20070517175221-edsgtdww5ogxhmqv
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-17 19:22:43 +0100
    message:
      Correct DirState.update_entry to check when caching the sha1
      not when returning it.
    ------------------------------------------------------------
    revno: 2485.3.10
    merged: john at arbash-meinel.com-20070517175221-edsgtdww5ogxhmqv
    parent: john at arbash-meinel.com-20070517171133-l0rfa3x4j2zef8jf
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-17 18:52:21 +0100
    message:
      Remove some extra comments.
    ------------------------------------------------------------
    revno: 2485.3.9
    merged: john at arbash-meinel.com-20070517171133-l0rfa3x4j2zef8jf
    parent: john at arbash-meinel.com-20070517170734-munw3o4jwwey6jpe
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-17 18:11:33 +0100
    message:
      Now that we know when tuples will be yielded
      go ahead and spend the effort to make them *nice* tuples.
    ------------------------------------------------------------
    revno: 2485.3.8
    merged: john at arbash-meinel.com-20070517170734-munw3o4jwwey6jpe
    parent: john at arbash-meinel.com-20070516062651-rb6kuunel56yh0dj
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_2
    timestamp: Thu 2007-05-17 18:07:34 +0100
    message:
      Keep track of directories that you have seen.
      Only fall back to _get_entry if that misses.
    ------------------------------------------------------------
    revno: 2485.3.7
    merged: john at arbash-meinel.com-20070516062651-rb6kuunel56yh0dj
    parent: john at arbash-meinel.com-20070516061855-zj7l63ffzw822ztg
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Wed 2007-05-16 07:26:51 +0100
    message:
      Comment on some alternative 'pack_stat'
    ------------------------------------------------------------
    revno: 2485.3.6
    merged: john at arbash-meinel.com-20070516061855-zj7l63ffzw822ztg
    parent: john at arbash-meinel.com-20070516055758-4qmxk53j07k6lnzz
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Wed 2007-05-16 07:18:55 +0100
    message:
      Delay calling pathjoin() until we've figured out we're going to use it
      (this is only a small win)
    ------------------------------------------------------------
    revno: 2485.3.5
    merged: john at arbash-meinel.com-20070516055758-4qmxk53j07k6lnzz
    parent: john at arbash-meinel.com-20070516054627-kv3otbqzd1bi22yx
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Wed 2007-05-16 06:57:58 +0100
    message:
      switching to a single returned object from _process_entry is not faster
      using result = _process_entry, if result is not ... is not strictly
      faster than for result in _process_entry. I'm not sure why.
      I think it is a little cleaner, though.
    ------------------------------------------------------------
    revno: 2485.3.4
    merged: john at arbash-meinel.com-20070516054627-kv3otbqzd1bi22yx
    parent: john at arbash-meinel.com-20070515232720-4alrtd5q4p4m3q1r
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Wed 2007-05-16 06:46:27 +0100
    message:
      Change _process_entry to detect when we don't care about something.
      This lets us do the standard check on local variables
      and handles the *very* common case of nothing being changed.
    ------------------------------------------------------------
    revno: 2485.3.3
    merged: john at arbash-meinel.com-20070515232720-4alrtd5q4p4m3q1r
    parent: john at arbash-meinel.com-20070515224039-h6arou8igzv9k34l
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Wed 2007-05-16 00:27:20 +0100
    message:
      Avoid extra work in inner 'DirState.update_entry' code.
      Use local vars for most functinos, directly lookup in local dict rather
      than going through osutils.stat_to_file_kind.
      Require the stat to be passed in, rather than being optional.
      change pack_stat to use binascii.b2a_base64 (the underlying function)
      Move pack_stat to top of code, so it can be saved as local variable.
      This cuts the time spent in update_entry to 1/2.
      Avoiding pack_stat is also a potentially large win. (5% or so)
    ------------------------------------------------------------
    revno: 2485.3.2
    merged: john at arbash-meinel.com-20070515224039-h6arou8igzv9k34l
    parent: john at arbash-meinel.com-20070515221632-xphatel2dqh29ya7
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Tue 2007-05-15 23:40:39 +0100
    message:
      Handle the case when getfilesystemencoding() is lowercase.
    ------------------------------------------------------------
    revno: 2485.3.1
    merged: john at arbash-meinel.com-20070515221632-xphatel2dqh29ya7
    parent: pqm at pqm.ubuntu.com-20070510055501-w262sk5hl33vmd19
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: dirstate_optim_1
    timestamp: Tue 2007-05-15 23:16:32 +0100
    message:
      Fix DirState handling of dir records.
      We were comparing stat_value.st_size against the recorded directory size
      which is always 0.
=== modified file 'NEWS'
--- a/NEWS	2007-05-30 15:27:20 +0000
+++ b/NEWS	2007-05-31 21:08:33 +0000
@@ -59,6 +59,11 @@
       the root of the source tree and allows HACKING to be split into multiple
       files. (Robert Collins, Alexander Belchenko)
 
+    * Clean up the ``WorkingTree4._iter_changes()`` internal loops as well as
+      ``DirState.update_entry()``. This optimizes the core logic for ``bzr
+      diff`` and ``bzr status`` significantly improving the speed of
+      both. (John Arbash Meinel)
+
 bzr 0.16rc2  2007-04-30
 
   BUGFIXES:

=== modified file 'bzrlib/dirstate.py'
--- a/bzrlib/dirstate.py	2007-04-21 14:39:50 +0000
+++ b/bzrlib/dirstate.py	2007-05-31 20:29:04 +0000
@@ -200,11 +200,12 @@
 """
 
 
-import base64
+import binascii
 import bisect
 import errno
 import os
 from stat import S_IEXEC
+import stat
 import struct
 import sys
 import time
@@ -223,6 +224,29 @@
     """This just keeps track of information as we are bisecting."""
 
 
+def pack_stat(st, _encode=binascii.b2a_base64, _pack=struct.pack):
+    """Convert stat values into a packed representation."""
+    # jam 20060614 it isn't really worth removing more entries if we
+    # are going to leave it in packed form.
+    # With only st_mtime and st_mode filesize is 5.5M and read time is 275ms
+    # With all entries filesize is 5.9M and read time is mabye 280ms
+    # well within the noise margin
+
+    # base64 encoding always adds a final newline, so strip it off
+    # The current version
+    return _encode(_pack('>LLLLLL'
+        , st.st_size, int(st.st_mtime), int(st.st_ctime)
+        , st.st_dev, st.st_ino & 0xFFFFFFFF, st.st_mode))[:-1]
+    # This is 0.060s / 1.520s faster by not encoding as much information
+    # return _encode(_pack('>LL', int(st.st_mtime), st.st_mode))[:-1]
+    # This is not strictly faster than _encode(_pack())[:-1]
+    # return '%X.%X.%X.%X.%X.%X' % (
+    #      st.st_size, int(st.st_mtime), int(st.st_ctime),
+    #      st.st_dev, st.st_ino, st.st_mode)
+    # Similar to the _encode(_pack('>LL'))
+    # return '%X.%X' % (int(st.st_mtime), st.st_mode)
+
+
 class DirState(object):
     """Record directory and metadata state for fast access.
 
@@ -255,6 +279,11 @@
             'r': 'relocated',
             't': 'tree-reference',
         }
+    _stat_to_minikind = {
+        stat.S_IFDIR:'d',
+        stat.S_IFREG:'f',
+        stat.S_IFLNK:'l',
+    }
     _to_yesno = {True:'y', False: 'n'} # TODO profile the performance gain
      # of using int conversion rather than a dict here. AND BLAME ANDREW IF
      # it is faster.
@@ -1059,7 +1088,9 @@
             raise
         return result
 
-    def update_entry(self, entry, abspath, stat_value=None):
+    def update_entry(self, entry, abspath, stat_value,
+                     _stat_to_minikind=_stat_to_minikind,
+                     _pack_stat=pack_stat):
         """Update the entry based on what is actually on disk.
 
         :param entry: This is the dirblock entry for the file in question.
@@ -1069,46 +1100,24 @@
         :return: The sha1 hexdigest of the file (40 bytes) or link target of a
                 symlink.
         """
-        # This code assumes that the entry passed in is directly held in one of
-        # the internal _dirblocks. So the dirblock state must have already been
-        # read.
-        assert self._dirblock_state != DirState.NOT_IN_MEMORY
-        if stat_value is None:
-            try:
-                # We could inline os.lstat but the common case is that
-                # stat_value will be passed in, not read here.
-                stat_value = self._lstat(abspath, entry)
-            except (OSError, IOError), e:
-                if e.errno in (errno.ENOENT, errno.EACCES,
-                               errno.EPERM):
-                    # The entry is missing, consider it gone
-                    return None
-                raise
-
-        kind = osutils.file_kind_from_stat_mode(stat_value.st_mode)
         try:
-            minikind = DirState._kind_to_minikind[kind]
-        except KeyError: # Unknown kind
+            minikind = _stat_to_minikind[stat_value.st_mode & 0170000]
+        except KeyError:
+            # Unhandled kind
             return None
-        packed_stat = pack_stat(stat_value)
+        packed_stat = _pack_stat(stat_value)
         (saved_minikind, saved_link_or_sha1, saved_file_size,
          saved_executable, saved_packed_stat) = entry[1][0]
 
         if (minikind == saved_minikind
-            and packed_stat == saved_packed_stat
+            and packed_stat == saved_packed_stat):
+            # The stat hasn't changed since we saved, so we can re-use the
+            # saved sha hash.
+            if minikind == 'd':
+                return None
+
             # size should also be in packed_stat
-            and saved_file_size == stat_value.st_size):
-            # The stat hasn't changed since we saved, so we can potentially
-            # re-use the saved sha hash.
-            if minikind == 'd':
-                return None
-
-            if self._cutoff_time is None:
-                self._sha_cutoff_time()
-
-            if (stat_value.st_mtime < self._cutoff_time
-                and stat_value.st_ctime < self._cutoff_time):
-                # Return the existing fingerprint
+            if saved_file_size == stat_value.st_size:
                 return saved_link_or_sha1
 
         # If we have gotten this far, that means that we need to actually
@@ -1118,8 +1127,15 @@
             link_or_sha1 = self._sha1_file(abspath, entry)
             executable = self._is_executable(stat_value.st_mode,
                                              saved_executable)
-            entry[1][0] = ('f', link_or_sha1, stat_value.st_size,
-                           executable, packed_stat)
+            if self._cutoff_time is None:
+                self._sha_cutoff_time()
+            if (stat_value.st_mtime < self._cutoff_time
+                and stat_value.st_ctime < self._cutoff_time):
+                entry[1][0] = ('f', link_or_sha1, stat_value.st_size,
+                               executable, packed_stat)
+            else:
+                entry[1][0] = ('f', '', stat_value.st_size,
+                               executable, DirState.NULLSTAT)
         elif minikind == 'd':
             link_or_sha1 = None
             entry[1][0] = ('d', '', 0, False, packed_stat)
@@ -1133,8 +1149,15 @@
                                    osutils.pathjoin(entry[0][0], entry[0][1]))
         elif minikind == 'l':
             link_or_sha1 = self._read_link(abspath, saved_link_or_sha1)
-            entry[1][0] = ('l', link_or_sha1, stat_value.st_size,
-                           False, packed_stat)
+            if self._cutoff_time is None:
+                self._sha_cutoff_time()
+            if (stat_value.st_mtime < self._cutoff_time
+                and stat_value.st_ctime < self._cutoff_time):
+                entry[1][0] = ('l', link_or_sha1, stat_value.st_size,
+                               False, packed_stat)
+            else:
+                entry[1][0] = ('l', '', stat_value.st_size,
+                               False, DirState.NULLSTAT)
         self._dirblock_state = DirState.IN_MEMORY_MODIFIED
         return link_or_sha1
 
@@ -2379,18 +2402,3 @@
         if cur_split < dirname_split: lo = mid+1
         else: hi = mid
     return lo
-
-
-
-def pack_stat(st, _encode=base64.encodestring, _pack=struct.pack):
-    """Convert stat values into a packed representation."""
-    # jam 20060614 it isn't really worth removing more entries if we
-    # are going to leave it in packed form.
-    # With only st_mtime and st_mode filesize is 5.5M and read time is 275ms
-    # With all entries filesize is 5.9M and read time is mabye 280ms
-    # well within the noise margin
-
-    # base64.encode always adds a final newline, so strip it off
-    return _encode(_pack('>LLLLLL'
-        , st.st_size, int(st.st_mtime), int(st.st_ctime)
-        , st.st_dev, st.st_ino & 0xFFFFFFFF, st.st_mode))[:-1]

=== modified file 'bzrlib/osutils.py'
--- a/bzrlib/osutils.py	2007-04-19 08:03:19 +0000
+++ b/bzrlib/osutils.py	2007-05-15 22:40:39 +0000
@@ -1149,7 +1149,7 @@
         path-from-top might be unicode or utf8, but it is the correct path to
         pass to os functions to affect the file in question. (such as os.lstat)
     """
-    fs_encoding = sys.getfilesystemencoding()
+    fs_encoding = sys.getfilesystemencoding().upper()
     if (sys.platform == 'win32' or
         fs_encoding not in ('UTF-8', 'US-ASCII', 'ANSI_X3.4-1968')): # ascii
         return _walkdirs_unicode_to_utf8(top, prefix=prefix)

=== modified file 'bzrlib/tests/test_dirstate.py'
--- a/bzrlib/tests/test_dirstate.py	2007-04-19 18:27:56 +0000
+++ b/bzrlib/tests/test_dirstate.py	2007-05-31 20:29:04 +0000
@@ -558,6 +558,9 @@
             self.assertEqual('', entry[1][0][1])
             # We should have a real entry.
             self.assertNotEqual((None, None), entry)
+            # Make sure everything is old enough
+            state._sha_cutoff_time()
+            state._cutoff_time += 10
             sha1sum = state.update_entry(entry, 'a-file', os.lstat('a-file'))
             # We should have gotten a real sha1
             self.assertEqual('ecc5374e9ed82ad3ea3b4d452ea995a5fd3e70e3',
@@ -1421,8 +1424,8 @@
         self.assertEqual('b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6',
                          link_or_sha1)
 
-        # The dirblock entry should be updated with the new info
-        self.assertEqual([('f', link_or_sha1, 14, False, packed_stat)],
+        # The dirblock entry should not cache the file's sha1
+        self.assertEqual([('f', '', 14, False, dirstate.DirState.NULLSTAT)],
                          entry[1])
         self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
                          state._dirblock_state)
@@ -1447,32 +1450,35 @@
                          link_or_sha1)
         self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
                          state._dirblock_state)
+        self.assertEqual([('f', '', 14, False, dirstate.DirState.NULLSTAT)],
+                         entry[1])
         state.save()
 
         # However, if we move the clock forward so the file is considered
-        # "stable", it should just returned the cached value.
-        state.adjust_time(20)
+        # "stable", it should just cache the value.
+        state.adjust_time(+20)
         link_or_sha1 = state.update_entry(entry, abspath='a',
                                           stat_value=stat_value)
         self.assertEqual('b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6',
                          link_or_sha1)
         self.assertEqual([('sha1', 'a'), ('is_exec', mode, False),
                           ('sha1', 'a'), ('is_exec', mode, False),
+                          ('sha1', 'a'), ('is_exec', mode, False),
                          ], state._log)
+        self.assertEqual([('f', link_or_sha1, 14, False, packed_stat)],
+                         entry[1])
 
-    def test_update_entry_no_stat_value(self):
-        """Passing the stat_value is optional."""
-        state, entry = self.get_state_with_a()
-        state.adjust_time(-10) # Make sure the file looks new
-        self.build_tree(['a'])
-        # Add one where we don't provide the stat or sha already
-        link_or_sha1 = state.update_entry(entry, abspath='a')
+        # Subsequent calls will just return the cached value
+        link_or_sha1 = state.update_entry(entry, abspath='a',
+                                          stat_value=stat_value)
         self.assertEqual('b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6',
                          link_or_sha1)
-        stat_value = os.lstat('a')
-        self.assertEqual([('lstat', 'a'), ('sha1', 'a'),
-                          ('is_exec', stat_value.st_mode, False),
+        self.assertEqual([('sha1', 'a'), ('is_exec', mode, False),
+                          ('sha1', 'a'), ('is_exec', mode, False),
+                          ('sha1', 'a'), ('is_exec', mode, False),
                          ], state._log)
+        self.assertEqual([('f', link_or_sha1, 14, False, packed_stat)],
+                         entry[1])
 
     def test_update_entry_symlink(self):
         """Update entry should read symlinks."""
@@ -1492,8 +1498,8 @@
                                           stat_value=stat_value)
         self.assertEqual('target', link_or_sha1)
         self.assertEqual([('read_link', 'a', '')], state._log)
-        # Dirblock is updated
-        self.assertEqual([('l', link_or_sha1, 6, False, packed_stat)],
+        # Dirblock is not updated (the link is too new)
+        self.assertEqual([('l', '', 6, False, dirstate.DirState.NULLSTAT)],
                          entry[1])
         self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
                          state._dirblock_state)
@@ -1503,23 +1509,70 @@
                                           stat_value=stat_value)
         self.assertEqual('target', link_or_sha1)
         self.assertEqual([('read_link', 'a', ''),
-                          ('read_link', 'a', 'target'),
+                          ('read_link', 'a', ''),
                          ], state._log)
+        self.assertEqual([('l', '', 6, False, dirstate.DirState.NULLSTAT)],
+                         entry[1])
         state.adjust_time(+20) # Skip into the future, all files look old
         link_or_sha1 = state.update_entry(entry, abspath='a',
                                           stat_value=stat_value)
         self.assertEqual('target', link_or_sha1)
-        # There should not be a new read_link call.
-        # (this is a weak assertion, because read_link is fairly inexpensive,
-        # versus the number of symlinks that we would have)
-        self.assertEqual([('read_link', 'a', ''),
-                          ('read_link', 'a', 'target'),
-                         ], state._log)
+        # We need to re-read the link because only now can we cache it
+        self.assertEqual([('read_link', 'a', ''),
+                          ('read_link', 'a', ''),
+                          ('read_link', 'a', ''),
+                         ], state._log)
+        self.assertEqual([('l', 'target', 6, False, packed_stat)],
+                         entry[1])
+
+        # Another call won't re-read the link
+        self.assertEqual([('read_link', 'a', ''),
+                          ('read_link', 'a', ''),
+                          ('read_link', 'a', ''),
+                         ], state._log)
+        link_or_sha1 = state.update_entry(entry, abspath='a',
+                                          stat_value=stat_value)
+        self.assertEqual('target', link_or_sha1)
+        self.assertEqual([('l', 'target', 6, False, packed_stat)],
+                         entry[1])
+
+    def do_update_entry(self, state, entry, abspath):
+        stat_value = os.lstat(abspath)
+        return state.update_entry(entry, abspath, stat_value)
 
     def test_update_entry_dir(self):
         state, entry = self.get_state_with_a()
         self.build_tree(['a/'])
-        self.assertIs(None, state.update_entry(entry, 'a'))
+        self.assertIs(None, self.do_update_entry(state, entry, 'a'))
+
+    def test_update_entry_dir_unchanged(self):
+        state, entry = self.get_state_with_a()
+        self.build_tree(['a/'])
+        state.adjust_time(+20)
+        self.assertIs(None, self.do_update_entry(state, entry, 'a'))
+        self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
+                         state._dirblock_state)
+        state.save()
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
+        self.assertIs(None, self.do_update_entry(state, entry, 'a'))
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
+
+    def test_update_entry_file_unchanged(self):
+        state, entry = self.get_state_with_a()
+        self.build_tree(['a'])
+        sha1sum = 'b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6'
+        state.adjust_time(+20)
+        self.assertEqual(sha1sum, self.do_update_entry(state, entry, 'a'))
+        self.assertEqual(dirstate.DirState.IN_MEMORY_MODIFIED,
+                         state._dirblock_state)
+        state.save()
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
+        self.assertEqual(sha1sum, self.do_update_entry(state, entry, 'a'))
+        self.assertEqual(dirstate.DirState.IN_MEMORY_UNMODIFIED,
+                         state._dirblock_state)
 
     def create_and_test_file(self, state, entry):
         """Create a file at 'a' and verify the state finds it.
@@ -1531,7 +1584,7 @@
         stat_value = os.lstat('a')
         packed_stat = dirstate.pack_stat(stat_value)
 
-        link_or_sha1 = state.update_entry(entry, abspath='a')
+        link_or_sha1 = self.do_update_entry(state, entry, abspath='a')
         self.assertEqual('b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6',
                          link_or_sha1)
         self.assertEqual([('f', link_or_sha1, 14, False, packed_stat)],
@@ -1548,7 +1601,7 @@
         stat_value = os.lstat('a')
         packed_stat = dirstate.pack_stat(stat_value)
 
-        link_or_sha1 = state.update_entry(entry, abspath='a')
+        link_or_sha1 = self.do_update_entry(state, entry, abspath='a')
         self.assertIs(None, link_or_sha1)
         self.assertEqual([('d', '', 0, False, packed_stat)], entry[1])
 
@@ -1569,50 +1622,19 @@
         stat_value = os.lstat('a')
         packed_stat = dirstate.pack_stat(stat_value)
 
-        link_or_sha1 = state.update_entry(entry, abspath='a')
+        link_or_sha1 = self.do_update_entry(state, entry, abspath='a')
         self.assertEqual('path/to/foo', link_or_sha1)
         self.assertEqual([('l', 'path/to/foo', 11, False, packed_stat)],
                          entry[1])
         return packed_stat
 
-    def test_update_missing_file(self):
-        state, entry = self.get_state_with_a()
-        packed_stat = self.create_and_test_file(state, entry)
-        # Now if we delete the file, update_entry should recover and
-        # return None.
-        os.remove('a')
-        self.assertIs(None, state.update_entry(entry, abspath='a'))
-        # And the record shouldn't be changed.
-        digest = 'b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6'
-        self.assertEqual([('f', digest, 14, False, packed_stat)],
-                         entry[1])
-
-    def test_update_missing_dir(self):
-        state, entry = self.get_state_with_a()
-        packed_stat = self.create_and_test_dir(state, entry)
-        # Now if we delete the directory, update_entry should recover and
-        # return None.
-        os.rmdir('a')
-        self.assertIs(None, state.update_entry(entry, abspath='a'))
-        self.assertEqual([('d', '', 0, False, packed_stat)], entry[1])
-
-    def test_update_missing_symlink(self):
-        if not osutils.has_symlinks():
-            # PlatformDeficiency / TestSkipped
-            raise TestSkipped("No symlink support")
-        state, entry = self.get_state_with_a()
-        packed_stat = self.create_and_test_symlink(state, entry)
-        os.remove('a')
-        self.assertIs(None, state.update_entry(entry, abspath='a'))
-        # And the record shouldn't be changed.
-        self.assertEqual([('l', 'path/to/foo', 11, False, packed_stat)],
-                         entry[1])
-
     def test_update_file_to_dir(self):
         """If a file changes to a directory we return None for the sha.
         We also update the inventory record.
         """
         state, entry = self.get_state_with_a()
+        # The file sha1 won't be cached unless the file is old
+        state.adjust_time(+10)
         self.create_and_test_file(state, entry)
         os.remove('a')
         self.create_and_test_dir(state, entry)
@@ -1623,6 +1645,8 @@
             # PlatformDeficiency / TestSkipped
             raise TestSkipped("No symlink support")
         state, entry = self.get_state_with_a()
+        # The file sha1 won't be cached unless the file is old
+        state.adjust_time(+10)
         self.create_and_test_file(state, entry)
         os.remove('a')
         self.create_and_test_symlink(state, entry)
@@ -1630,6 +1654,8 @@
     def test_update_dir_to_file(self):
         """Directory becoming a file updates the entry."""
         state, entry = self.get_state_with_a()
+        # The file sha1 won't be cached unless the file is old
+        state.adjust_time(+10)
         self.create_and_test_dir(state, entry)
         os.rmdir('a')
         self.create_and_test_file(state, entry)
@@ -1640,6 +1666,8 @@
             # PlatformDeficiency / TestSkipped
             raise TestSkipped("No symlink support")
         state, entry = self.get_state_with_a()
+        # The symlink target won't be cached if it isn't old
+        state.adjust_time(+10)
         self.create_and_test_dir(state, entry)
         os.rmdir('a')
         self.create_and_test_symlink(state, entry)
@@ -1649,6 +1677,8 @@
         if not has_symlinks():
             raise TestSkipped("No symlink support")
         state, entry = self.get_state_with_a()
+        # The symlink and file info won't be cached unless old
+        state.adjust_time(+10)
         self.create_and_test_symlink(state, entry)
         os.remove('a')
         self.create_and_test_file(state, entry)
@@ -1658,6 +1688,8 @@
         if not has_symlinks():
             raise TestSkipped("No symlink support")
         state, entry = self.get_state_with_a()
+        # The symlink target won't be cached if it isn't old
+        state.adjust_time(+10)
         self.create_and_test_symlink(state, entry)
         os.remove('a')
         self.create_and_test_dir(state, entry)
@@ -1677,11 +1709,16 @@
         packed_stat = dirstate.pack_stat(stat_value)
 
         state.adjust_time(-10) # Make sure everything is new
-        # Make sure it wants to kkkkkkkk
         state.update_entry(entry, abspath='a', stat_value=stat_value)
 
         # The row is updated, but the executable bit stays set.
+        self.assertEqual([('f', '', 14, True, dirstate.DirState.NULLSTAT)],
+                         entry[1])
+
+        # Make the disk object look old enough to cache
+        state.adjust_time(+20)
         digest = 'b50e5406bb5e153ebbeb20268fcf37c87e1ecfb6'
+        state.update_entry(entry, abspath='a', stat_value=stat_value)
         self.assertEqual([('f', digest, 14, True, packed_stat)], entry[1])
 
 

=== modified file 'bzrlib/workingtree_4.py'
--- a/bzrlib/workingtree_4.py	2007-04-30 17:45:10 +0000
+++ b/bzrlib/workingtree_4.py	2007-05-17 18:23:06 +0000
@@ -413,6 +413,8 @@
 
         file_abspath = self.abspath(path)
         state = self.current_dirstate()
+        if stat_value is None:
+            stat_value = os.lstat(file_abspath)
         link_or_sha1 = state.update_entry(entry, file_abspath,
                                           stat_value=stat_value)
         if entry[1][0][0] == 'f':
@@ -1666,7 +1668,7 @@
             output. An unversioned file is defined as one with (False, False)
             for the versioned pair.
         """
-        utf8_decode_or_none = cache_utf8._utf8_decode_with_None
+        utf8_decode = cache_utf8._utf8_decode
         _minikind_to_kind = dirstate.DirState._minikind_to_kind
         # NB: show_status depends on being able to pass in non-versioned files
         # and report them as unknown
@@ -1800,11 +1802,21 @@
         NULL_PARENT_DETAILS = dirstate.DirState.NULL_PARENT_DETAILS
         # Using a list so that we can access the values and change them in
         # nested scope. Each one is [path, file_id, entry]
-        last_source_parent = [None, None, None]
-        last_target_parent = [None, None, None]
+        last_source_parent = [None, None]
+        last_target_parent = [None, None]
 
         use_filesystem_for_exec = (sys.platform != 'win32')
 
+        # Just a sentry, so that _process_entry can say that this
+        # record is handled, but isn't interesting to process (unchanged)
+        uninteresting = object()
+
+
+        old_dirname_to_file_id = {}
+        new_dirname_to_file_id = {}
+        # TODO: jam 20070516 - Avoid the _get_entry lookup overhead by
+        #       keeping a cache of directories that we have seen.
+
         def _process_entry(entry, path_info):
             """Compare an entry and real disk to generate delta information.
 
@@ -1814,6 +1826,10 @@
                 Basename is returned as a utf8 string because we expect this
                 tuple will be ignored, and don't want to take the time to
                 decode.
+            :return: None if these don't match
+                     A tuple of information about the change, or
+                     the object 'uninteresting' if these match, but are
+                     basically identical.
             """
             if source_index is None:
                 source_details = NULL_PARENT_DETAILS
@@ -1830,6 +1846,7 @@
                 target_minikind = target_details[0]
             else:
                 link_or_sha1 = None
+            file_id = entry[0][2]
             source_minikind = source_details[0]
             if source_minikind in 'fdltr' and target_minikind in 'fdlt':
                 # claimed content in both: diff
@@ -1857,7 +1874,7 @@
                 else:
                     old_dirname = entry[0][0]
                     old_basename = entry[0][1]
-                    old_path = path = pathjoin(old_dirname, old_basename)
+                    old_path = path = None
                 if path_info is None:
                     # the file is missing on disk, show as removed.
                     content_change = True
@@ -1867,6 +1884,9 @@
                     # source and target are both versioned and disk file is present.
                     target_kind = path_info[2]
                     if target_kind == 'directory':
+                        if path is None:
+                            old_path = path = pathjoin(old_dirname, old_basename)
+                        new_dirname_to_file_id[path] = file_id
                         if source_minikind != 'd':
                             content_change = True
                         else:
@@ -1901,47 +1921,76 @@
                         target_exec = False
                     else:
                         raise Exception, "unknown kind %s" % path_info[2]
+                if source_minikind == 'd':
+                    if path is None:
+                        old_path = path = pathjoin(old_dirname, old_basename)
+                    old_dirname_to_file_id[old_path] = file_id
                 # parent id is the entry for the path in the target tree
                 if old_dirname == last_source_parent[0]:
                     source_parent_id = last_source_parent[1]
                 else:
-                    source_parent_entry = state._get_entry(source_index,
-                                                           path_utf8=old_dirname)
-                    source_parent_id = source_parent_entry[0][2]
+                    try:
+                        source_parent_id = old_dirname_to_file_id[old_dirname]
+                    except KeyError:
+                        source_parent_entry = state._get_entry(source_index,
+                                                               path_utf8=old_dirname)
+                        source_parent_id = source_parent_entry[0][2]
                     if source_parent_id == entry[0][2]:
                         # This is the root, so the parent is None
                         source_parent_id = None
                     else:
                         last_source_parent[0] = old_dirname
                         last_source_parent[1] = source_parent_id
-                        last_source_parent[2] = source_parent_entry
                 new_dirname = entry[0][0]
                 if new_dirname == last_target_parent[0]:
                     target_parent_id = last_target_parent[1]
                 else:
-                    # TODO: We don't always need to do the lookup, because the
-                    #       parent entry will be the same as the source entry.
-                    target_parent_entry = state._get_entry(target_index,
-                                                           path_utf8=new_dirname)
-                    assert target_parent_entry != (None, None), (
-                        "Could not find target parent in wt: %s\nparent of: %s"
-                        % (new_dirname, entry))
-                    target_parent_id = target_parent_entry[0][2]
+                    try:
+                        target_parent_id = new_dirname_to_file_id[new_dirname]
+                    except KeyError:
+                        # TODO: We don't always need to do the lookup, because the
+                        #       parent entry will be the same as the source entry.
+                        target_parent_entry = state._get_entry(target_index,
+                                                               path_utf8=new_dirname)
+                        assert target_parent_entry != (None, None), (
+                            "Could not find target parent in wt: %s\nparent of: %s"
+                            % (new_dirname, entry))
+                        target_parent_id = target_parent_entry[0][2]
                     if target_parent_id == entry[0][2]:
                         # This is the root, so the parent is None
                         target_parent_id = None
                     else:
                         last_target_parent[0] = new_dirname
                         last_target_parent[1] = target_parent_id
-                        last_target_parent[2] = target_parent_entry
 
                 source_exec = source_details[3]
-                return ((entry[0][2], (old_path, path), content_change,
-                        (True, True),
-                        (source_parent_id, target_parent_id),
-                        (old_basename, entry[0][1]),
-                        (_minikind_to_kind[source_minikind], target_kind),
-                        (source_exec, target_exec)),)
+                if (include_unchanged
+                    or content_change
+                    or source_parent_id != target_parent_id
+                    or old_basename != entry[0][1]
+                    or source_exec != target_exec
+                    ):
+                    if old_path is None:
+                        old_path = path = pathjoin(old_dirname, old_basename)
+                        old_path_u = utf8_decode(old_path)[0]
+                        path_u = old_path_u
+                    else:
+                        old_path_u = utf8_decode(old_path)[0]
+                        if old_path == path:
+                            path_u = old_path_u
+                        else:
+                            path_u = utf8_decode(path)[0]
+                    source_kind = _minikind_to_kind[source_minikind]
+                    return (entry[0][2],
+                           (old_path_u, path_u),
+                           content_change,
+                           (True, True),
+                           (source_parent_id, target_parent_id),
+                           (utf8_decode(old_basename)[0], utf8_decode(entry[0][1])[0]),
+                           (source_kind, target_kind),
+                           (source_exec, target_exec))
+                else:
+                    return uninteresting
             elif source_minikind in 'a' and target_minikind in 'fdlt':
                 # looks like a new file
                 if path_info is not None:
@@ -1960,12 +2009,14 @@
                             and stat.S_IEXEC & path_info[3].st_mode)
                     else:
                         target_exec = target_details[3]
-                    return ((entry[0][2], (None, path), True,
-                            (False, True),
-                            (None, parent_id),
-                            (None, entry[0][1]),
-                            (None, path_info[2]),
-                            (None, target_exec)),)
+                    return (entry[0][2],
+                           (None, utf8_decode(path)[0]),
+                           True,
+                           (False, True),
+                           (None, parent_id),
+                           (None, utf8_decode(entry[0][1])[0]),
+                           (None, path_info[2]),
+                           (None, target_exec))
                 else:
                     # but its not on disk: we deliberately treat this as just
                     # never-present. (Why ?! - RBC 20070224)
@@ -1980,12 +2031,14 @@
                 parent_id = state._get_entry(source_index, path_utf8=entry[0][0])[0][2]
                 if parent_id == entry[0][2]:
                     parent_id = None
-                return ((entry[0][2], (old_path, None), True,
-                        (True, False),
-                        (parent_id, None),
-                        (entry[0][1], None),
-                        (_minikind_to_kind[source_minikind], None),
-                        (source_details[3], None)),)
+                return (entry[0][2],
+                       (utf8_decode(old_path)[0], None),
+                       True,
+                       (True, False),
+                       (parent_id, None),
+                       (utf8_decode(entry[0][1])[0], None),
+                       (_minikind_to_kind[source_minikind], None),
+                       (source_details[3], None))
             elif source_minikind in 'fdlt' and target_minikind in 'r':
                 # a rename; could be a true rename, or a rename inherited from
                 # a renamed parent. TODO: handle this efficiently. Its not
@@ -2003,7 +2056,7 @@
                     "source_minikind=%r, target_minikind=%r"
                     % (source_minikind, target_minikind))
                 ## import pdb;pdb.set_trace()
-            return ()
+            return None
 
         while search_specific_files:
             # TODO: the pending list should be lexically sorted?  the
@@ -2039,30 +2092,11 @@
                 continue
             path_handled = False
             for entry in root_entries:
-                for result in _process_entry(entry, root_dir_info):
-                    # this check should probably be outside the loop: one
-                    # 'iterate two trees' api, and then _iter_changes filters
-                    # unchanged pairs. - RBC 20070226
+                result = _process_entry(entry, root_dir_info)
+                if result is not None:
                     path_handled = True
-                    if (include_unchanged
-                        or result[2]                    # content change
-                        or result[3][0] != result[3][1] # versioned status
-                        or result[4][0] != result[4][1] # parent id
-                        or result[5][0] != result[5][1] # name
-                        or result[6][0] != result[6][1] # kind
-                        or result[7][0] != result[7][1] # executable
-                        ):
-                        yield (result[0],
-                               (utf8_decode_or_none(result[1][0]),
-                                utf8_decode_or_none(result[1][1])),
-                               result[2],
-                               result[3],
-                               result[4],
-                               (utf8_decode_or_none(result[5][0]),
-                                utf8_decode_or_none(result[5][1])),
-                               result[6],
-                               result[7],
-                              )
+                    if result is not uninteresting:
+                        yield result
             if want_unversioned and not path_handled and root_dir_info:
                 new_executable = bool(
                     stat.S_ISREG(root_dir_info[3].st_mode)
@@ -2145,11 +2179,11 @@
                                         stat.S_ISREG(current_path_info[3].st_mode)
                                         and stat.S_IEXEC & current_path_info[3].st_mode)
                                     yield (None,
-                                        (None, utf8_decode_or_none(current_path_info[0])),
+                                        (None, utf8_decode(current_path_info[0])[0]),
                                         True,
                                         (False, False),
                                         (None, None),
-                                        (None, utf8_decode_or_none(current_path_info[1])),
+                                        (None, utf8_decode(current_path_info[1])[0]),
                                         (None, current_path_info[2]),
                                         (None, new_executable))
                                 # dont descend into this unversioned path if it is
@@ -2176,29 +2210,10 @@
                         for current_entry in current_block[1]:
                             # entry referring to file not present on disk.
                             # advance the entry only, after processing.
-                            for result in _process_entry(current_entry, None):
-                                # this check should probably be outside the loop: one
-                                # 'iterate two trees' api, and then _iter_changes filters
-                                # unchanged pairs. - RBC 20070226
-                                if (include_unchanged
-                                    or result[2]                    # content change
-                                    or result[3][0] != result[3][1] # versioned status
-                                    or result[4][0] != result[4][1] # parent id
-                                    or result[5][0] != result[5][1] # name
-                                    or result[6][0] != result[6][1] # kind
-                                    or result[7][0] != result[7][1] # executable
-                                    ):
-                                    yield (result[0],
-                                           (utf8_decode_or_none(result[1][0]),
-                                            utf8_decode_or_none(result[1][1])),
-                                           result[2],
-                                           result[3],
-                                           result[4],
-                                           (utf8_decode_or_none(result[5][0]),
-                                            utf8_decode_or_none(result[5][1])),
-                                           result[6],
-                                           result[7],
-                                          )
+                            result = _process_entry(current_entry, None)
+                            if result is not None:
+                                if result is not uninteresting:
+                                    yield result
                         block_index +=1
                         if (block_index < len(state._dirblocks) and
                             osutils.is_inside(current_root,
@@ -2233,29 +2248,10 @@
                         pass
                     elif current_path_info is None:
                         # no path is fine: the per entry code will handle it.
-                        for result in _process_entry(current_entry, current_path_info):
-                            # this check should probably be outside the loop: one
-                            # 'iterate two trees' api, and then _iter_changes filters
-                            # unchanged pairs. - RBC 20070226
-                            if (include_unchanged
-                                or result[2]                    # content change
-                                or result[3][0] != result[3][1] # versioned status
-                                or result[4][0] != result[4][1] # parent id
-                                or result[5][0] != result[5][1] # name
-                                or result[6][0] != result[6][1] # kind
-                                or result[7][0] != result[7][1] # executable
-                                ):
-                                yield (result[0],
-                                       (utf8_decode_or_none(result[1][0]),
-                                        utf8_decode_or_none(result[1][1])),
-                                       result[2],
-                                       result[3],
-                                       result[4],
-                                       (utf8_decode_or_none(result[5][0]),
-                                        utf8_decode_or_none(result[5][1])),
-                                       result[6],
-                                       result[7],
-                                      )
+                        result = _process_entry(current_entry, current_path_info)
+                        if result is not None:
+                            if result is not uninteresting:
+                                yield result
                     elif (current_entry[0][1] != current_path_info[1]
                           or current_entry[1][target_index][0] in 'ar'):
                         # The current path on disk doesn't match the dirblock
@@ -2270,55 +2266,17 @@
                         else:
                             # entry referring to file not present on disk.
                             # advance the entry only, after processing.
-                            for result in _process_entry(current_entry, None):
-                                # this check should probably be outside the loop: one
-                                # 'iterate two trees' api, and then _iter_changes filters
-                                # unchanged pairs. - RBC 20070226
-                                if (include_unchanged
-                                    or result[2]                    # content change
-                                    or result[3][0] != result[3][1] # versioned status
-                                    or result[4][0] != result[4][1] # parent id
-                                    or result[5][0] != result[5][1] # name
-                                    or result[6][0] != result[6][1] # kind
-                                    or result[7][0] != result[7][1] # executable
-                                    ):
-                                    yield (result[0],
-                                           (utf8_decode_or_none(result[1][0]),
-                                            utf8_decode_or_none(result[1][1])),
-                                           result[2],
-                                           result[3],
-                                           result[4],
-                                           (utf8_decode_or_none(result[5][0]),
-                                            utf8_decode_or_none(result[5][1])),
-                                           result[6],
-                                           result[7],
-                                          )
+                            result = _process_entry(current_entry, None)
+                            if result is not None:
+                                if result is not uninteresting:
+                                    yield result
                             advance_path = False
                     else:
-                        for result in _process_entry(current_entry, current_path_info):
-                            # this check should probably be outside the loop: one
-                            # 'iterate two trees' api, and then _iter_changes filters
-                            # unchanged pairs. - RBC 20070226
+                        result = _process_entry(current_entry, current_path_info)
+                        if result is not None:
                             path_handled = True
-                            if (include_unchanged
-                                or result[2]                    # content change
-                                or result[3][0] != result[3][1] # versioned status
-                                or result[4][0] != result[4][1] # parent id
-                                or result[5][0] != result[5][1] # name
-                                or result[6][0] != result[6][1] # kind
-                                or result[7][0] != result[7][1] # executable
-                                ):
-                                yield (result[0],
-                                       (utf8_decode_or_none(result[1][0]),
-                                        utf8_decode_or_none(result[1][1])),
-                                       result[2],
-                                       result[3],
-                                       result[4],
-                                       (utf8_decode_or_none(result[5][0]),
-                                        utf8_decode_or_none(result[5][1])),
-                                       result[6],
-                                       result[7],
-                                      )
+                            if result is not uninteresting:
+                                yield result
                     if advance_entry and current_entry is not None:
                         entry_index += 1
                         if entry_index < len(current_block[1]):
@@ -2335,11 +2293,11 @@
                                     stat.S_ISREG(current_path_info[3].st_mode)
                                     and stat.S_IEXEC & current_path_info[3].st_mode)
                                 yield (None,
-                                    (None, utf8_decode_or_none(current_path_info[0])),
+                                    (None, utf8_decode(current_path_info[0])[0]),
                                     True,
                                     (False, False),
                                     (None, None),
-                                    (None, utf8_decode_or_none(current_path_info[1])),
+                                    (None, utf8_decode(current_path_info[1])[0]),
                                     (None, current_path_info[2]),
                                     (None, new_executable))
                             # dont descend into this unversioned path if it is




More information about the bazaar-commits mailing list