Rev 20: More C tweaks from John. in http://people.ubuntu.com/~robertc/baz2.0/plugins/groupcompress/trunk

Robert Collins robertc at robertcollins.net
Fri Jul 25 05:58:27 BST 2008


At http://people.ubuntu.com/~robertc/baz2.0/plugins/groupcompress/trunk

------------------------------------------------------------
revno: 20
revision-id: robertc at robertcollins.net-20080725045827-ixxmnk4cp7cmzad9
parent: robertc at robertcollins.net-20080725031326-2c1xq6xusfoefxsi
parent: john at arbash-meinel.com-20080725032223-uz2pbbqz5ez2t3mi
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Fri 2008-07-25 14:58:27 +1000
message:
  More C tweaks from John.
modified:
  _groupcompress_c.pyx           _groupcompress_c.pyx-20080724041824-yelg6ii7c7zxt4z0-1
  groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
  tests/test__groupcompress_c.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
    ------------------------------------------------------------
    revno: 14.1.37
    revision-id: john at arbash-meinel.com-20080725032223-uz2pbbqz5ez2t3mi
    parent: john at arbash-meinel.com-20080725025005-j85yzscr9r4121q3
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Thu 2008-07-24 22:22:23 -0500
    message:
      It turns out that appending a short set of lines every time was killing performance in realloc.
      So fudge a bit and allocate memory up to the next power of 2.
      This may be a bit overzealous, but we do need something to prevent a realloc for every extend.
    modified:
      _groupcompress_c.pyx           _groupcompress_c.pyx-20080724041824-yelg6ii7c7zxt4z0-1
    ------------------------------------------------------------
    revno: 14.1.36
    revision-id: john at arbash-meinel.com-20080725025005-j85yzscr9r4121q3
    parent: john at arbash-meinel.com-20080725023821-cv0jl06r6xjuwxje
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Thu 2008-07-24 21:50:05 -0500
    message:
      Small tweak makes a big difference on inventory.py, minor otherwise.
      
      Return the incremented positions, and indicate if we know that the next line doesn't match
      anything.
    modified:
      _groupcompress_c.pyx           _groupcompress_c.pyx-20080724041824-yelg6ii7c7zxt4z0-1
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 14.1.35
    revision-id: john at arbash-meinel.com-20080725023821-cv0jl06r6xjuwxje
    parent: john at arbash-meinel.com-20080725022704-01j35xm7z3k98qit
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Thu 2008-07-24 21:38:21 -0500
    message:
      remove the timing calls
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 14.1.34
    revision-id: john at arbash-meinel.com-20080725022704-01j35xm7z3k98qit
    parent: john at arbash-meinel.com-20080725014613-6pe4o1mjuhajfkwb
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Thu 2008-07-24 21:27:04 -0500
    message:
      Doing the intersection as we go allows us to short-cut a bit more.
      
      Changes inventory.py from (2.4 => 2.0 => 1.7s)
      builtins.py from (113 => 84 => 73s)
    modified:
      _groupcompress_c.pyx           _groupcompress_c.pyx-20080724041824-yelg6ii7c7zxt4z0-1
      tests/test__groupcompress_c.py test__groupcompress_-20080724145854-koifwb7749cfzrvj-1
=== modified file '_groupcompress_c.pyx'
--- a/_groupcompress_c.pyx	2008-07-25 03:13:26 +0000
+++ b/_groupcompress_c.pyx	2008-07-25 04:58:27 +0000
@@ -264,14 +264,18 @@
         cdef Py_ssize_t line_index
         cdef _raw_line *cur_line
         cdef PyObject *seq_index
+        cdef Py_ssize_t actual_new_len
 
         seq_index = PySequence_Fast(index, "expected a sequence for index")
         try:
             old_len = self._len_lines
             new_count = PySequence_Fast_GET_SIZE(seq_index) 
             new_total_len = new_count + self._len_lines
+            actual_new_len = 1
+            while actual_new_len < new_total_len:
+                actual_new_len = actual_new_len << 1
             self._raw_lines = <_raw_line*>safe_realloc(<void*>self._raw_lines,
-                                    new_total_len * sizeof(_raw_line))
+                                    actual_new_len * sizeof(_raw_line))
             self._len_lines = new_total_len
             # Now that we have enough space, start adding the new lines
             # into the array. These are done in forward order.
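
The hunk above rounds each allocation up to the next power of two, so a long
sequence of small extends passes through only O(log n) distinct allocation
sizes instead of growing the block on every call. A minimal pure-Python
sketch of that growth policy (the function name is illustrative, not from
the patch):

    def round_up_to_power_of_2(n):
        # Keep doubling until the allocation covers the requested length.
        # Amortised over many extends, each element is copied O(1) times.
        size = 1
        while size < n:
            size = size << 1
        return size

    # round_up_to_power_of_2(1300) == 2048; until the line count grows past
    # 2048, every subsequent realloc asks for the same size, which is cheap.
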
@@ -452,6 +456,68 @@
             cur_line_idx = self._raw_lines[cur_line_idx].next_line_index
         return cur_bucket.count
 
+    cdef Py_ssize_t get_next_matches(self, PyObject *line,
+                                     Py_ssize_t *tips,
+                                     Py_ssize_t tip_count,
+                                     int *did_match) except -1:
+        """Find matches that are interesting for the next step.
+
+        This finds offsets which match the current search tips. It is slightly
+        different from get_raw_matches, in that it intersects matching lines
+        against the existing tips, and then returns the *next* line.
+
+        TODO: It might be nice to know if there were 0 matches because the line
+              didn't match anything, or if it was just because it didn't follow
+              anything.
+
+        :param line: A line to match against
+        :param tips: A list of matches that we already have. This list
+            will be updated "in place". If there are no new matches, the list
+            will go unmodified, and the returned count will be 0.
+        :param tip_count: The current length of tips
+        :param did_match: Will be set to 1 if the line had matches, else 0
+        :return: The new number of matched lines.
+        """
+        cdef Py_ssize_t hash_offset
+        cdef _raw_line raw_line
+        cdef _hash_bucket cur_bucket
+        cdef Py_ssize_t cur_line_idx
+        cdef Py_ssize_t count
+        cdef Py_ssize_t *cur_tip
+        cdef Py_ssize_t *end_tip
+        cdef Py_ssize_t *cur_out
+
+        self._line_to_raw_line(line, &raw_line)
+        hash_offset = self._find_hash_position(&raw_line)
+        cur_bucket = self._hashtable[hash_offset]
+        cur_line_idx = cur_bucket.line_index
+        if cur_line_idx == SENTINEL:
+            did_match[0] = 0
+            return 0
+
+        did_match[0] = 1
+
+        cur_tip = tips
+        end_tip = tips + tip_count
+        cur_out = tips
+        count = 0
+        while cur_tip < end_tip and cur_line_idx != SENTINEL:
+            if cur_line_idx == cur_tip[0]:
+                # These match, so put a new entry in the output
+                # We put the *next* line, so that this is ready to pass back in
+                # for the next search.
+                cur_out[0] = (cur_line_idx + 1)
+                count = count + 1
+                cur_out = cur_out + 1
+                cur_tip = cur_tip + 1
+                cur_line_idx = self._raw_lines[cur_line_idx].next_line_index
+            elif cur_line_idx < cur_tip[0]:
+                # cur_line_idx is "behind"
+                cur_line_idx = self._raw_lines[cur_line_idx].next_line_index
+            else:
+                cur_tip = cur_tip + 1
+        return count
+
     def _get_matching_lines(self):
         """Return a dictionary showing matching lines."""
         matching = {}
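
get_next_matches is a merge-style intersection of two sorted sequences: the
chain of occurrences of the new line, and the current search tips. A tip
survives only if the line occurs at exactly that position, and each survivor
is advanced by one so the result can be fed straight back in on the next
call. A pure-Python sketch of the walk (list-based, with illustrative names;
the C version follows next_line_index links and rewrites the tips array in
place):

    def next_matches(occurrences, tips):
        # Both inputs are sorted line indices.
        out = []
        i = j = 0
        while i < len(occurrences) and j < len(tips):
            if occurrences[i] == tips[j]:
                out.append(occurrences[i] + 1)  # ready for the next round
                i += 1
                j += 1
            elif occurrences[i] < tips[j]:
                i += 1  # this occurrence is behind the current tip
            else:
                j += 1  # this tip can never match again; drop it
        return out

    # next_matches([3, 10, 42], [10, 20]) == [11]: only the tip at line 10
    # continued, and the match range can now try to extend at line 11.
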
@@ -581,6 +647,7 @@
     cdef Py_ssize_t in_start
     cdef Py_ssize_t i
     cdef Py_ssize_t intersect_count
+    cdef int did_match
 
     eq = equivalence_table
     cpos = pos
@@ -593,60 +660,45 @@
     copy_count = 0
 
     # TODO: Handle when locations is not None
-    if locations is not None:
-        match_count = list_as_array(locations, &matches)
+    # if locations is not None:
+    #     copy_count = list_as_array(locations, &copy_ends)
+    #     # We know the match for this position
+    #     cpos = cpos + 1
+    #     range_len = 1
     while cpos < cmax_pos:
         # TODO: Instead of grabbing the full list of matches and then doing an
         #       intersection, use a helper that does the intersection
         #       as-it-goes.
-        if matches == NULL:
-            line = PyList_GET_ITEM(eq._right_lines, cpos)
-            safe_free(<void**>&matches)
-            match_count = eq.get_raw_matches(line, &matches)
-        if match_count == 0:
-            # No more matches, just return whatever we have, but we know that
-            # this last position is not going to match anything
-            cpos = cpos + 1
-            break
+        line = PyList_GET_ITEM(eq._right_lines, cpos)
+        if copy_ends == NULL:
+            copy_count = eq.get_raw_matches(line, &copy_ends)
+            if copy_count == 0:
+                # No matches for this line
+                cpos = cpos + 1
+                break
+            # point at the next location, for the next round trip
+            range_len = 1
+            for i from 0 <= i < copy_count:
+                copy_ends[i] = copy_ends[i] + 1
         else:
-            if copy_ends == NULL:
-                # We are starting a new range, just steal the matches
-                copy_ends = matches
-                copy_count = match_count
-                matches = NULL
-                match_count = 0
-                for i from 0 <= i < copy_count:
-                    copy_ends[i] = copy_ends[i] + 1
-                range_len = 1
+            # We are in the middle of a match
+            intersect_count = eq.get_next_matches(line, copy_ends, copy_count,
+                                                  &did_match)
+            if intersect_count == 0:
+                # No more matches
+                if not did_match:
+                    # because the next line is not interesting
+                    cpos = cpos + 1
+                break
             else:
-                # We are currently in the middle of a match
-                intersect_count = intersect_sorted(copy_ends, copy_count,
-                                                   matches, match_count)
-                if intersect_count == 0:
-                    # But we are done with this match, we should be
-                    # starting a new one, though. We will pass back 'locations'
-                    # so that we don't have to do another lookup.
-                    break
-                else:
-                    copy_count = intersect_count
-                    # range continues, step the copy ends
-                    for i from 0 <= i < copy_count:
-                        copy_ends[i] = copy_ends[i] + 1
-                    range_len = range_len + 1
-                    safe_free(<void **>&matches)
-                    match_count = 0
+                copy_count = intersect_count
+                range_len = range_len + 1
         cpos = cpos + 1
-    if matches == NULL:
-        # We consumed all of our matches
-        locations = None
-    else:
-        locations = array_as_list(matches, match_count)
-        safe_free(<void **>&matches)
     if copy_ends == NULL or copy_count == 0:
         # We never had a match
-        return None, cpos, locations
+        return None, cpos, None
     # Because copy_ends is always sorted, we know the 'min' is just the first
     # node
     in_start = copy_ends[0]
     safe_free(<void**>&copy_ends)
-    return ((in_start - range_len), range_start, range_len), cpos, locations
+    return ((in_start - range_len), range_start, range_len), cpos, None
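
Taken together, the rewritten loop starts a copy range at the first line
that has any matches and then extends it one line at a time, letting
get_next_matches prune the candidate set as it goes rather than fetching the
full match list and intersecting afterwards. A pure-Python sketch of the
control flow (get_matches and get_next stand in for the equivalence-table
methods, and the now always-None locations slot of the real return value is
omitted):

    def get_longest_match(lines, pos, max_pos, get_matches, get_next):
        range_start = pos
        range_len = 0
        copy_ends = None
        while pos < max_pos:
            line = lines[pos]
            if copy_ends is None:
                # Start a new range: record the position *after* each
                # occurrence of this line.
                matches = get_matches(line)
                if not matches:
                    pos += 1  # this line matches nothing; step past it
                    break
                copy_ends = [m + 1 for m in matches]
                range_len = 1
            else:
                # Extend the range, intersecting as we go.
                new_ends, did_match = get_next(line, copy_ends)
                if not new_ends:
                    if not did_match:
                        pos += 1  # the next line is a dead end as well
                    break  # keep the range found so far
                copy_ends = new_ends
                range_len += 1
            pos += 1
        if not copy_ends:
            return None, pos
        # copy_ends stays sorted, so its first entry is the leftmost match.
        return (copy_ends[0] - range_len, range_start, range_len), pos
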

=== modified file 'groupcompress.py'
--- a/groupcompress.py	2008-07-25 00:09:27 +0000
+++ b/groupcompress.py	2008-07-25 04:58:27 +0000
@@ -134,7 +134,6 @@
             of the list is always (old_len, new_len, 0) to provide an end point
             for generating instructions from the matching blocks list.
         """
-        import time
         result = []
         pos = 0
         line_locations = self.line_locations
@@ -143,25 +142,15 @@
         # insert new lines. To find reusable lines we traverse 
         locations = None
         max_pos = len(lines)
-        timer = time.clock
         max_time = 0.0
         max_info = None
+        result_append = result.append
         while pos < max_pos:
-            tstart = timer()
-            block, next_pos, locations = _get_longest_match(line_locations, pos,
-                                                            max_pos, locations)
-            tdelta = timer() - tstart
-            if tdelta > max_time:
-                max_time = tdelta
-                max_info = tdelta, pos, block, next_pos, locations
-            pos = next_pos
-            
+            block, pos, locations = _get_longest_match(line_locations, pos,
+                                                       max_pos, locations)
             if block is not None:
-                result.append(block)
-        result.append((len(self.lines), len(lines), 0))
-        # if max_time > 0.01:
-        #     print max_info[:-1]
-        #     import pdb; pdb.set_trace()
+                result_append(block)
+        result_append((len(self.lines), len(lines), 0))
         return result
 
     def compress(self, key, lines, expected_sha):
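
Besides deleting the timing scaffolding, the groupcompress.py hunk binds
result.append to a local name outside the loop; in CPython that avoids
repeating the attribute lookup on every iteration of a hot loop. The pattern
in isolation (illustrative function, not from the patch):

    def collect_blocks(blocks):
        result = []
        result_append = result.append  # look the method up once, not per call
        for block in blocks:
            result_append(block)
        return result
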

=== modified file 'tests/test__groupcompress_c.py'
--- a/tests/test__groupcompress_c.py	2008-07-25 01:34:28 +0000
+++ b/tests/test__groupcompress_c.py	2008-07-25 02:27:04 +0000
@@ -224,8 +224,12 @@
     def assertLongestMatch(self, block, end_pos, matching_locs,
                            eq, start_pos, max_len, known_locs):
         """Check that _get_longest_match gives correct results."""
-        val = self._longest_match(eq, start_pos, max_len, known_locs)
-        self.assertEqual((block, end_pos, matching_locs), val)
+        the_block, the_end_pos, next_locs = self._longest_match(eq,
+                                                start_pos, max_len, known_locs)
+        self.assertEqual(block, the_block)
+        self.assertEqual(end_pos, the_end_pos)
+        if next_locs is not None:
+            self.assertEqual(matching_locs, next_locs)
 
     def test_all_match(self):
         eq = self._eq_class(['a', 'b', 'c'])



