[MERGE][1.0] update C PatienceDiff to allow for non-strings

Wed Dec 19 16:44:46 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> bb:tweak
> 
> John Arbash Meinel wrote:
> [...]
>> So if we expect that most strings will be unique, it is reasonable to use djb2
>> as the hash function. The big change is that it isn't much faster, but it does
>> support arbitrary objects.
>>
>> So anyway, I have an updated patch, but it is certainly less critical now. It
>> will allow arbitrary objects, but the performance is pretty equal. (I do wonder
>> if PyObject_Hash isn't the right way to go though. As there will be times when
>> the hash can already be computed, or help in future comparisons.)
> 
> If the performance difference is small, I'd be tempted to just use PyObject_Hash
> for simplicity's sake.  And as you say, sometimes the builtin caching of
> str.__hash__ may benefit us.

It was ~25% of the total time to diff all entries. So not huge, but certainly
measurable. I'm still willing to switch, though, as it means I can leave my
comment in, and I think simplicity is good here.

...
> What do you think of Lukas' idea of not declaring data to be const, so that you
> can avoid the cast here?  It seems reasonable to me.

Done.

> 
> [...]
>> +        line->data = item;
>> +        if (PyString_Check(item)) {
> 
> I think you should use PyString_CheckExact here.  You're going to ignore any
> behaviour that a str subclass would try to implement because you're using stuff
> like PyString_AS_STRING, and it's faster.

As I've gone with PyObject_Hash all around, I don't need to check explicitly
for PyString.

...

>> +        } else if (PyTuple_Check(item)) {
> 
> Use PyTuple_CheckExact here...

Done.

> 
>> +            total_len = 0;
>> +            tuple_len = PyObject_Length(item);
>> +            for (j = 0; j < tuple_len; ++j) {
>> +                subitem = PySequence_Fast_GET_ITEM(item, j);
> 
> ...so that we can use PySequence_Fast_GET_ITEM here safely.  A subclass could
> lie about its length and cause this to crash, but PyTuple_CheckExact would
> prevent subclasses from triggering this case.  (Alternatively you could use
> PyObject_GetIter/PyIter_Next to safely iterate over the object, but I expect
> that would be slower for plain tuples.)
> 
> Also, you may as well use PyTuple_GET_ITEM directly here.  (And technically
> because you didn't get “item” from PySequence_Fast I don't think it's guaranteed
> that PySequence_Fast_GET_ITEM should work with it.)

Done. I think I was just re-using the PySequence get from earlier.

> 
>> +                cur_len = PyObject_Length(subitem);
>> +                if (cur_len == -1) {
>> +                    size = -1;
>> +                    goto cleanup;
>> +                }
>> +                total_len += cur_len;
>> +            }
>> +            line->len = total_len;
>> +            line->hash = PyObject_Hash(item);
> 
> I think this whole else-if block deserves a comment.  It's not clear to me why
> you want to treat tuples differently to other objects.  Why is it correct to
> recurse one level into tuples to calculate the length, but not any other kind of
> sequence?

Well, you have "PyString" which you *really* don't want to recurse into (since
all nodes would be length one). You have PyList which can't be used because it
is unhashable. PyTuple we're handling. Are there other sequences of sequences
that we would want to support.

Supporting Tuples and Strings seemed enough to me. Even at this we wouldn't
*have* to recurse, as it would still give correct results. But it seems a lot
more useful to have a bit of 'len_a == len_b' filtering. One could also argue
that just comparing the hash would be enough.

I think the old code needed to track 'len' because it needed it anyway for the
"memcmp" step. Since I switched to PyObject_Compare, we could get rid of it
entirely and save a bit of memory overhead, and a comparison per step. I'll
play around with it.

		Unique	Repeat
bzr.dev  	4.00s  	3.92s
current		5.31s	2.76s

The code is simplified a lot, and is faster when the strings are repeated.

The new inner loop is just a simple:
    for (i = 0; i < size; i++) {
        item = PySequence_Fast_GET_ITEM(seq, i);
        line->data = item;
        line->hash = PyObject_Hash(item);
        if (line->hash == (-1)) {
            /* Propogate the hash exception */
            size = -1;
            goto cleanup;
        }
        line->next = SENTINEL;
        line++;
    }

> 
>> +        } else {
>> +            /* Generic length */
>> +            cur_len = PyObject_Length(item);
>> +            if (cur_len == -1) {
>> +                size = -1;
>> +                goto cleanup;
>> +            }
>> +            line->len = cur_len;
>> +            line->hash = PyObject_Hash(item);
>> +        }
> 
>> +        if (line->hash == (-1)) {
>> +            /* Propogate the hash exception */
>> +            size = -1;
>> +            goto cleanup;
>> +        }
> 
> As Lukas pointed out, this isn't safe with djb2 hash.
> 
> You could solve this the same way Python's string_hash function does:
> 
> 	if (x == -1)
> 		x = -2;
> 
> Or just use PyObject_Hash always, and not bother with djb2.  ;)

Which is the route I went.

> 
>>          line->next = SENTINEL;
>>          line++;
>>      }
>>  
>> +    cleanup:
>>      Py_DECREF(seq);
>>      return size;
>>  }
>>
>> === modified file 'bzrlib/tests/test_diff.py'
>> --- bzrlib/tests/test_diff.py	2007-11-25 21:52:24 +0000
>> +++ bzrlib/tests/test_diff.py	2007-12-04 16:28:28 +0000
>> @@ -800,6 +800,38 @@
>>          chk_blocks('abbabbbb', 'cabbabbc', [])
>>          chk_blocks('bbbbbbbb', 'cbbbbbbc', [])
>>  
>> +    def test_matching_blocks_tuples(self):
>> +        def chk_blocks(a, b, expected_blocks):
>> +            # difflib always adds a signature of the total
>> +            # length, with no matching entries at the end
>> +            s = self._PatienceSequenceMatcher(None, a, b)
>> +            blocks = s.get_matching_blocks()
>> +            self.assertEquals((len(a), len(b), 0), blocks[-1])
>> +            self.assertEquals(expected_blocks, blocks[:-1])
> 
> This chk_blocks helper is repeated a couple of times already.  Rather
> duplicating it again, make it a method of TestPatienceDiffLib.  I'd call it
> “assertDiffBlocks” to be consistent with the convention for custom assertion
> methods, but I don't mind if you keep calling it “chk_blocks”.
> 

Well, the other reason was because all the tests get a nice short function to
use. I'll do a bit of refactoring.

> [...]
>> +    def test_unhashable(self):
>> +        """We should get a proper exception here."""
>> +        # A sub-list is unhashable
>> +        e = self.assertRaises(TypeError, self._PatienceSequenceMatcher,
>> +                                         None, [[]], [])
> 
> I'd also like to see the case where an element other than the first is
> unhashable.

Done. Both elements after the first, and elements in the second sequence.

> 
>> +    def test_no_length(self):
>> +        # Objects have no length
>> +        e = self.assertRaises(TypeError, self._PatienceSequenceMatcher,
>> +                                         None, [object()], [])
> 
> The comment here is only half-written.  “Objects have no length, so they are not
> diffable.”  The same applies to test_unhashable.  It may feel a bit like stating
> the obvious, but I prefer to err on the side of explicitness when it comes to
> tests.  I've already had to guess at the intent of too many tests in my short
> lifetime :)
> 
> Also, that comment ought to be a docstring I think.
> 
> -Andrew.

Since I switched, we don't need len() anymore. But we still need hash()
(otherwise we can't build up a hash table of possible matches).

But I clarified the comment.

Since you voted tweak, I will submit the updated version. But if you have any
feedback, I'm happy to address it.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHaUp9JdeBCYSNAAMRAuGdAKCfsI9FyT75YT6YGbie1Pc5JoSm7ACg0GPN
m7+OUTWzvg5zjI45C2AWpGA=
=k9H4
-----END PGP SIGNATURE-----