[MERGE][1.0] update C PatienceDiff to allow for non-strings

Andrew Bennetts andrew at canonical.com
Wed Dec 19 09:43:43 GMT 2007


bb:tweak

John Arbash Meinel wrote:
[...]
> So if we expect that most strings will be unique, it is reasonable to use djb2
> as the hash function. The big change is that it isn't much faster, but it does
> support arbitrary objects.
> 
> So anyway, I have an updated patch, but it is certainly less critical now. It
> will allow arbitrary objects, but the performance is pretty equal. (I do wonder
> if PyObject_Hash isn't the right way to go though. As there will be times when
> the hash can already be computed, or help in future comparisons.)

If the performance difference is small, I'd be tempted to just use PyObject_Hash
for simplicity's sake.  And as you say, sometimes the built-in caching of
str.__hash__ may benefit us.
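(For reference, PyObject_Hash is the C-level counterpart of Python's hash()
builtin, so a quick sketch in modern Python shows why it handles arbitrary
hashable objects for free:)

```python
# hash() is what PyObject_Hash exposes at the Python level.
print(hash("a line of text"))   # works; CPython also caches a str's hash
print(hash(("a", "tuple")))     # tuples are hashable too

try:
    hash([])                    # lists are not hashable
except TypeError as e:
    print("TypeError:", e)      # the C API signals this by returning -1
```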

[...]
> # Begin patch
> === modified file 'bzrlib/_patiencediff_c.c'
> --- bzrlib/_patiencediff_c.c	2007-09-04 09:10:35 +0000
> +++ bzrlib/_patiencediff_c.c	2007-12-05 14:09:08 +0000
> @@ -70,11 +70,11 @@
>  
>  
>  struct line {
> -    int hash;          /* hash code of the string */
> +    long hash;         /* hash code of the string/object */
>      Py_ssize_t next;   /* next line from the same equivalence class */
>      Py_ssize_t equiv;  /* equivalence class */
>      Py_ssize_t len;
> -    const char *data;
> +    const PyObject *data;
>  };
>  
>  
> @@ -151,8 +151,8 @@
>  static inline int
>  compare_lines(struct line *a, struct line *b)
>  {
> -    return ((a->hash != b->hash) || (a->len != b->len) ||
> -            memcmp(a->data, b->data, a->len));
> +    return ((a->hash != b->hash) || (a->len != b->len)
> +            || PyObject_Compare((PyObject *)a->data, (PyObject *)b->data));
>  }

What do you think of Lukas' idea of not declaring data to be const, so that you
can avoid the cast here?  It seems reasonable to me.

[...]
> +        line->data = item;
> +        if (PyString_Check(item)) {

I think you should use PyString_CheckExact here.  You're going to ignore any
behaviour that a str subclass would try to implement anyway, because you're
using macros like PyString_AS_STRING, and the exact check is faster.
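A Python-level sketch of the kind of subclass behaviour that gets bypassed
(the class name is made up for illustration):

```python
class TrickyStr(str):
    """A str subclass overriding behaviour that the fast macros bypass."""
    def __len__(self):
        return 0  # lies about its length

s = TrickyStr("hello")
assert isinstance(s, str)       # PyString_Check would accept it
assert type(s) is not str       # PyString_CheckExact would reject it
assert len(s) == 0              # the override PyObject_Length would honour
assert str.__len__(s) == 5      # the raw size PyString_GET_SIZE reports
```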

> +            long hash;
> +            const char *p;
> +            /* we could use the 'djb2' hash here if we find it is better than
> +               PyObject_Hash, however it seems measurably slower to compute,
> +               and doesn't give particularly better results
> +             */
> +            line->len = PyString_GET_SIZE(item);
> +            p = PyString_AS_STRING(item);
> +            /* 'djb2' hash. This gives us a nice compromise between fast hash
> +                function and a hash with less collisions. The algorithm doesn't
> +                use the hash for actual lookups, only for building the table
> +                so a better hash function wouldn't bring us much benefit, but
> +                would make this loading code slower. */

As Lukas pointed out, the first comment here is now wrong.

> +            hash = 5381;
> +            for (j = 0; j < line->len; j++)
> +                hash = ((hash << 5) + hash) + *p++;
> +            line->hash = hash;
> +            // line->hash = PyObject_Hash(item);
> +        } else if (PyTuple_Check(item)) {

Use PyTuple_CheckExact here...

> +            total_len = 0;
> +            tuple_len = PyObject_Length(item);
> +            for (j = 0; j < tuple_len; ++j) {
> +                subitem = PySequence_Fast_GET_ITEM(item, j);

...so that we can use PySequence_Fast_GET_ITEM here safely.  A subclass could
lie about its length and cause this to crash, but PyTuple_CheckExact would
prevent subclasses from triggering this case.  (Alternatively you could use
PyObject_GetIter/PyIter_Next to safely iterate over the object, but I expect
that would be slower for plain tuples.)
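The length mismatch is easy to demonstrate from Python (hypothetical subclass,
just to illustrate the hazard):

```python
class LyingTuple(tuple):
    """A tuple subclass that claims to be longer than it really is."""
    def __len__(self):
        return 10  # what PyObject_Length would report

t = LyingTuple((1, 2))
assert len(t) == 10             # the loop bound the C code would trust
assert tuple.__len__(t) == 2    # the real size PyTuple_GET_ITEM relies on
```

In the C code that gap means reading past the end of the tuple's storage;
PyTuple_CheckExact rules the subclass out before it can lie.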

Also, you may as well use PyTuple_GET_ITEM directly here.  (And technically
because you didn't get “item” from PySequence_Fast I don't think it's guaranteed
that PySequence_Fast_GET_ITEM should work with it.)

> +                cur_len = PyObject_Length(subitem);
> +                if (cur_len == -1) {
> +                    size = -1;
> +                    goto cleanup;
> +                }
> +                total_len += cur_len;
> +            }
> +            line->len = total_len;
> +            line->hash = PyObject_Hash(item);

I think this whole else-if block deserves a comment.  It's not clear to me why
you want to treat tuples differently to other objects.  Why is it correct to
recurse one level into tuples to calculate the length, but not any other kind of
sequence?
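As I read it, the special case computes something like this (a Python sketch;
"line_len" is my name for it, not something from the patch):

```python
def line_len(item):
    # Tuples: sum the lengths of the elements, one level deep only.
    if type(item) is tuple:
        return sum(len(sub) for sub in item)
    # Everything else: the generic length, as PyObject_Length would give.
    return len(item)

assert line_len(("ab", "c")) == 3
assert line_len("abc") == 3
assert line_len(["ab", "c"]) == 2   # a list gets the generic length
```

That last case is exactly what puzzles me: why should a list of two strings
and a tuple of the same two strings get different lengths?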

> +        } else {
> +            /* Generic length */
> +            cur_len = PyObject_Length(item);
> +            if (cur_len == -1) {
> +                size = -1;
> +                goto cleanup;
> +            }
> +            line->len = cur_len;
> +            line->hash = PyObject_Hash(item);
> +        }

> +        if (line->hash == (-1)) {
> +            /* Propogate the hash exception */
> +            size = -1;
> +            goto cleanup;
> +        }

As Lukas pointed out, this isn't safe with djb2 hash.

You could solve this the same way Python's string_hash function does:

	if (x == -1)
		x = -2;

Or just use PyObject_Hash always, and not bother with djb2.  ;)
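For what it's worth, here's the djb2 loop with that guard, sketched in modern
Python with a signed conversion to mimic the C 'long' arithmetic:

```python
def to_signed64(x):
    """Interpret x as a signed 64-bit value (mimics C 'long' overflow)."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

def djb2(data):
    h = 5381
    for byte in data:           # data is bytes, so byte is already an int
        h = to_signed64((h << 5) + h + byte)
    if h == -1:
        h = -2   # reserve -1 as the error marker, like string_hash does
    return h

assert djb2(b"") == 5381
assert djb2(b"a") == 5381 * 33 + ord("a")
```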

>          line->next = SENTINEL;
>          line++;
>      }
>  
> +    cleanup:
>      Py_DECREF(seq);
>      return size;
>  }
> 
> === modified file 'bzrlib/tests/test_diff.py'
> --- bzrlib/tests/test_diff.py	2007-11-25 21:52:24 +0000
> +++ bzrlib/tests/test_diff.py	2007-12-04 16:28:28 +0000
> @@ -800,6 +800,38 @@
>          chk_blocks('abbabbbb', 'cabbabbc', [])
>          chk_blocks('bbbbbbbb', 'cbbbbbbc', [])
>  
> +    def test_matching_blocks_tuples(self):
> +        def chk_blocks(a, b, expected_blocks):
> +            # difflib always adds a signature of the total
> +            # length, with no matching entries at the end
> +            s = self._PatienceSequenceMatcher(None, a, b)
> +            blocks = s.get_matching_blocks()
> +            self.assertEquals((len(a), len(b), 0), blocks[-1])
> +            self.assertEquals(expected_blocks, blocks[:-1])

This chk_blocks helper is repeated a couple of times already.  Rather than
duplicating it again, make it a method of TestPatienceDiffLib.  I'd call it
“assertDiffBlocks” to be consistent with the convention for custom assertion
methods, but I don't mind if you keep calling it “chk_blocks”.
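Something like this, say (using difflib.SequenceMatcher as a runnable stand-in
for self._PatienceSequenceMatcher):

```python
import difflib
import unittest

class TestDiffBlocks(unittest.TestCase):

    def assertDiffBlocks(self, a, b, expected_blocks):
        """Check that the matcher returns the expected matching blocks."""
        s = difflib.SequenceMatcher(None, a, b)
        blocks = s.get_matching_blocks()
        # The matcher always appends a (len(a), len(b), 0) sentinel.
        self.assertEqual((len(a), len(b), 0), blocks[-1])
        self.assertEqual(expected_blocks, [tuple(m) for m in blocks[:-1]])

    def test_identical(self):
        self.assertDiffBlocks('abc', 'abc', [(0, 0, 3)])
```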

[...]
> +    def test_unhashable(self):
> +        """We should get a proper exception here."""
> +        # A sub-list is unhashable
> +        e = self.assertRaises(TypeError, self._PatienceSequenceMatcher,
> +                                         None, [[]], [])

I'd also like to see the case where an element other than the first is
unhashable.
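e.g. something like this (again with difflib.SequenceMatcher standing in for
the patience matcher; it also hashes the elements at construction time):

```python
import difflib
import unittest

class TestUnhashable(unittest.TestCase):

    def test_unhashable_later_element(self):
        """An unhashable element anywhere in the sequence, not just in the
        first position, should raise TypeError rather than crash."""
        self.assertRaises(TypeError, difflib.SequenceMatcher,
                          None, [], ['hashable', []])
```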

> +    def test_no_length(self):
> +        # Objects have no length
> +        e = self.assertRaises(TypeError, self._PatienceSequenceMatcher,
> +                                         None, [object()], [])

The comment here is only half-written.  “Objects have no length, so they are not
diffable.”  The same applies to test_unhashable.  It may feel a bit like stating
the obvious, but I prefer to err on the side of explicitness when it comes to
tests.  I've already had to guess at the intent of too many tests in my short
lifetime :)

Also, that comment ought to be a docstring I think.

-Andrew.



