[MERGE][RFC] C implementation of PatienceSequenceMatcher

Andrew Bennetts andrew at canonical.com
Fri Aug 3 07:12:39 BST 2007


!comment

Hmm.  C code really is much harder to read and trust than Python code.  So I'm
pretty hesitant to say we should have this.  I'm not against it, but in my
opinion the benefits have to be pretty large... the burden of this code is much
larger than most of our code, just because it's C.

It is actually pretty good as far as C code goes, though!  It is pedantic about
checking return values, and generally it's fairly straightforward to follow how
the memory allocations and refcounts go, even in the error-handling.  There are
a lot of mildly obscure loops, though, particularly ones doing things like:

> +        for (j = a[i].hash & hsize; hashtable[j].bhead != -1; j = (j + 1) & hsize)

or:

> +    for (i = bsize - 1; i >= 0; i--) {

That's simple enough, but it leaves you wondering why it iterates backwards (and
the comments don't explain why).

Probably this code needs commenting even more thoroughly than C code usually
needs, because this community doesn't have much C expertise.

I guess my main worry is that the intersection of people with good C skills and
with a good understanding of patience diff internals is pretty tiny.  So if
something goes wrong, we may struggle to find the time to properly investigate
and fix it.

I also think this code ought to use Python's dictionary type rather than
building its own hashmap data structure.  A good chunk of the hairiest code would
be unnecessary if it used PyDicts instead.  So I'd be very interested in hearing
if this code still performs as well with PyDicts.

So I'm not really voting on this yet, because I think we need a broader-than-
usual consensus to merge this.

On to the review...

Lukáš Lalinský wrote:
[...]
> +enum {
> +    OP_EQUAL = 0,
> +    OP_INSERT,
> +    OP_DELETE,
> +    OP_REPLACE
> +};
> +
> +
> +static char *opcode_names[] = {
> +    "equal",
> +    "insert",
> +    "delete",
> +    "replace",
> +};

You should have a comment explicitly saying that the enum values need to
correspond to the order of the names in the opcode_names array.
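Something as simple as this would do:

/* NB: keep this enum in sync with opcode_names below -- the values are
   used to index that array. */
enum {
    OP_EQUAL = 0,
    OP_INSERT,
    OP_DELETE,
    OP_REPLACE
};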

[...]
> +struct bucket {
> +    Py_ssize_t ahead;  /* first item in `a` from this equivalence class */
> +    Py_ssize_t an;
> +    Py_ssize_t bhead;  /* first item in `b` from this equivalence class */
> +    Py_ssize_t bn;
> +};

I'd be tempted to call those "a_head" and "b_head", just to avoid the confusion
of looking like the word "ahead".  But that's probably just me :)

[...]
> +static inline Py_ssize_t
> +bisect_left(Py_ssize_t *list, Py_ssize_t item, Py_ssize_t lo, Py_ssize_t hi)
> +{
> +    while (lo < hi) {
> +        Py_ssize_t mid = (lo + hi) / 2;
> +        if (list[mid] < item)
> +            lo = mid + 1;
> +        else
> +            hi = mid;
> +    }
> +    return lo;
> +}

In recent Pythons, the standard library's bisect module is implemented in C.
Would it be a significant performance hit (or a significant inconvenience) to
reuse that?  The obvious downside is that those bisect functions work on
PyObjects, rather than on Py_ssize_t.

Oh, and this code might overflow if (lo + hi) is too big... I guess we probably
have other problems to worry about before we can cope with datasets that large
though! :)
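For the record, the usual overflow-proof spelling of the midpoint is:

Py_ssize_t mid = lo + (hi - lo) / 2;

since hi - lo always fits comfortably even when lo + hi would wrap.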

[...]
> +static struct bucket *
> +equate_lines(struct line *a, struct line *b, Py_ssize_t asize, Py_ssize_t bsize)
> +{
> +    Py_ssize_t i, j, hsize;
> +    struct bucket *hashtable;
> +
> +    /* build a hash table of the next highest power of 2 */
> +    hsize = 1;
> +    while (hsize < bsize + 1)
> +        hsize *= 2;

Hmm.  Is it really worth implementing your own hash table?  Python dictionaries
are pretty heavily optimised, even though they do use PyObjects.  I'm pretty
reluctant to maintain our own code for this when Python already has very good
(and heavily tested!) code in this area.  Could you try using Python dict
objects here, and see what the impact is (both on performance, and on complexity
of your own code)?

(And again, there's a risk of overflow here!)
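To make the dict suggestion concrete, here's roughly the shape I have in mind
(untested, and build_b2j is a name I've made up): a dict keyed on the line
strings themselves, mapping each line to the list of its positions in b, much
like difflib's b2j table.

static PyObject *
build_b2j(PyObject *bseq)  /* bseq as returned by PySequence_Fast */
{
    PyObject *b2j = PyDict_New();
    Py_ssize_t i, size = PySequence_Fast_GET_SIZE(bseq);

    if (b2j == NULL)
        return NULL;
    for (i = 0; i < size; i++) {
        PyObject *line = PySequence_Fast_GET_ITEM(bseq, i);  /* borrowed */
        PyObject *positions = PyDict_GetItem(b2j, line);     /* borrowed */
        PyObject *index;

        if (positions == NULL) {
            positions = PyList_New(0);
            if (positions == NULL ||
                    PyDict_SetItem(b2j, line, positions) < 0) {
                Py_XDECREF(positions);
                goto error;
            }
            Py_DECREF(positions);  /* b2j keeps it alive */
        }
        index = PyInt_FromSsize_t(i);  /* 2.5+, but you already #if on that */
        if (index == NULL || PyList_Append(positions, index) < 0) {
            Py_XDECREF(index);
            goto error;
        }
        Py_DECREF(index);
    }
    return b2j;

error:
    Py_DECREF(b2j);
    return NULL;
}

Python then does all the hashing, probing and resizing for you.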

> +
> +    hashtable = (struct bucket *)malloc(sizeof(struct bucket) * hsize);
> +    if (hashtable == NULL) {
> +        PyErr_NoMemory();
> +        return NULL;
> +    }
> +
> +    for (i = 0; i < hsize; i++) {
> +        hashtable[i].an = 0;
> +        hashtable[i].bn = 0;
> +        hashtable[i].ahead = -1;
> +        hashtable[i].bhead = -1;
> +    }
> +    hsize--;
> +
> +    /* add lines from b to the hash table chains */
> +    for (i = bsize - 1; i >= 0; i--) {
> +        /* find the first h entry, which is either empty or contains
> +           the same line as b[i] */
> +        for (j = b[i].hash & hsize; hashtable[j].bhead != -1; j = (j + 1) & hsize)
> +            if (!compare_lines(b + i, b + hashtable[j].bhead))
> +                break;

Oh boy :)

This is exactly the sort of code I hope we can avoid by using Python dicts :)

I don't do much C, but I'd be more comfortable with braces in those for and if
statements, to discourage easy mistakes...

[...]
> +        /* find the first h entry, which is either empty or contains
> +           the same line as a[i] */
> +        for (j = a[i].hash & hsize; hashtable[j].bhead != -1; j = (j + 1) & hsize)
> +            if (!compare_lines(a + i, b + hashtable[j].bhead))
> +                break;

Hmm, this duplicated logic is unfortunate.  If we do keep it, I'd like to see it
factored out somehow... code this complex should only exist in one place!
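e.g. a helper along these lines (find_bucket is a name I've made up, and I've
added the braces I was asking for above):

/* Probe the table for the slot that either is empty or already holds the
   equivalence class of `line`.  hmask is hsize after the hsize--. */
static inline Py_ssize_t
find_bucket(struct line *line, struct line *b,
            struct bucket *hashtable, Py_ssize_t hmask)
{
    Py_ssize_t j;
    for (j = line->hash & hmask; hashtable[j].bhead != -1;
         j = (j + 1) & hmask) {
        if (!compare_lines(line, b + hashtable[j].bhead)) {
            break;
        }
    }
    return j;
}

so that both call sites reduce to j = find_bucket(b + i, b, hashtable, hsize)
and j = find_bucket(a + i, b, hashtable, hsize).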

[...]
> +/* Finds longest common subsequence of unique lines in a[alo:ahi] and
> +   b[blo:bhi].
> +   Parameter backpointers must have allocated memory for at least
> +   4 * (bhi - blo) ints. */

I wouldn't mind a comment somewhere (perhaps inside the function rather than
right here) elaborating on how you arrive at that 4 * (bhi - blo) figure.  I
think I can see why (the buffer gets carved into four bsize-long arrays:
backpointers, stacks, lasts and btoa), but some explicit reasoning wouldn't hurt.

> +Py_ssize_t
> +unique_lcs(struct matching_line *answer, struct bucket *h, Py_ssize_t *backpointers,
> +           struct line *a, struct line *b, Py_ssize_t alo, Py_ssize_t blo, Py_ssize_t ahi, Py_ssize_t bhi)
> +{
> +    Py_ssize_t i, j, k, e, apos, bpos, napos, nbpos, bsize, stacksize;
> +    Py_ssize_t *stacks, *lasts, *btoa;
> +
> +    k = 0;
> +    stacksize = 0;
> +    bsize = bhi - blo;
> +
> +    stacks = backpointers + bsize;
> +    lasts = stacks + bsize;
> +    btoa = lasts + bsize;

Some explanation of these variables' roles would be good.

"stacks" and "lasts" have rather different names, but looking at the code the
difference seems to be "stacks is about lines from a" and "lasts is about lines
from b".  I guess this terminology is used in the description of the patience
diff algorithm?

The unusual way you use backpointers for four separate arrays deserves some
comments too.
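The comment I have in mind is something like:

/* `backpointers` is a single allocation carved into four arrays of
   bsize entries each:
       [ backpointers | stacks | lasts | btoa ]
   which is where the 4 * (bhi - blo) requirement above comes from. */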

> +
> +    for (i = 0; i < bsize; i++)
> +        backpointers[i] = -1;

A brief comment before this for-loop, along the lines of "initialise all the
backpointers to -1", would be good.

Why don't you initialise stacks, lasts and btoa as well?

> +
> +    for (bpos = blo; bpos < bhi; bpos++) {
> +        e = b[bpos].equiv;
> +
> +        /* no lines in a or b */
> +        if (!h[e].an || !h[e].bn)
> +            continue;

Ouch, that's a bit too concise.  "h[e].an"?  Please use longer identifiers :)

> +
> +        /* find an unique line in a, that matches b[bpos] */
> +        apos = -1;
> +        for (j = h[e].ahead; j != -1; j = a[j].next)
> +            if (alo <= j && j < ahi) {
> +                if (apos != -1)
> +                    goto nextb;
> +                apos = j;
> +            }
> +        if (apos == -1)
> +            continue;

My C is very rusty, so I'm finding this pretty hard to follow.  Part of it is
the short variable names.  It would help me if "a" and "b" were "lines_a" and
"lines_b" for instance.  And I'm no longer in the habit of expecting "i" and "j"
to be random loop variables :)

Part of it also is that I don't actually know how the patience algorithm works
:)

I think some comments back up where you define the structs, going into a little
more detail about how they hang together, would be helpful.

Also, I'm starting to think that a "#define SENTINEL -1" might be a good idea,
to make the intent of this value a little more transparent.

I'd insert more comments, too.  e.g. in the "if (apos != -1)" block, say 
"/* this line isn't unique in a */".

As far as I can tell, the "i" and "j" loop variables don't have overlapping
lifetimes in this function.  Why not just use "i" everywhere?  When I see "j" I
start to expect more complexity, because I assume "i" must already be doing
something else.

> +
> +        /* find duplicates of b[bpos] in b */
> +        for (j = h[e].bhead; j != -1; j = b[j].next)
> +            if (j != bpos && blo <= j && j < bhi)
> +                goto nextb;

So this loop doesn't just find duplicates, but it skips to trying the next line
if it does find one.  The comment could be clearer about what happens when it
finds a duplicate.

> +
> +        nbpos = bpos - blo;
> +        napos = apos - alo;
> +        btoa[nbpos] = napos;

"nbpos"?  You're using "n" as an abbreviation for "length" (or perhaps
"distance")?  That's a bit odd.  And why is "pos" still in the identifier for a
length?  I think there are probably better names for these variables.

It's probably clearer if you're familiar with this algorithm, but I'm not.

> +/* Adds a new line to the list of matching blocks, either extending the
> +   current block or adding a new one. */
> +static inline void
> +add_matching_line(struct matching_block *matches, Py_ssize_t *n, Py_ssize_t a, Py_ssize_t b)
> +{
> +    Py_ssize_t t = *n - 1;
> +    if ((t >= 0) && (a == matches[t].a + matches[t].len) &&
> +                    (b == matches[t].b + matches[t].len)) {
> +        matches[t].len++;
> +    }
> +    else {
> +        t++;
> +        matches[t].a = a;
> +        matches[t].b = b;
> +        matches[t].len = 1;
> +        (*n)++;
> +    }
> +}

What is "n" here, and why does this function sometimes increment it?  If it were
called "matches_count" its purpose would be more obvious.

Although I wonder if it would be even clearer to make a struct like:

struct matching_blocks {
    struct matching_block *matches;
    Py_ssize_t count;
};

And pass in a reference to that as a single "object", rather than the pair of
"*matches" and "*n", which are conceptually part of the same thing.

Also, it's not clear to me why it's safe to assume here that matches has enough
space allocated when t++ starts a new block.
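To be concrete, with that struct the function might look something like this
(an untested sketch, same logic as the patch):

static inline void
add_matching_line(struct matching_blocks *blocks, Py_ssize_t a, Py_ssize_t b)
{
    Py_ssize_t t = blocks->count - 1;
    if (t >= 0 && a == blocks->matches[t].a + blocks->matches[t].len
               && b == blocks->matches[t].b + blocks->matches[t].len) {
        /* adjacent to the current block, so just extend it */
        blocks->matches[t].len++;
    }
    else {
        /* start a new block -- note this assumes the caller allocated
           room for one more entry, which is the assumption I'm
           questioning above */
        t++;
        blocks->matches[t].a = a;
        blocks->matches[t].b = b;
        blocks->matches[t].len = 1;
        blocks->count++;
    }
}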


> +static Py_ssize_t
> +recurse_matches(struct matching_block *answer, Py_ssize_t answersize,
> +                struct bucket *hashtable, Py_ssize_t *backpointers,
> +                struct line *a, struct line *b,
> +                Py_ssize_t alo, Py_ssize_t blo, Py_ssize_t ahi, Py_ssize_t bhi,
> +                int maxrecursion)

And recurse_matches takes another pair, *answer/answersize, which I think would
be clearer as a "struct matching_blocks".  Grouping these sorts of intimately
related variables into structs helps keep down the number of parameters to
these functions as well.

There may even be a tiny speed boost from pushing less stuff on to the stack
with each recursive call :)

[...]
> +    last_a_pos = alo - 1;
> +    last_b_pos = blo - 1;

I'm not sure why you subtract 1 from these values, especially seeing as more
often than not you seem to add it back on again when you use them later.

> +
> +    lcs = (struct matching_line *)malloc(sizeof(struct matching_line) * (bhi - blo));
> +    if (lcs == NULL)
> +        return -1;

You get brownie points for actually checking the return value of malloc :)

[...]
> +        new++;

I'd just do "new = 1" here.  There's no need to increment, you're just setting a
flag, so make it clear that that's all you're doing.  This also reduces the
chances of an arithmetic overflow on insanely large input :)

> +static Py_ssize_t
> +load_lines(PyObject *orig, struct line **lines)
> +{
> +    Py_ssize_t size, i, j;
> +    int h;
> +    char *p;
> +    struct line *line;
> +    PyObject *seq, *item;
> +
> +    seq = PySequence_Fast(orig, "");
> +    if (seq == NULL) {
> +        PyErr_NoMemory();
> +        return -1;
> +    }

I don't think this is right.  If PySequence_Fast fails due to a memory error,
I'd expect it would already set the error state appropriately, so I don't think
you should unconditionally convert all errors (including the likely TypeErrors
if a non-sequence is passed in) into MemoryErrors here.
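i.e. I'd expect simply:

seq = PySequence_Fast(orig, "expected a sequence of lines");
if (seq == NULL)
    return -1;  /* PySequence_Fast has already set a suitable exception */

(the second argument is the message PySequence_Fast uses for its TypeError).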

> +    for (i = 0; i < size; i++) {
> +        item = PySequence_Fast_GET_ITEM(seq, i);
> +        if (!PyString_Check(item)){
> +            PyErr_Format(PyExc_TypeError,
> +                     "sequence item %zd: expected string,"
> +                     " %.80s found",
> +                     i, item->ob_type->tp_name);
> +            Py_DECREF(seq);
> +            return -1;
> +        }
> +        line->len = PyString_GET_SIZE(item);
> +        line->data = p = PyString_AS_STRING(item);
> +        h = 5381;
> +        for (j = 0; j < line->len; j++)
> +            h = ((h << 5) + h) + *p++;
> +        line->hash = h;
> +        line->next = -1;
> +        line++;
> +    }

This is another reason to use Python dicts... so you don't have to reinvent
hashing :)

At the very least, make a nice symbolic constant for that magic number.  But
really, you already have these strings as PyObjects; I think you'd be better off
leaving them as PyStrings rather than building your own hash map data structure.

(And if there's a good reason why I'm wrong, then add comments justifying doing
it this way to the code!)
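If the lines stay PyStrings, hashing is a single call, and the hash is cached
inside the string object after the first use.  Something like (sketch only):

long h = PyObject_Hash(item);
if (h == -1 && PyErr_Occurred()) {
    Py_DECREF(seq);
    return -1;
}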

> +static PyObject *
> +py_unique_lcs(PyObject *self, PyObject *args)
> +{
> +    PyObject *aseq, *bseq, *res, *item;
> +    Py_ssize_t asize, bsize, i, nmatches, *backpointers = NULL;
> +    struct line *a = NULL, *b = NULL;
> +    struct matching_line *matches = NULL;
> +    struct bucket *hashtable = NULL;
> +
> +    if (!PyArg_ParseTuple(args, "OO", &aseq, &bseq))
> +        return NULL;
> +
> +    asize = load_lines(aseq, &a);
> +    bsize = load_lines(bseq, &b);
> +    if (asize == -1 || bsize == -1)
> +        goto error;

Yeah, here you feed Python objects from the user straight to load_lines.  You
definitely should not be blindly assuming errors from PySequence_Fast are memory
errors!

> +{
> +    PyObject *aseq, *bseq, *item, *answer;
> +    int maxrecursion;
> +    Py_ssize_t i, j, asize, bsize, nmatches, alo, blo, ahi, bhi;
> +    Py_ssize_t *backpointers = NULL;
> +    struct line *a = NULL, *b = NULL;
> +    struct matching_block *matches = NULL;
> +    struct bucket *hashtable = NULL;
> +
> +#if PY_VERSION_HEX < 0x02050000
> +    if (!PyArg_ParseTuple(args, "OOiiiiOi", &aseq, &bseq, &alo, &blo,
> +                          &ahi, &bhi, &answer, &maxrecursion))
> +#else
> +    if (!PyArg_ParseTuple(args, "OOnnnnOi", &aseq, &bseq, &alo, &blo,
> +                          &ahi, &bhi, &answer, &maxrecursion))
> +#endif
> +        return NULL;

I'm a bit nervous about taking values like alo from Python code... it seems like
it would make it easy for pure Python code to trick this module into crashing.
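At minimum I'd want a sanity check once load_lines has filled in asize and
bsize, something like this (hypothetical, using the patch's names and assuming
the same goto-error cleanup pattern as py_unique_lcs):

if (alo < 0 || ahi < alo || ahi > asize ||
        blo < 0 || bhi < blo || bhi > bsize) {
    PyErr_SetString(PyExc_ValueError,
                    "recurse_matches: slice indexes out of range");
    goto error;
}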

> +    nn = n + n;

Heh :)

I haven't really looked at the Python changes; they mostly seemed to be
rearranging existing code.

-Andrew.
