[MERGE][RFC] C implementation of PatienceSequenceMatcher

Lukáš Lalinský lalinsky at gmail.com
Fri Aug 3 13:26:26 BST 2007


On Fri, 2007-08-03 at 16:12 +1000, Andrew Bennetts wrote:
> !comment
> 
> Hmm.  C code really is much harder to read and trust than Python code.  So I'm
> pretty hesitant to say we should have this.  I'm not against it, but in my
> opinion the benefits have to be pretty large... the burden of this code is much
> larger than most of our code, just because it's C.

I know, that's why it's marked as RFC. I really wasn't sure whether it
was a good idea to include code like this in a Python project.

> It is actually pretty good as far as C code goes, though!  It is pedantic about
> checking return values, and generally it's fairly straightforward to follow how
> the memory allocations and refcounts go, even in the error-handling.  There's a
> lot of mildly obscure loops though, particularly ones doing stuff like:
> 
> > +        for (j = a[i].hash & hsize; hashtable[j].bhead != -1; j = (j + 1) & hsize)
> 
> or:
> 
> > +    for (i = bsize - 1; i >= 0; i--) {
> 
> Which is simple enough, but you're left wondering why it iterates backwards (and
> the comments don't explain why).

Actually, the reason the other one iterated forwards was just a
leftover from older code, when I wanted to keep the lists in the hash
table sorted in a particular way. I've changed it, and now they both
iterate backwards.

> Probably this code needs commenting even more thoroughly than C code usually
> needs, because this community doesn't have much C expertise.

I've added some more comments.

> I guess my main worry is that the intersection of people with good C skills and
> with a good understanding of patience diff internals is pretty tiny.  So if
> something goes wrong, we may struggle to find the time to properly investigate
> and fix it.

Well, if something goes wrong there is always the option to simply
remove the code and use the Python version as the primary one.

> I also think this code ought to use Python's dictionary type rather than
> building its own hashmap datastructure.  A good chunk of the hairiest code would
> be unnecessary if it used PyDicts instead.  So I'd be very interested in hearing
> if this code still performs as well with PyDicts.

It wouldn't. The hash table it uses is not a traditional hash table;
in fact, it's questionable whether it can be called a hash table at
all. The main difference is that it does not use hashes for lookups:
hashes are used only to build the table, and from that point on it
uses keys ("equivalence classes") with real O(1) access time. A side
effect of this is that I can represent each text line by a single int.
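
Roughly, the idea looks like this (a sketch with illustrative names,
not a verbatim excerpt from the patch):

    /* Sketch only: after equate_lines() has run, every line carries
       the index of its equivalence class, so "are these lines equal?"
       is just an int comparison, not a hash lookup or string compare. */
    struct line {
        long hash;          /* used only while building the table */
        Py_ssize_t equiv;   /* index of the line's equivalence class */
        /* ... length, data pointer, chain link ... */
    };

    if (lines_a[i].equiv == lines_b[j].equiv) {
        /* the two lines have identical text */
    }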

This is probably the only real advantage of this C code; everything
else is just constant-factor improvement, which is not a big win. C
code based on PyDicts would be almost as fast as the Python version on
bigger inputs, because, as I've found out, the patience diff algorithm
is a competition of data structures, not languages (for example, a C++
version based on standard containers was slower than the Python
version, simply because PyDicts are so optimised).

> On to the review...

Thanks for the review!

> [...]
> > +enum {
> > +    OP_EQUAL = 0,
> > +    OP_INSERT,
> > +    OP_DELETE,
> > +    OP_REPLACE
> > +};
> > +
> > +
> > +static char *opcode_names[] = {
> > +    "equal",
> > +    "insert",
> > +    "delete",
> > +    "replace",
> > +};
> 
> You should have a comment explicitly saying that the enum values need to
> correspond to the order of the names in the opcode_names array.

Done.

> [...]
> > +struct bucket {
> > +    Py_ssize_t ahead;  /* first item in `a` from this equivalence class */
> > +    Py_ssize_t an;
> > +    Py_ssize_t bhead;  /* first item in `b` from this equivalence class */
> > +    Py_ssize_t bn;
> > +};
> 
> I'd be tempted to call those "a_head" and "b_head", just to avoid the confusion
> of looking like the word "ahead".  But that's probably just me :)

Renamed.

> [...]
> > +static inline Py_ssize_t
> > +bisect_left(Py_ssize_t *list, Py_ssize_t item, Py_ssize_t lo, Py_ssize_t hi)
> > +{
> > +    while (lo < hi) {
> > +        Py_ssize_t mid = (lo + hi) / 2;
> > +        if (list[mid] < item)
> > +            lo = mid + 1;
> > +        else
> > +            hi = mid;
> > +    }
> > +    return lo;
> > +}
> 
> In recent Pythons, the standard library's bisect module is implemented in C.
> Would it be a significant performance hit (or a significant inconvenience) to
> reuse that?  The obvious downside is that those bisect functions work on
> PyObjects, rather than on Py_ssize_t.

I'd need to use PySequences and PyObjects. Also, since I use this
function in only one place, it's defined as inline and will be
inserted directly into the calling function. Otherwise, the C code in
Python's bisect module is almost identical to this.

> Oh, and this code might overflow if (lo + hi) is too big... I guess we probably
> have other problems to worry about before we can cope with datasets that large
> though! :)

Fixed.
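
(For reference, it's the usual trick of not adding the two bounds
directly; this is a sketch, not necessarily the exact line in the
updated patch:)

    /* lo + hi could overflow Py_ssize_t; lo + (hi - lo) / 2 cannot */
    Py_ssize_t mid = lo + (hi - lo) / 2;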

> [...]
> > +static struct bucket *
> > +equate_lines(struct line *a, struct line *b, Py_ssize_t asize, Py_ssize_t bsize)
> > +{
> > +    Py_ssize_t i, j, hsize;
> > +    struct bucket *hashtable;
> > +
> > +    /* build a hash table of the next highest power of 2 */
> > +    hsize = 1;
> > +    while (hsize < bsize + 1)
> > +        hsize *= 2;
> 
> Hmm.  Is it really worth implementing your own hash table?  Python dictionaries
> are pretty heavily optimised, even though they do use PyObjects.  I'm pretty
> reluctant to maintain our own code for this when Python already has very good
> (and heavily tested!) code in this area.  Could you try using Python dict
> objects here, and see what the impact is (both on performance, and on complexity
> of your own code)?

Yes, as explained above - it isn't a traditional hash table. The side
product of building this table (it allows the code to compare text
lines just by comparing a single int) is actually more interesting for
the algorithm than the hash table itself.

> (And again, there's a risk of overflow here!)

Fixed.
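
(Again just a sketch of the kind of guard I mean; the updated patch
may do it slightly differently:)

    /* grow hsize to the next power of two, but refuse to wrap around */
    hsize = 1;
    while (hsize < bsize + 1) {
        if (hsize > PY_SSIZE_T_MAX / 2) {
            PyErr_NoMemory();
            return NULL;
        }
        hsize *= 2;
    }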

> > +
> > +    hashtable = (struct bucket *)malloc(sizeof(struct bucket) * hsize);
> > +    if (hashtable == NULL) {
> > +        PyErr_NoMemory();
> > +        return NULL;
> > +    }
> > +
> > +    for (i = 0; i < hsize; i++) {
> > +        hashtable[i].an = 0;
> > +        hashtable[i].bn = 0;
> > +        hashtable[i].ahead = -1;
> > +        hashtable[i].bhead = -1;
> > +    }
> > +    hsize--;
> > +
> > +    /* add lines from b to the hash table chains */
> > +    for (i = bsize - 1; i >= 0; i--) {
> > +        /* find the first h entry, which is either empty or contains
> > +           the same line as b[i] */
> > +        for (j = b[i].hash & hsize; hashtable[j].bhead != -1; j = (j + 1) & hsize)
> > +            if (!compare_lines(b + i, b + hashtable[j].bhead))
> > +                break;
> 
> Oh boy :)

I know, that describes this part of the code pretty accurately :)

> This is exactly the sort of code I hope we can avoid by using Python dicts :)
> 
> I don't do much C, but I'd be more comfortable with braces in those for and if
> statements, to discourage easy mistakes...

Done, and added more comments. Hopefully it's clearer now.

> [...]
> > +        /* find the first h entry, which is either empty or contains
> > +           the same line as a[i] */
> > +        for (j = a[i].hash & hsize; hashtable[j].bhead != -1; j = (j + 1) & hsize)
> > +            if (!compare_lines(a + i, b + hashtable[j].bhead))
> > +                break;
> 
> Hmm, this duplicated logic is unfortunate.  If we do keep it, I'd like to see it
> factored out somehow... code this complex should only exist in one place!

Moved to a separate function.
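
The shared probing loop now lives in a single helper, roughly like
this (the name and exact signature here are illustrative):

    /* Sketch only.  Probe the open-addressed table starting at the
       line's hash slot until we find either an empty bucket or one
       whose chain head is an equal line from b; return its index. */
    static inline Py_ssize_t
    find_equivalence_class(struct bucket *hashtable, Py_ssize_t hmask,
                           struct line *b, struct line *line)
    {
        Py_ssize_t j;
        for (j = line->hash & hmask; hashtable[j].b_head != -1;
             j = (j + 1) & hmask) {
            if (!compare_lines(line, b + hashtable[j].b_head))
                break;
        }
        return j;
    }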

> [...]
> > +/* Finds longest common subsequence of unique lines in a[alo:ahi] and
> > +   b[blo:bhi].
> > +   Parameter backpointers must have allocated memory for at least
> > +   4 * (bhi - blo) ints. */
> 
> I wouldn't mind a comment somewhere (perhaps inside the function rather than
> right here) elaborating on how you arrive at that 4 * (bhi - blo) figure.  I
> think I can see why, but some explicit reasoning wouldn't hurt.
> 
> > +Py_ssize_t
> > +unique_lcs(struct matching_line *answer, struct bucket *h, Py_ssize_t *backpointers,
> > +           struct line *a, struct line *b, Py_ssize_t alo, Py_ssize_t blo, Py_ssize_t ahi, Py_ssize_t bhi)
> > +{
> > +    Py_ssize_t i, j, k, e, apos, bpos, napos, nbpos, bsize, stacksize;
> > +    Py_ssize_t *stacks, *lasts, *btoa;
> > +
> > +    k = 0;
> > +    stacksize = 0;
> > +    bsize = bhi - blo;
> > +
> > +    stacks = backpointers + bsize;
> > +    lasts = stacks + bsize;
> > +    btoa = lasts + bsize;
> 
> Some explanation of these variables' roles would be good.
> 
> "stacks" and "lasts" have rather different names, but looking at the code the
> difference seems to be "stacks is about lines from a" and "lasts is about lines
> from b".  I guess this terminology is used in the description of the patience
> diff algorithm?

The names come from the Python implementation; I didn't want to
diverge too much from the original code.

> The unusual way you use backpointers for four separate arrays deserves some
> comments too.

I've added some comments there and also a short description of the
algorithm, including notes on which variable contains what.

> > +
> > +    for (i = 0; i < bsize; i++)
> > +        backpointers[i] = -1;
> 
> A brief comment before this for-loop, along the lines of "initialise all the
> backpointers to -1", would be good.
> 
> Why don't you initialise stacks, lasts and btoa as well?

stacks and lasts depend on stacksize, which is initially 0. btoa is
filled at run time.

> > +
> > +    for (bpos = blo; bpos < bhi; bpos++) {
> > +        e = b[bpos].equiv;
> > +
> > +        /* no lines in a or b */
> > +        if (!h[e].an || !h[e].bn)
> > +            continue;
> 
> Ouch, that's a bit too concise.  "h[e].an"?  Please use longer identifiers :)
> 
> > +
> > +        /* find an unique line in a, that matches b[bpos] */
> > +        apos = -1;
> > +        for (j = h[e].ahead; j != -1; j = a[j].next)
> > +            if (alo <= j && j < ahi) {
> > +                if (apos != -1)
> > +                    goto nextb;
> > +                apos = j;
> > +            }
> > +        if (apos == -1)
> > +            continue;
> 
> My C is very rusty, so I'm finding this pretty hard to follow.  Part of it is
> the short variable names.  It would help me if "a" and "b" were "lines_a" and
> "lines_b" for instance.  And I'm no longer in the habit of expecting "i" and "j"
> to be random loop variables :)
> 
> Part of it also is that I don't actually know how the patience algorithm works
> :)
> 
> I think perhaps some comments back up where you define the structs going into a
> little more detail about how they hang together would be helpful.

Added.

> Also, I'm starting to think that a "#define SENTINEL -1" might be a good idea,
> to make the intent of this value a little more transparent.
> 
> I'd insert more comments, too.  e.g. in the "if (apos != -1)" block, say 
> "/* this line isn't unique in a */".
> 
> As far as I can tell, the "i" and "j" loop variables don't have overlapping
> lifetimes in this function.  Why not just use "i" everywhere?  When I see "j" I
> start to expect more complexity, because I assume "i" must already be doing
> something else.

Yep, i was originally used in the outer loop. I've changed j to i.

> > +
> > +        /* find duplicates of b[bpos] in b */
> > +        for (j = h[e].bhead; j != -1; j = b[j].next)
> > +            if (j != bpos && blo <= j && j < bhi)
> > +                goto nextb;
> 
> So this loop doesn't just find duplicates, but it skips to trying the next line
> if it does find one.  The comment could be clearer about what happens when it
> finds a duplicate.
> 
> > +
> > +        nbpos = bpos - blo;
> > +        napos = apos - alo;
> > +        btoa[nbpos] = napos;
> 
> "nbpos"?  You're using "n" as an abbreviation for "length" (or perhaps
> "distance")?  That's a bit odd.  And why is "pos" still in the identifier for a
> length?  I think there are probably better names for these variables.

nbpos was supposed to be an abbreviation of "normalised bpos". I've
changed the name to something more meaningful.

> It's probably clearer if you're familiar with this algorithm, but I'm not.
> 
> > +/* Adds a new line to the list of matching blocks, either extending the
> > +   current block or adding a new one. */
> > +static inline void
> > +add_matching_line(struct matching_block *matches, Py_ssize_t *n, Py_ssize_t a, Py_ssize_t b)
> > +{
> > +    Py_ssize_t t = *n - 1;
> > +    if ((t >= 0) && (a == matches[t].a + matches[t].len) &&
> > +                    (b == matches[t].b + matches[t].len)) {
> > +        matches[t].len++;
> > +    }
> > +    else {
> > +        t++;
> > +        matches[t].a = a;
> > +        matches[t].b = b;
> > +        matches[t].len = 1;
> > +        (*n)++;
> > +    }
> > +}
> 
> What is "n" here, and why does this function sometimes increment it?  If it were
> called "matches_count" its purpose would be more obvious.
> 
> Although I wonder if it would be even clearer to make a struct like:
> 
> struct matching_blocks {
>     struct matching_block *matches;
>     Py_ssize_t count;
> }
> 
> And pass in a reference to that as a single "object", rather than the pair of
> "*matches" and "*n", which are conceptually part of the same thing.

This is a good idea, I've added this structure.
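
With the struct in place, the function looks roughly like this (a
sketch; the version in the updated patch may differ in details):

    struct matching_blocks {
        struct matching_block *matches;
        Py_ssize_t count;
    };

    static inline void
    add_matching_line(struct matching_blocks *blocks, Py_ssize_t a, Py_ssize_t b)
    {
        Py_ssize_t t = blocks->count - 1;
        if (t >= 0 && a == blocks->matches[t].a + blocks->matches[t].len
                   && b == blocks->matches[t].b + blocks->matches[t].len) {
            /* the new pair directly follows the last block: extend it */
            blocks->matches[t].len++;
        } else {
            /* otherwise start a new block of length 1 */
            blocks->matches[t + 1].a = a;
            blocks->matches[t + 1].b = b;
            blocks->matches[t + 1].len = 1;
            blocks->count++;
        }
    }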

> It's not clear to me why it is safe here to assume that matches has enough space
> allocated for t++?

The arrays are always allocated to handle the worst case (in this
case the number of lines in b + 1). They are very short-lived, so it
shouldn't have too bad an effect on RAM usage in general.

> > +static Py_ssize_t
> > +recurse_matches(struct matching_block *answer, Py_ssize_t answersize,
> > +                struct bucket *hashtable, Py_ssize_t *backpointers,
> > +                struct line *a, struct line *b,
> > +                Py_ssize_t alo, Py_ssize_t blo, Py_ssize_t ahi, Py_ssize_t bhi,
> > +                int maxrecursion)
> 
> And recurse_matches takes another pair, *answer/answersize, that would be
> clearer I think using a "struct matching_blocks".  Grouping these sorts of
> intimately related variables into structs helps keep the number of parameters to
> these functions down as well.
> 
> There may even be a tiny speed boost from pushing less stuff on to the stack
> with each recursive call :)
> 
> [...]
> > +    last_a_pos = alo - 1;
> > +    last_b_pos = blo - 1;
> 
> I'm not sure why you subtract 1 from these values, especially seeing as more
> often than not you seem to add it back on again when you use them later.

This is how it's done in the Python implementation. As I said above,
I wanted to keep the parts of the code that don't have to be changed
as close to the original as possible.

> > +
> > +    lcs = (struct matching_line *)malloc(sizeof(struct matching_line) * (bhi - blo));
> > +    if (lcs == NULL)
> > +        return -1;
> 
> You get brownie points for actually checking the return value of malloc :)
> 
> [...]
> > +        new++;
> 
> I'd just do "new = 1" here.  There's no need to increment, you're just setting a
> flag, so make it clear that that's all you're doing.  This also reduces the
> chances of an arithmetic overflow on insanely large input :)
> 
> > +static Py_ssize_t
> > +load_lines(PyObject *orig, struct line **lines)
> > +{
> > +    Py_ssize_t size, i, j;
> > +    int h;
> > +    char *p;
> > +    struct line *line;
> > +    PyObject *seq, *item;
> > +
> > +    seq = PySequence_Fast(orig, "");
> > +    if (seq == NULL) {
> > +        PyErr_NoMemory();
> > +        return -1;
> > +    }
> 
> I don't think this is right.  If PySequence_Fast fails due to a memory error,
> I'd expect it would already set the error state appropriately, so I don't think
> you should unconditionally convert all errors (including the likely TypeErrors
> if a non-sequence is passed in) into MemoryErrors here.

You are right, PySequence_Fast will set the right exception on its
own if it fails.
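
So the call now reduces to something like this (the message text is
just an example):

    /* PySequence_Fast sets TypeError or MemoryError itself; the second
       argument is only the message used for the TypeError case. */
    seq = PySequence_Fast(orig, "sequence of lines expected");
    if (seq == NULL)
        return -1;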

> > +    for (i = 0; i < size; i++) {
> > +        item = PySequence_Fast_GET_ITEM(seq, i);
> > +        if (!PyString_Check(item)){
> > +            PyErr_Format(PyExc_TypeError,
> > +                     "sequence item %zd: expected string,"
> > +                     " %.80s found",
> > +                     i, item->ob_type->tp_name);
> > +            Py_DECREF(seq);
> > +            return -1;
> > +        }
> > +        line->len = PyString_GET_SIZE(item);
> > +        line->data = p = PyString_AS_STRING(item);
> > +        h = 5381;
> > +        for (j = 0; j < line->len; j++)
> > +            h = ((h << 5) + h) + *p++;
> > +        line->hash = h;
> > +        line->next = -1;
> > +        line++;
> > +    }
> 
> This is another reason to use Python dicts... so you don't have to reinvent
> hashing :)
> 
> At the very least, make a nice symbolic constant for that magic number.  But
> really, you already have these strings as PyObjects, I think you'd be better off
> leaving them as PyStrings rather than building your own hash map data structure.

Except for the function equate_lines, the code doesn't work with strings
at all. It only compares ints.

I've experimented with many hash functions, including the one from
PyString and other supposedly good ones. In this case, where the hash
is not used for actual table lookups, they are not that useful, and on
the other hand they are slower to calculate.
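
(For what it's worth, the magic number is just the seed of the djb2
string hash; naming it, as you suggest, would look something like
this:)

    /* djb2 string hash (Dan Bernstein): h = h * 33 + c, seeded with 5381 */
    #define DJB2_INITIAL_HASH 5381

    h = DJB2_INITIAL_HASH;
    for (j = 0; j < line->len; j++)
        h = ((h << 5) + h) + *p++;
    line->hash = h;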

> (And if there's a good reason why I'm wrong, then add comments justifying doing
> it this way to the code!)
> 
> > +static PyObject *
> > +py_unique_lcs(PyObject *self, PyObject *args)
> > +{
> > +    PyObject *aseq, *bseq, *res, *item;
> > +    Py_ssize_t asize, bsize, i, nmatches, *backpointers = NULL;
> > +    struct line *a = NULL, *b = NULL;
> > +    struct matching_line *matches = NULL;
> > +    struct bucket *hashtable = NULL;
> > +
> > +    if (!PyArg_ParseTuple(args, "OO", &aseq, &bseq))
> > +        return NULL;
> > +
> > +    asize = load_lines(aseq, &a);
> > +    bsize = load_lines(bseq, &b);
> > +    if (asize == -1 || bsize == -1)
> > +        goto error;
> 
> Yeah, here you feed python objects from the user straight to load_lines.  You
> definitely should not be blindly assuming errors from PySequence_Fast are memory
> errors!
> 
> > +{
> > +    PyObject *aseq, *bseq, *item, *answer;
> > +    int maxrecursion;
> > +    Py_ssize_t i, j, asize, bsize, nmatches, alo, blo, ahi, bhi;
> > +    Py_ssize_t *backpointers = NULL;
> > +    struct line *a = NULL, *b = NULL;
> > +    struct matching_block *matches = NULL;
> > +    struct bucket *hashtable = NULL;
> > +
> > +#if PY_VERSION_HEX < 0x02050000
> > +    if (!PyArg_ParseTuple(args, "OOiiiiOi", &aseq, &bseq, &alo, &blo,
> > +                          &ahi, &bhi, &answer, &maxrecursion))
> > +#else
> > +    if (!PyArg_ParseTuple(args, "OOnnnnOi", &aseq, &bseq, &alo, &blo,
> > +                          &ahi, &bhi, &answer, &maxrecursion))
> > +#endif
> > +        return NULL;
> 
> I'm a bit nervous about taking values like alo from Python code... it seems like
> it would make it easy for pure python to trick this module into crashing.

This is not code that is used in production; it exists only so the
test suite can test this function separately.

> > +    nn = n + n;
> 
> Heh :)
> 
> I haven't really looked at the Python changes, they mostly seemed to be
> rearranging existing code.

Right.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bzr-cpatience.diff
Type: text/x-patch
Size: 99140 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070803/0aed62e3/attachment-0001.bin 

