[MERGE][RFC] C implementation of PatienceSequenceMatcher

Mon Jul 23 17:45:56 BST 2007

On Po, 2007-07-23 at 10:58 -0500, John Arbash Meinel wrote:

> I think your version may be a good starting point, but it needs a small
> improvement to be 'complete'. We probably just need better Patience tests for
> this, but the specific issue is a pattern like:
> 
> a
> b
> c
> d
> b
> c
> e
> 
> versus something like:
> 
> a
> f
> b
> c
> g
> d
> b
> c
> e
> 
> You should be able to see that '+f' and '+g' are the only changes.
> 
> Note that 'b' and 'c' are not globally unique lines. However, they *are* unique
> in the context between 'a' and 'd'.
>
> The way Patience works, is that once it finds the outermost unique lines, it
> then finds inner-unique lines. Which is why 'unique_lcs' rebuilds the
> corresponding mapping.

I know, that's why unique_lcs always goes though the whole list of
unique lines and checks for the range. Even though this still could be
optimized, it's faster than rebuilding the table.

> I think we can do a lot better, by taking the global unique matching, and
> culling it based on what lines are in the restricted range.
> 
> 
> The test for this looks like (in test_recurse_matches):
>         # Even though 'bc' is not unique globally, and is surrounded by
>         # non-matching lines, we should still match, because they are locally
>         # unique
>         test_one('abcdbce', 'afbcgdbce', [(0,0), (1, 2), (2, 3), (3, 5),
>                                           (4, 6), (5, 7), (6, 8)])
> 
> 
> And, in fact, your _c implementation fails in a semi-bad way, by repeating
> matches that existed earlier:
> a = [(0, 0), (1, 2), (1, 2), (3, 5), (3, 5), (3, 5), (3, 5)]
> b = [(0, 0), (1, 2), (2, 3), (3, 5), (4, 6), (5, 7), (6, 8)]
> 
> It also looks like you got the endpoints wrong. I'm guessing it finds the (1,2)
> and (3,5) matches, and then recurses into them, and then *finds* them again,
> rather than only searching inbetween them.
> 
> It also seems to be completely missing the trailing 'bce' match.

Sorry, my bad. Luckily this is only a bug in the Python wrapper for
recurse_matches. On the other hand, I wrapped functions unique_lcs and
recurse_matches *only* for test suite, it's a shame one of them is
buggy.

Initially I had recurse_matches written the way patiencediff.py does it,
which means it returns single-line matches. But later I decided to
combine recurse_matches and _collapse_sequences, because it performs
better than two independant loops, and let it to return triples (a, b,
len). And instead of changing this code in py_recurse_matches:

for (i = 0; i < nmatches; i++) {
    item = Py_BuildValue("nn", matches[i].a, matches[i].b);

to:

for (i = 0; i < nmatches; i++) {
    for (j = 0; j < matches[i].len; j++) {
        item = Py_BuildValue("nn", matches[i].a + j, matches[i].b + j);

I changed it only to:

for (i = 0; i < nmatches; i++) {
    for (j = 0; j < matches[i].len; j++) {
        item = Py_BuildValue("nn", matches[i].a, matches[i].b);

which leads to the mentioned output. But it's not a bug in
PatienceSequenceMatcher.

> And one other testing thing. You always import 'bzrlib._patience_diff_c' at the
> top of the file. But if they don't have the compiled extension, then that will
> fail. And the test suite won't even load.
> 
> You did the right thing in not running the tests if you cannot import that
> module, but you made a small mistake in requiring the module to be present in
> order to do anything.

Fixed, now it loads the module only when it's needed.

> Also, in your test setup you do:
> 
> class TestPatienceDiffLib(TestCase):
> 
>     _patiencediff_module = bzrlib._patiencediff_py
> 
>     def setUp(self):
>         super(TestPatienceDiffLib, self).setUp()
>         self._saved_unique_lcs = bzrlib.patiencediff.unique_lcs
>         self._saved_recurse_matches = bzrlib.patiencediff.recurse_matches
>         self._saved_PatienceSequenceMatcher = \
>             bzrlib.patiencediff.PatienceSequenceMatcher
>         bzrlib.patiencediff.unique_lcs = self._patiencediff_module.unique_lcs
>         bzrlib.patiencediff.recurse_matches = \
>             self._patiencediff_module.recurse_matches
>         bzrlib.patiencediff.PatienceSequenceMatcher = \
>             self._patiencediff_module.PatienceSequenceMatcher
> ^- Here you are overriding the global functions, based on what version you want
> to be testing.
> 
> Which is fine, except you don't add a cleanup to restore them to their original
> values. Modifying global state is only okay if you restore it when you are done.

A few lines below I had:

    def tearDown(self):
        super(TestPatienceDiffLib, self).tearDown()
        bzrlib.patiencediff.unique_lcs = self._saved_unique_lcs
        bzrlib.patiencediff.recurse_matches =
self._saved_recurse_matches
        bzrlib.patiencediff.PatienceSequenceMatcher = \
            self._saved_PatienceSequenceMatcher

But I've changed it, and with the new patch it doesn't monkey-patch the
patiencediff module.

> So -1 for now. We want the test suite to always be able to run, and just give
> warnings about not having the compiled extensions.
> 
> Also, we should have a test that 'bzrlib.patiencediff.unique_lcs' is the
> compiled version if we have it available.  (See my tests in
> test__dirstate_helpers.py).

Thanks, added.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bzr-patiencediff-c-2.diff
Type: text/x-patch
Size: 89860 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070723/d67b917d/attachment-0001.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Toto je =?ISO-8859-1?Q?digit=E1lne?=
	=?ISO-8859-1?Q?_podp=EDsan=E1?= =?UTF-8?Q?_=C4=8Das=C5=A5?=
	=?ISO-8859-1?Q?_spr=E1vy?=
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070723/d67b917d/attachment-0001.pgp