[MERGE][RFC] C implementation of PatienceSequenceMatcher

John Arbash Meinel john at arbash-meinel.com
Mon Jul 23 16:58:16 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lukáš Lalinský wrote:
> On Ne, 2007-07-22 at 09:05 -0700, Martin Pool wrote:
> 
>> You labeled this just 'rfc' - is this ready to be reviewed?
> 
> I had some time today and I think I've got the code into a state I'm
> happy with. Attaching the code and a modified patience_test plugin I
> used for "mass benchmarking".
> 
> The base of the algorithm is identical to the original Python
> implementation. The only difference is in finding unique lines. The
> Python implementation uses dicts for this and builds them on every
> unique_lcs call. The C version pre-builds a hash table of lines from `b`
> and their equivalents from `a`. Each entry in this table contains a
> linked list of lines from `b` and their matching lines from `a`. Lines
> from `a` which are not also in `b` are not added to the table at all.
> 
> Function unique_lcs then only goes through the hash table, checks how
> many lines within the specified range is in the linked lists. If both
> numbers are 1, it continues with the standard patience sorting
> algorithm.
> 


I think your version may be a good starting point, but it needs a small
improvement to be 'complete'. We probably just need better Patience tests for
this, but the specific issue is a pattern like:

a
b
c
d
b
c
e

versus something like:

a
f
b
c
g
d
b
c
e

You should be able to see that '+f' and '+g' are the only changes.

Note that 'b' and 'c' are not globally unique lines. However, they *are* unique
in the context between 'a' and 'd'.

The way Patience works, is that once it finds the outermost unique lines, it
then finds inner-unique lines. Which is why 'unique_lcs' rebuilds the
corresponding mapping.

I think we can do a lot better, by taking the global unique matching, and
culling it based on what lines are in the restricted range.


The test for this looks like (in test_recurse_matches):
        # Even though 'bc' is not unique globally, and is surrounded by
        # non-matching lines, we should still match, because they are locally
        # unique
        test_one('abcdbce', 'afbcgdbce', [(0,0), (1, 2), (2, 3), (3, 5),
                                          (4, 6), (5, 7), (6, 8)])


And, in fact, your _c implementation fails in a semi-bad way, by repeating
matches that existed earlier:
a = [(0, 0), (1, 2), (1, 2), (3, 5), (3, 5), (3, 5), (3, 5)]
b = [(0, 0), (1, 2), (2, 3), (3, 5), (4, 6), (5, 7), (6, 8)]

It also looks like you got the endpoints wrong. I'm guessing it finds the (1,2)
and (3,5) matches, and then recurses into them, and then *finds* them again,
rather than only searching inbetween them.

It also seems to be completely missing the trailing 'bce' match.



And one other testing thing. You always import 'bzrlib._patience_diff_c' at the
top of the file. But if they don't have the compiled extension, then that will
fail. And the test suite won't even load.

You did the right thing in not running the tests if you cannot import that
module, but you made a small mistake in requiring the module to be present in
order to do anything.

Also, in your test setup you do:

class TestPatienceDiffLib(TestCase):

    _patiencediff_module = bzrlib._patiencediff_py

    def setUp(self):
        super(TestPatienceDiffLib, self).setUp()
        self._saved_unique_lcs = bzrlib.patiencediff.unique_lcs
        self._saved_recurse_matches = bzrlib.patiencediff.recurse_matches
        self._saved_PatienceSequenceMatcher = \
            bzrlib.patiencediff.PatienceSequenceMatcher
        bzrlib.patiencediff.unique_lcs = self._patiencediff_module.unique_lcs
        bzrlib.patiencediff.recurse_matches = \
            self._patiencediff_module.recurse_matches
        bzrlib.patiencediff.PatienceSequenceMatcher = \
            self._patiencediff_module.PatienceSequenceMatcher
^- Here you are overriding the global functions, based on what version you want
to be testing.

Which is fine, except you don't add a cleanup to restore them to their original
values. Modifying global state is only okay if you restore it when you are done.


So -1 for now. We want the test suite to always be able to run, and just give
warnings about not having the compiled extensions.

Also, we should have a test that 'bzrlib.patiencediff.unique_lcs' is the
compiled version if we have it available.  (See my tests in
test__dirstate_helpers.py).


I haven't really reviewed your code yet, but this should give you some things
to start on.

Basically, you should make sure that:

'rm bzrlib/*.so; ./bzr selftest' still runs, and just gives you some "XX tests
could not be run because of missing dependencies".

And that those go away with 'make; ./bzr selftest'

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFGpNAYJdeBCYSNAAMRAjesAJ4w4ftcYb5Qs3DNe7YwIh8+EswE6ACYomXL
RiCxK5nOmnnq3ATw9l0Jyw==
=xZDw
-----END PGP SIGNATURE-----



More information about the bazaar mailing list