[MERGE] Implement guess-renames

Mon Mar 23 17:03:22 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> John Arbash Meinel пишет:
>> Ali Sabil wrote:
>>> Hi,
>>>
>>> I am wondering what is the difference between this command, and the
>>> command provided by the automv plugin ?
>>>
>>> Cheers,
>>>
>>> -- 
>>> Ali
>>>
>>
>> My guess is a different author, and a very different algorithm. :)
>>
>> But I haven't ever used automv to know what algorithm it uses. I just
>> doubt it was an edge based algorithm.
> 
> What is "edge based algorithm"?

Rather than using individual lines to determine the match, it uses pairs
of lines (one can argue that it is the 'edge' between the lines).

So if one file has

foo
bar
baz

And the other has
baz
foo
bar

The "baz" line doesn't match, because it isn't preceded by "bar".

For this specific case, I'm not sure if it makes a huge difference. In
theory an "edge" system has a few different dynamics. Mostly due to the
fact that edges can "chain", because there is overlap between what is
being matched. If you consider 3 files:

foo   baz   baz
bar   foo   bar
baz   bar   foo

The texts for 1&2 are more similar than the texts for 1&3 or 2&3. They
all have the same individual lines, but the lines have all been
rearranged in 3 (no edge matches) while 1&2 both have "foo => bar".

Another way to look at it. Take a file text, copy it, and then copy it
again, only this time you move a chunk of text somewhere else in the
file. A line-based match would see exactly the same lines, an edge-based
match would see that one had broken a couple of the edges, and did not
match as well.

I think the big win of Aaron's algorithm is that he hashes every line in
the unmatched source, and then effectively compares all the unknown
files against all possible sources. By using a hash and a dict, I think
he turns an O(N^2) operation into an O(NlogN). Further, he only needs to
store the hash, rather than the content of each line. Which means he
might have some false positives, but they will be rare (1 in 10M), and
should filter out with all the other content being compared.

Also, he can compare everything-to-everything, and then find the pair
that matches the best, filter that match out, then find the next pair
that matches the best, filter those out, etc. This allows you to match
files that have changed quite a bit (aren't very similar), because you
know it isn't likely to be anything else (because the more similar ones
have already been taken).

Using the number of hits for a hash as a "strength" measurement is good,
since a file which is pure GPL text will match every line of self
against all other files.

One thing that could be used to strengthen the algorithm is to do
bi-directional matching. (Just because text_a is very similar to text_b,
doesn't mean that text_b is very similar to text_a.)

Another possibility is to take the hit count for a given file, and
divide it by the number of edges/lines in that file. This takes an
advantage away from really big files. As an edge case, imagine if you
had 4 files, and then a 5th file which was the concatenation of all 4
files. If any of the 4 files is modified slightly, the 5th file actually
matches *better* than any of the individual files.

Bi-directional matching, or normalizing by line count would help
diminish this problem. But honestly, it isn't a common problem, so I'm
not sure if it is worthwhile to fix.

As another example, consider:

a   a   a   a
b   X   b   b
c   b   N   c
d   Y   b
e   c   c
    Z
    d
    e

It is pretty arguable which texts should be matched together. I'm pretty
sure the current algorithm will cause a tie between the 1&3 and 1&4
text, because they both have both a=>b b=>c edges. Note, though, that
a=>b and b=>c are not "strong" edges, because they are common. So it
might pick 1&2 because d=>e is a "strong" edge, only occurring 2 times.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknHwNoACgkQJdeBCYSNAAOU4gCeLNfY9MsKnY4MgrVGJyDLKZtw
Sq4AniigEMtq7gTQKvBRj5uuOYn0Z1U5
=Rn+C
-----END PGP SIGNATURE-----