[MERGE] Implement guess-renames
Alexander Belchenko
bialix at ukr.net
Mon Mar 23 20:25:52 GMT 2009
Thank you very much, it's a brilliant explanation.
John Arbash Meinel writes:
> Alexander Belchenko wrote:
>> John Arbash Meinel writes:
>>> Ali Sabil wrote:
>>>> Hi,
>>>>
>>>> I am wondering what is the difference between this command, and the
>>>> command provided by the automv plugin ?
>>>>
>>>> Cheers,
>>>>
>>>> --
>>>> Ali
>>>>
>>> My guess is a different author, and a very different algorithm. :)
>>>
>>> But I haven't ever used automv to know what algorithm it uses. I just
>>> doubt it was an edge based algorithm.
>> What is "edge based algorithm"?
>
> Rather than using individual lines to determine the match, it uses pairs
> of lines (one can argue that it is the 'edge' between the lines).
>
> So if one file has
>
> foo
> bar
> baz
>
> And the other has
> baz
> foo
> bar
>
> The "baz" line doesn't match, because it isn't preceded by "bar".
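
To make the idea concrete, here is a rough Python sketch of matching "edges"
instead of individual lines. The edges() helper is purely illustrative, not
code from the plugin or bzrlib:

    def edges(lines):
        # An "edge" is a pair of adjacent lines; matching edges instead
        # of individual lines makes the comparison sensitive to reordering.
        return set(zip(lines, lines[1:]))

    one = ["foo", "bar", "baz"]
    two = ["baz", "foo", "bar"]

    # Every individual line is shared, but only the foo->bar edge
    # survives; bar->baz is broken because "baz" moved to the front.
    print(edges(one) & edges(two))   # {('foo', 'bar')}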
>
> For this specific case, I'm not sure if it makes a huge difference. In
> theory an "edge" system has a few different dynamics, mostly because
> edges can "chain": there is overlap between what is being matched. If
> you consider 3 files:
>
>   1:   2:   3:
>   foo  baz  baz
>   bar  foo  bar
>   baz  bar  foo
>
> The texts for 1&2 are more similar than the texts for 1&3 or 2&3. They
> all have the same individual lines, but the lines have all been
> rearranged in 3 (no edge matches) while 1&2 both have "foo => bar".
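
Continuing the illustrative edges() sketch from above, the three texts come
out exactly as described: all the individual lines match, but only 1&2 share
an edge.

    one   = ["foo", "bar", "baz"]
    two   = ["baz", "foo", "bar"]
    three = ["baz", "bar", "foo"]

    print(len(edges(one) & edges(two)))     # 1  (the foo->bar edge)
    print(len(edges(one) & edges(three)))   # 0
    print(len(edges(two) & edges(three)))   # 0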
>
> Another way to look at it: take a file's text, copy it, and then copy
> it again, only this time move a chunk of text somewhere else in the
> file. A line-based match would see exactly the same lines in both
> copies; an edge-based match would see that the second copy has broken
> a couple of edges and so does not match as well.
>
>
> I think the big win of Aaron's algorithm is that he hashes every line in
> the unmatched source, and then effectively compares all the unknown
> files against all possible sources. By using a hash and a dict, I think
> he turns an O(N^2) operation into an O(N log N) one. Further, he only
> needs to store the hash rather than the content of each line, which
> means he might have some false positives, but they will be rare (1 in
> 10M) and should filter out with all the other content being compared.
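
A minimal sketch of that hash-and-dict idea, with made-up names and
structure (an assumption about the shape of the approach, not Aaron's
actual code):

    from collections import defaultdict

    def line_hashes(text):
        # Keep only a small hash per line, not the line contents.
        return [hash(line) for line in text.splitlines()]

    def build_index(sources):
        # Map each line hash to the source files containing that line,
        # so unknown files can be scored with dict lookups instead of
        # pairwise line-by-line comparison.
        index = defaultdict(set)
        for name, text in sources.items():
            for h in line_hashes(text):
                index[h].add(name)
        return index

    def score_against_sources(index, unknown_text):
        # Count how many of the unknown file's lines each source shares.
        hits = defaultdict(int)
        for h in line_hashes(unknown_text):
            for name in index.get(h, ()):
                hits[name] += 1
        return hits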
>
> Also, he can compare everything-to-everything, and then find the pair
> that matches the best, filter that match out, then find the next pair
> that matches the best, filter those out, etc. This allows you to match
> files that have changed quite a bit (aren't very similar), because you
> know it isn't likely to be anything else (because the more similar ones
> have already been taken).
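
The "best pair first" selection might look like this hypothetical sketch,
reusing the scores from the previous snippet:

    def greedy_pairing(scores):
        # scores: {(unknown_path, source_path): similarity}
        # Repeatedly take the strongest remaining pair and retire both
        # files, so a weak match can still win once the more similar
        # candidates have been claimed.
        pairs = []
        remaining = dict(scores)
        while remaining:
            (unknown, source), _ = max(remaining.items(),
                                       key=lambda kv: kv[1])
            pairs.append((unknown, source))
            remaining = dict((k, v) for k, v in remaining.items()
                             if k[0] != unknown and k[1] != source)
        return pairs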
>
> Using the number of hits for a hash as a "strength" measurement is good,
> since a file which is pure GPL text will match every line of itself
> against all other files.
>
> One thing that could be used to strengthen the algorithm is to do
> bi-directional matching. (Just because text_a is very similar to text_b,
> doesn't mean that text_b is very similar to text_a.)
>
> Another possibility is to take the hit count for a given file, and
> divide it by the number of edges/lines in that file. This takes an
> advantage away from really big files. As an edge case, imagine if you
> had 4 files, and then a 5th file which was the concatenation of all 4
> files. If any of the 4 files is modified slightly, the 5th file actually
> matches *better* than any of the individual files.
>
> Bi-directional matching, or normalizing by line count would help
> diminish this problem. But honestly, it isn't a common problem, so I'm
> not sure if it is worthwhile to fix.
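
For what it's worth, normalizing and bi-directional scoring could be
expressed as something like the following; the formula is just one
plausible choice, not anything taken from the plugin:

    def normalized_score(shared_hits, unknown_line_count, source_line_count):
        # Dividing the raw hit count by each file's size keeps a huge
        # file (e.g. a concatenation of several sources) from winning
        # simply by containing everything; averaging both directions
        # gives a crude bi-directional measure.
        forward = shared_hits / float(max(source_line_count, 1))
        backward = shared_hits / float(max(unknown_line_count, 1))
        return (forward + backward) / 2.0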
>
> As another example, consider:
>
>   1:  2:  3:  4:
>   a   a   a   a
>   b   X   b   b
>   c   b   N   c
>   d   Y   b
>   e   c   c
>       Z
>       d
>       e
>
> It is pretty arguable which texts should be matched together. I'm pretty
> sure the current algorithm will cause a tie between the 1&3 and 1&4
> texts, because they both have the a=>b and b=>c edges. Note, though,
> that a=>b and b=>c are not "strong" edges, because they are common. So
> it might pick 1&2 because d=>e is a "strong" edge, only occurring 2
> times.
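
One way to express that "strength" idea, again only as an illustrative
sketch: weight each edge by how rare it is across all candidate texts.

    from collections import Counter

    def edge_strengths(all_texts):
        # An edge present in many files (a->b and b->c above) is weak
        # evidence; a rare edge (d->e, in only two of the files) is
        # strong evidence that two texts are really the same file.
        counts = Counter()
        for lines in all_texts:
            counts.update(set(zip(lines, lines[1:])))
        return dict((edge, 1.0 / n) for edge, n in counts.items())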
>
> John
> =:->