[MERGE] Implement guess-renames

Alexander Belchenko bialix at ukr.net
Mon Mar 23 20:25:52 GMT 2009


Thank you very much, it's a brilliant explanation.

John Arbash Meinel writes:
> Alexander Belchenko wrote:
>> John Arbash Meinel writes:
>>> Ali Sabil wrote:
>>>> Hi,
>>>>
>>>> I am wondering what is the difference between this command, and the
>>>> command provided by the automv plugin ?
>>>>
>>>> Cheers,
>>>>
>>>> -- 
>>>> Ali
>>>>
>>> My guess is a different author, and a very different algorithm. :)
>>>
>>> But I haven't ever used automv to know what algorithm it uses. I just
>>> doubt it was an edge based algorithm.
>> What is "edge based algorithm"?
> 
> Rather than using individual lines to determine the match, it uses pairs
> of lines (one can argue that it is the 'edge' between the lines).
> 
> So if one file has
> 
> foo
> bar
> baz
> 
> And the other has
> baz
> foo
> bar
> 
> The "baz" line doesn't match, because it isn't preceded by "bar".
> 
> For this specific case, I'm not sure if it makes a huge difference. In
> theory, an "edge" system has a few different dynamics, mostly because
> edges can "chain": there is overlap between what is being matched. If
> you consider 3 files:
> 
> foo   baz   baz
> bar   foo   bar
> baz   bar   foo
> 
> The texts for 1&2 are more similar than the texts for 1&3 or 2&3. They
> all have the same individual lines, but the lines have all been
> rearranged in 3 (no edge matches) while 1&2 both have "foo => bar".
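
A minimal sketch of that edge idea in Python (an illustration only, not
the plugin's actual code): treat each pair of adjacent lines as an edge,
and count the edges two texts share.

def edges(lines):
    # An "edge" is a pair of adjacent lines.
    return set(zip(lines, lines[1:]))

def edge_similarity(a, b):
    return len(edges(a) & edges(b))

t1 = ["foo", "bar", "baz"]
t2 = ["baz", "foo", "bar"]
t3 = ["baz", "bar", "foo"]

print(set(t1) == set(t2) == set(t3))  # True: line-based matching ties all three
print(edge_similarity(t1, t2))        # 1: foo => bar survives the rearrangement
print(edge_similarity(t1, t3))        # 0: every edge is broken
print(edge_similarity(t2, t3))        # 0
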
> 
> Another way to look at it: take a file text, copy it, and then copy it
> again, only this time move a chunk of text somewhere else in the
> file. A line-based match would see exactly the same lines; an edge-based
> match would see that the second copy broke a couple of the edges, and
> does not match as well.
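
A small concrete version of that, assuming the same adjacent-pair edges
as in the sketch above: only the edges at the cut points break.

base = ["a", "b", "c", "d", "e", "f"]
moved = ["a", "b", "e", "f", "c", "d"]  # the c,d chunk moved to the end

line_match = len(set(base) & set(moved))      # 6: every line still matches
edge_match = len(set(zip(base, base[1:])) &
                 set(zip(moved, moved[1:])))  # 3 of 5 edges survive
print(line_match, edge_match)
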
> 
> 
> I think the big win of Aaron's algorithm is that he hashes every line in
> the unmatched source, and then effectively compares all the unknown
> files against all possible sources. By using a hash and a dict, I think
> he turns an O(N^2) operation into an O(NlogN). Further, he only needs to
> store the hash, rather than the content of each line. Which means he
> might have some false positives, but they will be rare (1 in 10M), and
> should filter out with all the other content being compared.
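
One way that hash-and-dict structure could look. This is a sketch under
my own assumptions (32-bit CRCs, an index from edge hash to source
files); Aaron's real code in bzrlib may differ.

import zlib
from collections import defaultdict

def edge_hashes(lines):
    # Store a 32-bit hash per edge instead of the line text itself;
    # collisions are possible but rare, and get drowned out by the
    # rest of the content being compared.
    return {zlib.crc32(("%s\x00%s" % pair).encode("utf-8"))
            for pair in zip(lines, lines[1:])}

def build_index(sources):
    # Map each edge hash to the set of source files containing it, so
    # every unknown file is scored with dict lookups instead of a full
    # pairwise comparison of file contents.
    index = defaultdict(set)
    for name, lines in sources.items():
        for h in edge_hashes(lines):
            index[h].add(name)
    return index

def score_unknown(index, lines):
    hits = defaultdict(int)
    for h in edge_hashes(lines):
        for name in index.get(h, ()):
            hits[name] += 1
    return hits
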
> 
> Also, he can compare everything-to-everything, and then find the pair
> that matches the best, filter that match out, then find the next pair
> that matches the best, filter those out, etc. This allows you to match
> files that have changed quite a bit (aren't very similar), because it
> isn't likely to be anything else: the more similar ones have already
> been taken.
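
That greedy filtering loop, sketched (again just an illustration of the
idea, with hypothetical names):

def greedy_pairing(scores):
    # scores: {(unknown, source): hit count}. Repeatedly take the
    # best-scoring pair and retire both sides, so a weak match can
    # still win once the stronger candidates are taken.
    pairs = []
    remaining = dict(scores)
    while remaining:
        (unknown, source), _ = max(remaining.items(), key=lambda kv: kv[1])
        pairs.append((unknown, source))
        remaining = {k: v for k, v in remaining.items()
                     if k[0] != unknown and k[1] != source}
    return pairs
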
> 
> Using the number of hits for a hash as a "strength" measurement is good,
> since a file which is pure GPL text will match every line of itself
> against all the other files.
> 
> One thing that could be used to strengthen the algorithm is to do
> bi-directional matching. (Just because text_a is very similar to text_b,
> doesn't mean that text_b is very similar to text_a.)
> 
> Another possibility is to take the hit count for a given file, and
> divide it by the number of edges/lines in that file. This takes an
> advantage away from really big files. As an edge case, imagine if you
> had 4 files, and then a 5th file which was the concatenation of all 4
> files. If any of the 4 files is modified slightly, the 5th file actually
> matches *better* than any of the individual files.
> 
> Bi-directional matching, or normalizing by line count would help
> diminish this problem. But honestly, it isn't a common problem, so I'm
> not sure if it is worthwhile to fix.
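
Both fixes are cheap to sketch (my formulation, using plain edge sets
rather than hashes):

def edges(lines):
    return set(zip(lines, lines[1:]))

def symmetric_score(a, b):
    # Normalize by each side's own edge count and take the minimum:
    # a concatenation of four files explains a modified copy of one
    # of them well, but that copy explains only a quarter of the
    # concatenation, so the real source still wins.
    ea, eb = edges(a), edges(b)
    if not ea or not eb:
        return 0.0
    shared = len(ea & eb)
    return min(shared / float(len(ea)), shared / float(len(eb)))
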
> 
> As another example, consider:
> 
> 1   2   3   4
> -   -   -   -
> a   a   a   a
> b   X   b   b
> c   b   N   c
> d   Y   b
> e   c   c
>     Z
>     d
>     e
> 
> It is pretty arguable which texts should be matched together. I'm pretty
> sure the current algorithm will cause a tie between the 1&3 and 1&4
> texts, because both pairs share the a=>b and b=>c edges. Note, though,
> that a=>b and b=>c are not "strong" edges, because they are common. So
> it might pick 1&2 because d=>e is a "strong" edge, occurring in only 2
> of the texts.
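
Checking that example numerically, with an IDF-style weight standing in
for the hit-count strength (my choice of weight, not necessarily what
the plugin uses):

import math

texts = {
    1: ["a", "b", "c", "d", "e"],
    2: ["a", "X", "b", "Y", "c", "Z", "d", "e"],
    3: ["a", "b", "N", "b", "c"],
    4: ["a", "b", "c"],
}
edge_sets = {n: set(zip(t, t[1:])) for n, t in texts.items()}

# How many texts contain each edge; common edges are "weak".
df = {}
for es in edge_sets.values():
    for e in es:
        df[e] = df.get(e, 0) + 1

def raw(a, b):
    return len(edge_sets[a] & edge_sets[b])

def weighted(a, b):
    return sum(math.log(len(texts) / float(df[e]))
               for e in edge_sets[a] & edge_sets[b])

print(raw(1, 2), raw(1, 3), raw(1, 4))  # 1 2 2: raw counts tie 1&3 with 1&4
print(round(weighted(1, 2), 2), round(weighted(1, 3), 2))
# 0.69 0.58: the rare d=>e edge makes 1&2 the strongest match
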
> 
> John
> =:->



