Longest Common Subsequences code
Martin Pool
mbp at canonical.com
Thu Nov 2 08:08:38 GMT 2006
On 1 Nov 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> > The Hirschberg algorithm and unix diff (Hunt and McIlroy) both produce
> > minimal diffs. They have different characteristics. If necessary (or if
> > someone is really interested), we can try them.
> >
>
> Well, when you have non-unique lines, minimal is much harder to
> establish, because a given section could go either way.
> Also, Aaron touched on the fact that we don't really care about minimal
> diffs. The properties we would like are:
>
> 1) Delta compression of full-texts. Minimal diffs would make this
> better, but I would guess only marginally so. ATM we are gzip
> compressing the final hunks anyway. The most important thing is that a
> change of 2 lines to a 100Kline source file should be ~2-lines, not
> 100Klines. And I think all algorithms will give us that.
>
> 2) Good performance, on both small and large N. We don't want an O(N^3)
> algorithm. But we should keep in mind that most of our source is in the
> 1-10K line range. At those levels constant factors can easily dominate.
> I wouldn't want to see us tuning for 100K line files and lose
> performance on small files.
>
> 3) Human-friendly deltas. When you run "bzr diff" you are meaning to
> review your changes, and possibly send them to someone else. Doubly so
> if you are using "bzr bundle". Which means that having the absolute
> minimum diff isn't always better than having a larger diff that is more
> understandable.
It's worth mentioning that we don't necessarily need all of these at the
same time. When we're showing a diff to a person from bzr diff, or
putting it as the summary in a bundle, or using it to generate
annotations, then it's worth doing more work to make it align with the
semantic changes. (For example
> def original():
> + a line of code
> + another line of code
> +
> +def copy():
> a line of code
> another line of code
>
>From what I read of the Hirschberg algorithm it will not do particularly
well at this, or at least it's not a particular criteria.
When storing texts however we care much more about size and speed, and
it doesn't really matter if the diff makes sense to a human.
--
Martin
More information about the bazaar
mailing list