Longest Common Subsequences code

Thu Nov 2 08:08:38 GMT 2006

On  1 Nov 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:

> > The Hirschberg algorithm and unix diff (Hunt and McIlroy) both produce
> > minimal diffs. They have different characteristics. If necessary (or if
> > someone is really interested), we can try them.
> > 
> 
> Well, when you have non-unique lines, minimal is much harder to
> establish, because a given section could go either way.
> Also, Aaron touched on the fact that we don't really care about minimal
> diffs. The properties we would like are:
> 
> 1) Delta compression of full-texts. Minimal diffs would make this
> better, but I would guess only marginally so. ATM we are gzip
> compressing the final hunks anyway. The most important thing is that a
> change of 2 lines to a 100Kline source file should be ~2-lines, not
> 100Klines. And I think all algorithms will give us that.
> 
> 2) Good performance, on both small and large N. We don't want an O(N^3)
> algorithm. But we should keep in mind that most of our source is in the
> 1-10K line range. At those levels constant factors can easily dominate.
>  I wouldn't want to see us tuning for 100K line files and lose
> performance on small files.
> 
> 3) Human-friendly deltas. When you run "bzr diff" you are meaning to
> review your changes, and possibly send them to someone else. Doubly so
> if you are using "bzr bundle". Which means that having the absolute
> minimum diff isn't always better than having a larger diff that is more
> understandable.

It's worth mentioning that we don't necessarily need all of these at the
same time.  When we're showing a diff to a person from bzr diff, or
putting it as the summary in a bundle, or using it to generate
annotations, then it's worth doing more work to make it align with the
semantic changes.  (For example

>  def original():
> +  a line of code
> +  another line of code
> +
> +def copy():
>    a line of code
>    another line of code
> 

>From what I read of the Hirschberg algorithm it will not do particularly
well at this, or at least it's not a particular criteria.

When storing texts however we care much more about size and speed, and
it doesn't really matter if the diff makes sense to a human.

-- 
Martin