[RFC] Custom diff algorithm for Inventories

Sun Dec 3 21:34:52 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Right now, the bottlenecks are the time inserting into knits, and
> handling the inventory. (Serializing it to XML, diffing it against the
> previous XML, and inserting it into the inventory.knit).
> 
> My feeling is that the best way to handle the inventory is to actually
> split it up, which is what we've discussed for a while, and is what I'll
> be working on. But I was thinking there might be a short term improvement.

At first I thought you were looking at ways to optimize tree comparison.
 This wouldn't really help there, except maybe for comparing
RevisionTrees.  You're actually trying to optimize knit creation, right?

I agree with your notion that diffing sorted data should be cheaper than
diffing unsorted data.  I think it's O(n) complexity.

The algorithm I would use is

1. read a line from text A into current-a
2. read a line from text B into current-b
3. if current-a == current-b:
  current-a is common.  Goto 1
5. compare the lines
6. if current-a > current-b:
  current-b is unique to text B.  read a line from text B into
  current-b. goto 3.
7. if current-b > current-a:
  current-a is unique to text A.  read a line from text A into
  current-a. goto 3.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFc0L70F+nu1YWqI0RAg/oAJ9deLXEw+o9L9n1TbCUPtrFPZhI8QCfcsCb
mqHms5gvNSk95gqDrA1GSbA=
=cC2x
-----END PGP SIGNATURE-----