[RFC] Custom diff algorithm for Inventories

Sun Dec 3 22:36:34 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>> Right now, the bottlenecks are the time inserting into knits, and
>>> handling the inventory. (Serializing it to XML, diffing it against the
>>> previous XML, and inserting it into the inventory.knit).
>>>
>>> My feeling is that the best way to handle the inventory is to actually
>>> split it up, which is what we've discussed for a while, and is what I'll
>>> be working on. But I was thinking there might be a short term improvement.
> 
> At first I thought you were looking at ways to optimize tree comparison.
>  This wouldn't really help there, except maybe for comparing
> RevisionTrees.  You're actually trying to optimize knit creation, right?

Well, eventually I want to optimize tree comparison. But this algorithm
is just about knit creation.

> 
> I agree with your notion that diffing sorted data should be cheaper than
> diffing unsorted data.  I think it's O(n) complexity.
> 
> The algorithm I would use is
> 
> 1. read a line from text A into current-a
> 2. read a line from text B into current-b
> 3. if current-a == current-b:
>   current-a is common.  Goto 1
> 5. compare the lines
> 6. if current-a > current-b:
>   current-b is unique to text B.  read a line from text B into
>   current-b. goto 3.
> 7. if current-b > current-a:
>   current-a is unique to text A.  read a line from text A into
>   current-a. goto 3.
> 
> Aaron

The problem is that the lines aren't textually sorted. They are sorted
by path. And we don't actually include the full path in the line, just
the basename and directory id. So it involves parsing a lot of the line
to figure out whether one is > the other one.

It certainly could still be worth it. I was trying to think of something
that could work with as little actual parsing of the line text.

But we probably could do something with a similar regex that looked for
the file_id, name, and parent_id.

It means we would have to extract the file id for every line, or at
least for every directory entry. You could tell if a line didn't match
without it, but you don't know which one is greater unless you track all
of the directories. I suppose you would only have to keep track of the
current stack of directories, though.

I might just play with this anyway. Thanks for the reminder.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFc1FyJdeBCYSNAAMRAtZmAJ9Myeyow4wCcOwnnshX3zjVCO66HACfZQAe
SvOHmyQc19/QKlWm56TR5p0=
=enY/
-----END PGP SIGNATURE-----