[RFC] Custom diff algorithm for Inventories

John Arbash Meinel john at arbash-meinel.com
Sun Dec 3 14:47:40 GMT 2006


Over this last week, I've been working on converting some larger
projects into bzr. I wrote an importer that uses the output from cvsps
to import data into bzr. I spent a lot of time optimizing it, and now it
is quite fast. For those interested, it is here:
  https://launchpad.net/products/bzr-cvsps-import

Anyway, I've pretty much optimized all I can from the outside, and now
I'm looking at the internals of bzr to see how we can make this sort of
thing even faster, especially since that will improve performance
across the board.

Right now, the bottlenecks are the time spent inserting into knits and
handling the inventory (serializing it to XML, diffing it against the
previous XML, and inserting it into the inventory.knit).

My feeling is that the best way to handle the inventory is to actually
split it up, which is what we've discussed for a while, and is what I'll
be working on. But I was thinking there might be a short-term improvement.

Basically, we use the same diff algorithm for inventories that we use
for all other data types. But inventory texts have a few basic
properties that we should be able to exploit to improve diff performance.

1) They are written in sorted order: ['a', 'b/', 'b/c', 'd'] is the
order that we write them in. We have a small problem in that we don't
write the complete path, so we would probably need higher-level
understanding to utilize this fully, but at the very least, the fact
that the file is sorted should affect how we diff it.

2) If the 'file_id' matches, we have a match, even if the rest of the
line doesn't, which means we don't need to keep searching for another
potential match for that line. Lines are written out as:

  <directory file_id="XXYYZZ"
  or
  <file file_id="XXYYZZ"
  or
  <file executable="yes" file_id="XXYYZZ"

But basically a regex like:
  r'file_id="(?P<file_id>[^"]+)"'
should be able to extract the file_id from each line without needing
to know too much about the rest of the line.
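
As a quick sanity check, here is that regex pulling file_ids out of a
couple of invented sample lines (note that it has to be search()
rather than match(), since the attribute is not at the start of the
line):

import re

file_id_re = re.compile(r'file_id="(?P<file_id>[^"]+)"')

# Invented sample lines, shaped like the examples above.
for line in ['<directory file_id="dir-XXYYZZ" name="src" />',
             '<file executable="yes" file_id="file-AABBCC" name="x" />']:
    match = file_id_re.search(line)
    if match:
        print(match.group('file_id'))
# Prints dir-XXYYZZ, then file-AABBCC.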

Because of the sorting and the file_id matching, it is unlikely that
we would have to search very far for matches. And I also think we can
optimize for the common case, where the only things that change are
the actual texts, and entries aren't being added, deleted, or renamed.

My idea is to have something like:

import re

# The regex from above: pull the file_id attribute out of a line
# without knowing anything else about the line. Note search(), not
# match(): the attribute is not at the start of the line.
file_id_re = re.compile(r'file_id="(?P<file_id>[^"]+)"')

source_offset = 0
target_offset = 0
n_source_lines = len(source_lines)
n_target_lines = len(target_lines)
source = source_lines[source_offset] if source_lines else None
target = target_lines[target_offset] if target_lines else None

diff_lines = []
unknown_source_lines = {}  # file_id -> (offset, line)
unknown_target_lines = {}  # file_id -> (offset, line)

source_file_id = None
target_file_id = None

while source is not None and target is not None:
  line_matched = False
  if source == target:
    line_matched = True
  else:
    # extract the file_ids for this line
    if source_file_id is None:
      source_file_id_match = file_id_re.search(source)
      if source_file_id_match:
        source_file_id = source_file_id_match.group('file_id')
    if target_file_id is None:
      target_file_id_match = file_id_re.search(target)
      if target_file_id_match:
        target_file_id = target_file_id_match.group('file_id')
    if source_file_id == target_file_id:
      # same entry, different text: record it as a modification
      diff_lines.append((target_offset, source_offset, target, source))
      line_matched = True

  if line_matched:
    source_offset += 1
    source_file_id = None
    source = (source_lines[source_offset]
              if source_offset < n_source_lines else None)
    target_offset += 1
    target_file_id = None
    target = (target_lines[target_offset]
              if target_offset < n_target_lines else None)
  else:
    # Check for a matching line by file_id among the lines we have
    # already buffered. If we find a match, add it to diff_lines,
    # remove it from the unknown dict, and increment the side that
    # matched.
    if (source_file_id is not None
        and source_file_id in unknown_target_lines):
      t_offset, t_line = unknown_target_lines.pop(source_file_id)
      diff_lines.append((t_offset, source_offset, t_line, source))
      source_offset += 1
      source_file_id = None
      source = (source_lines[source_offset]
                if source_offset < n_source_lines else None)
    elif (target_file_id is not None
          and target_file_id in unknown_source_lines):
      s_offset, s_line = unknown_source_lines.pop(target_file_id)
      diff_lines.append((target_offset, s_offset, target, s_line))
      target_offset += 1
      target_file_id = None
      target = (target_lines[target_offset]
                if target_offset < n_target_lines else None)
    else:
      # No match yet: buffer both lines by file_id and advance both
      # sides, so the loop always makes progress.
      if source_file_id is not None:
        unknown_source_lines[source_file_id] = (source_offset, source)
      if target_file_id is not None:
        unknown_target_lines[target_file_id] = (target_offset, target)
      source_offset += 1
      source_file_id = None
      source = (source_lines[source_offset]
                if source_offset < n_source_lines else None)
      target_offset += 1
      target_file_id = None
      target = (target_lines[target_offset]
                if target_offset < n_target_lines else None)

# Now we're done with the loop: move all of the unmatched stuff (the
# leftover tail of either side, plus anything still in the unknown
# dicts) into diff_lines, pairing leftovers by file_id where possible
# and emitting the rest as pure additions and deletions.
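
For concreteness, here is the sort of input the loop expects (you
would define these before running the sketch above). The file_ids and
attributes are invented for illustration:

# Invented inventory fragments: one entry's text changed, one entry
# added at the end.
source_lines = [
    '<file file_id="a-1" name="a" parent_id="root" text_sha1="aaa" />',
    '<file file_id="b-2" name="b" parent_id="root" text_sha1="bbb" />',
]
target_lines = [
    '<file file_id="a-1" name="a" parent_id="root" text_sha1="aaa" />',
    '<file file_id="b-2" name="b" parent_id="root" text_sha1="ccc" />',
    '<file file_id="d-4" name="d" parent_id="root" text_sha1="ddd" />',
]
# After the loop, diff_lines holds the single modified "b-2" pair, and
# the cleanup step picks up the added "d-4" line from the target tail.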

At this point, I think you can sort diff_lines and build up the diff.
I'm pretty sure you need the sort: matches pulled out of the unknown
buffers get appended out of order, and since the tuples lead with
target_offset, a plain sort puts everything back into target order.
Though another possibility would be to make 'diff_lines' a dictionary
indexed by target_offset, or something like that.

Anyway, with this algorithm you would do a single pass over the data,
and most of it would be matched right away. So you would only keep a
small unmatched buffer. You only have to extract the file_id for lines
that don't match exactly. And by iterating over both at the same time,
you will try to stay in sync on both sides.

I think there are some small things that I'm missing. Either you
always iterate forward, which means that once the two sides are out of
sync, every line goes into the unmatched buffer and then gets pulled
out again. Or you might get stuck on one side with an entry that will
never match. Maybe a simple 'times_checked' counter would be enough:
if an entry exceeds the threshold, you move forward on that side
looking for matches, and then go back to the standard algorithm.
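
If it helps, that escape hatch could be as small as the following (the
names and threshold here are invented):

# Invented helper: count how many times a buffered entry has failed
# to match, and give up waiting on it once it crosses a threshold.
MAX_TIMES_CHECKED = 10

times_checked = {}

def keep_waiting(file_id):
    # Once this returns False, the caller would skip forward on that
    # side looking for matches, then drop back into the standard loop.
    times_checked[file_id] = times_checked.get(file_id, 0) + 1
    return times_checked[file_id] <= MAX_TIMES_CHECKED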

Anyway, I haven't thought out all the details, but I *think* we could
make a diff operation that is a lot faster than what we have today.

Another possibility would be to pass PatienceMatcher a flag telling it
not to recurse very deeply (perhaps not at all), because we should
never get non-unique lines in an inventory. Every line should be
different, which means that if you didn't get a match on the first
pass, you aren't going to get one in a smaller pass.
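
A rough sketch of why the recursion should be unnecessary (my own
approximation, not the real PatienceMatcher code, and it takes a
greedy shortcut instead of the proper longest-increasing-subsequence
step):

def match_unique_lines(source_lines, target_lines):
    """One pass, no recursion, for texts whose lines are all unique.

    Because an inventory line occurs at most once in each text, the
    first pass already finds every line that can match; recursing
    into the unmatched gaps can never turn up anything new.
    """
    source_index = {}
    for i, line in enumerate(source_lines):
        source_index[line] = i
    matches = []
    last_i = -1
    for j, line in enumerate(target_lines):
        i = source_index.get(line, -1)
        if i > last_i:  # greedily keep matches in increasing order
            matches.append((i, j))
            last_i = i
    return matches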

I think something like this would be generally useful, even if we change
the inventory format in the future. But more importantly, it would be an
easier thing to write and validate than the complete change to the
inventory that I'm working on. Is anyone interested in playing around
with this?

John
=:->

