Handling repeated text annotations

Thu Feb 14 20:01:44 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So the knit code is actually "right-most always wins". Specifically it has:

~            for parent_id in parents:
		...
~                for i, j, n in seq.get_matching_blocks():
~                    if n == 0:
~                        continue
~                    # this appears to copy (origin, text) pairs across to the
~                    # new content for any line that matches the last-checked
~                    # parent.
~                    content._lines[j:j+n] = merge_content._lines[i:i+n]

So it does a diff against all parents, and then it replaces the lines in the
output with any matches from that parent.

Now, there are some pretty strong performance characteristics for the various
possibilities.

Current "always create a new node", aka tip-wins gives:
$ time ~/dev/bzr/annotate_stuff/bzr annotate --show-ids sql/mysqld.cc >/dev/null
time_in_reannotate_annotated: 51.760s
Clean: 8363598
empty:   77801
new:      2667
left:   243009
right:  294698
tip:   1385106

real    1m4.447s

Left wins:
time_in_reannotate_annotated: 11.070s
Clean: 9727552
empty:  144766
new:        28
left:   176544
right:  294488
tip:     23501

real    0m23.915s

Right wins:
time_in_reannotate_annotated: 11.300s
Clean: 9731252
empty:  146006
new:        42
left:   176323
right:  293926
tip:     19330

real    0m23.723s

Resolve using heads(), the heads: line is entries that resolved to a single node
after checking ancestry:
time_in_reannotate_annotated: 18.780s
Clean: 9673376
empty:  133082
new:       563
left:   188215
right:  294626
head:    75830
tip:      1187

real    0m32.211s

So it seems that left-wins and right-wins are approximately the same for
performance. heads() adds about 50% overhead. This is using a HeadsCache and a
DictParentsProvider because the annotate code was already building up the list
of parents it needed to start annotating.

Is it worth the performance hit for correctness? I'm not sure, considering
nobody seemed to be bothered by how we annotated with knits, and they used a
"right-wins" policy.
They are all better than "tip-wins", though some of that may be the test case.
Though for repository.py the timing is:
tip-wins:   3.5s
right-wins: 3.2s
heads:      4.4s

So I'm tempted to do "heads()" for correctness, but I'm not sure if we win much
versus left-wins or right-wins. Isn't it more useful to see the log message of
one of the nodes that created that line, rather than the node that merges the
identical texts together?

I suppose we also need to consider the effects on annotate-merge. As
right/left-wins might produce more conflicts than a heads() based solution.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHtJ4oJdeBCYSNAAMRApx4AKDLNvqI5JzAH2+mo7gHKYRl43lw5ACgwN9J
tppi0t+MKq9HLViRjiQY8ng=
=TXQ0
-----END PGP SIGNATURE-----