[RFC] Improving knit extraction speed

Robert Collins robertc at robertcollins.net
Thu Feb 1 11:59:34 GMT 2007


On Mon, 2007-01-29 at 16:33 -0600, John Arbash Meinel wrote:
> 
> I was also wondering if we could store some sort of "graph"
> representation for the patches, so that you could apply everything in
> 1
> pass. 

so we have a sequence of operations that look like:

RANGE, RANGE, NEWVALUE

with a fulltext being
0,0 RANGE NEWVALUE

We want to accumulate the data needed to output the final text, without
having to manipulate a lot of vectors on the way through.

that means we want to be able to do:
gather patches
output = [None] * length-of-final
loop-on-patches assigning to output.

but what if we invert that?

when we read a patch we dont know its final range, because it can be
offset by later patches. If we read the later patches first, we could
know that.

So, if we read patches 0..N as N..0, we can build a mapping:

[sketch follows, paging in what I did for baz annotate here]

now, imagine we have a mapping list of ranges and offsets:
START, LENGTH, OFFSET
sorted by start, we can use bisect + linear walking to process a new
patch N into this list, and into the output table at the same time.
START is the starting point in output of this fragment in historical
coordinates.
LENGTH is the length of the fragment
OFFSET is the offset at the current point in history to adjust a range
to the output coordinates.
That is, START + OFFSET is where the content starts in output.

That is, to process a patch:
OLDFROM, OLDLENGTH, NEWFROM, NEWLENGTH, CONTENT
(the old files start offset, and length for this patch, and the new
offset and length. Note that the new offset will always match the old
one for a single change to a file - and it only varies by the number of
inserts/deletes done by earlier hunks of the same logical patch.)

we have two steps to perform - insert content into output:
1) update the output map with the new content:
find the first START, LENGTH, OFFSET in mapping <= OLDFROM
 (we use oldfrom, because we keep the mapping up to date, and thus every
patch is like the first patch we are seeing).
subtract START from the next mapping entries START to obtain the maximum
number of lines under consideration, MAXIMUM.
split the patch into two at MAXIMUM - if MAXIMUM is >= NEWLENGTH, no
split is required.
subtract length from NEWLENGTH, to obtain the number of lines of this
patch that are live in the output, LIVELINES
copy from CONTENT[NEWFROM + OFFSET] for LIVELINES into output.
add LIVELINES to LENGTH in the mapping.
if LENGTH == MAXIMUM, combine this mapping entry with the next one.
2) update the offset map
walk the mapping list from this point to the end,
NEWLENGTH - OLDLENGTH is the coordinate difference to record - 
subtract that from START in every mapping, and add it to offset.

rinse and repeat until we hit a fulltext. 
It would be nice if we knew the final file line count at the start, we
could preallocate output, and stop as soon as the mapping file has a
single hunk describing all of output.

improvements to be made: having to update all the offsets is a N^2 worst
case algorithm, where N is the number of lines in the file. In the best
case its N - where there is a single mapping entry the whole time. In
the worst case - where there is only 1 line filled in with each patch,
and they don't coalesce at all until the last linecount/2 patches - its
linecount/2 + linecount -2 + linecount -4 + linecount -6 .. linecount
-linecount => linecount^2. 

Its conceptually possible to ameliorate this by having a fancy
incremental-update-of-offsets algorithm, using a generation counter -
this would one only update low-in-the-file mapping offsets when a
low-in-the-file patch was recieved. That is, bisect into the mapping.
Grab the generation counter from that mapping entry, look it up in a
dict of generation counters, which tells you which entry you need to
start at to update, and update from there to the current mapping entry.
I'm not sure of the algorithmic cost when doing this.

This was very fast at generating full texts from unidiff hunks inside
baz.

-Rob


-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070201/0cbd852c/attachment.pgp 


More information about the bazaar mailing list