[RFC] Improving knit extraction speed
John Arbash Meinel
john at arbash-meinel.com
Mon Jan 29 22:33:36 GMT 2007
I've been looking at our knit code, and I think I've come across 2
things that we could be doing better.
The Mozilla source tree has 54,763 lines in its inventory. When we
extract this from a knit, we read a full-text and then apply a series
of deltas to it (almost always 200 of them in the case of the moz tree,
though I was also thinking about bumping that up to 500 for better
storage size).
Anyway, our current "extract full + apply deltas" is inefficient for 2
reasons:
1) It caches every intermediate step, because it may be extracting
multiple targets at the same time. This means that it has to create a
copy of the list for each intermediate value.
And I checked, this really is quite slow:
$ python -m timeit \
    -s "lines = open('HEAD/.bzr/checkout/inventory').readlines()" \
    "lines = lines[:]"
100 loops, best of 3: 11.1 msec per loop
So copying a ~55,000-entry list takes about 10ms. And if you have 100
of those copies to make (the average if a new full-text is stored every
200 deltas), that is 1s just to extract a single file text. This is not
time spent applying deltas, or interpreting the texts, etc. It is
*just* the time spent creating a new copy of the list, because we don't
know whether we might still need the old one.
2) It applies the deltas as we go, which means repeatedly modifying an
inefficient list representation, rather than building up what needs to
be changed and then modifying it in a single pass. Python list objects
are similar to std::vector, in that they store everything in a linear
array rather than as a linked list (std::list).
If I'm reading the code correctly, we are actually doing *another* copy
of the lines as part of _apply_delta (lines = list(lines) creates a copy).
Not to mention the fact that our code for applying a delta does:

for start, end, count, delta_lines in delta:
    lines[offset+start:offset+end] = delta_lines
    offset = offset + (start - end) + count

which means that for each hunk we have to shift (and possibly
re-allocate) everything in the list after that hunk.
If, on the other hand, we had already sorted out what lines needed to
come from where, we could do something more like:
new_lines = []
cur = 0
for start, end, count, delta_lines in delta:
    new_lines.extend(orig_lines[cur:start])  # untouched lines before this hunk
    new_lines.extend(delta_lines)            # the hunk's replacement lines
    cur = end                                # skip past the replaced region
new_lines.extend(orig_lines[cur:])           # untouched lines after the last hunk
More importantly, though, we could use a more efficient data structure
for tracking what needs to be applied, build that up over all 200
entries, and then build up the final text 1 time, rather than having to
work with a 55k-line list the whole time.
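To sketch the kind of thing I mean (made-up names, not the real knit
code): fold the whole chain of line-deltas into a flat list of
instructions, where each instruction is either "copy this slice of the
basis full-text" or "insert these literal lines", and only build a real
55k-line list once at the very end.

def _chunk_len(chunk):
    # A chunk is either ('copy', start, end), referring to a slice of the
    # basis full-text, or ('insert', lines) holding literal lines.
    if chunk[0] == 'copy':
        return chunk[2] - chunk[1]
    return len(chunk[1])

def _split_chunk(chunk, n):
    # Split a chunk into its first n lines and whatever is left over.
    if chunk[0] == 'copy':
        _, s, e = chunk
        return ('copy', s, s + n), ('copy', s + n, e)
    lines = chunk[1]
    return ('insert', lines[:n]), ('insert', lines[n:])

def compose_delta(chunks, delta):
    # Fold one line-delta into the chunk list.  `chunks` describes the
    # current intermediate version; `delta` is the usual list of
    # (start, end, count, lines) hunks in that version's coordinates
    # (sorted and non-overlapping, as they come out of the matcher).
    chunks = list(chunks)   # we consume (and split) from the front
    new_chunks = []
    pos = 0                 # offset of chunks[i] in the current version
    i = 0
    for start, end, count, lines in delta:  # count == len(lines), unused here
        # Keep everything up to the start of this hunk.
        while pos < start:
            length = _chunk_len(chunks[i])
            if pos + length <= start:
                new_chunks.append(chunks[i])
                pos += length
                i += 1
            else:
                head, tail = _split_chunk(chunks[i], start - pos)
                new_chunks.append(head)
                chunks[i] = tail
                pos = start
        # The hunk's replacement lines.
        new_chunks.append(('insert', lines))
        # Skip the region [start, end) that the hunk replaced.
        while pos < end:
            length = _chunk_len(chunks[i])
            if pos + length <= end:
                pos += length
                i += 1
            else:
                chunks[i] = _split_chunk(chunks[i], end - pos)[1]
                pos = end
    new_chunks.extend(chunks[i:])   # everything after the last hunk
    return new_chunks

def materialize(chunks, basis_lines):
    # The single pass that actually builds the final 55k-line list.
    out = []
    for chunk in chunks:
        if chunk[0] == 'copy':
            out.extend(basis_lines[chunk[1]:chunk[2]])
        else:
            out.extend(chunk[1])
    return out

# Given basis_lines (the full-text) and deltas (the ~200 line-deltas in
# the chain), the whole extraction would look like:
chunks = [('copy', 0, len(basis_lines))]
for delta in deltas:
    chunks = compose_delta(chunks, delta)
final_lines = materialize(chunks, basis_lines)

With ~200 deltas of a few hundred lines each, the instruction list stays
small, so the only time we touch a 55k-entry list is the single extend
pass at the end.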
3) There is a small 3rd issue, which is that the current code reads all
of the records up front and keeps them in memory. That means that if
get_lines_list() is handling multiple texts, it may end up holding
several full-texts in memory at the same time.
I think we can do a lot better by implementing a lazy-ish intelligent
reader, along the lines of how a lot of readv() calls are implemented:
return everything in the requested order, but cache things that you are
going to need later if you come across them early.
You might be able to do even better by collapsing some of the ranges,
but then you get back into the tradeoff of how much you want cached in
memory before you start returning data and expunging it from the cache.
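Here is a toy sketch of the kind of reader I mean (the names are made
up, and it assumes each record is requested only once and does
eventually turn up in the underlying read):

def iter_in_requested_order(requested_keys, raw_records):
    # raw_records can yield (key, record) pairs in whatever order is
    # cheapest to read off disk; anything that turns up before its turn
    # gets parked in a small cache and is dropped again as soon as it
    # has been yielded.
    cache = {}
    raw_records = iter(raw_records)
    for key in requested_keys:
        while key not in cache:
            read_key, record = next(raw_records)
            cache[read_key] = record
        yield key, cache.pop(key)

The policy question is exactly the trade-off above: how much you are
willing to park in that cache before you start expunging things.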
Things are complicated by the fact that _get_content_maps() (and
get_lines_list() as the front-end) can be asked to return multiple
revisions at the same time (for performance reasons). And when building
up multiple revisions, you don't really want to have to re-build the
same revision multiple times just because it was used in multiple
paths. So my idea is to take the requested revisions and the graph of
nodes you have to use to build them up, and then determine which are
the "important" nodes in that graph (the ones that are used more than
once). Then you start at the bottom, and build up each of these nodes
until you have created all of the final content.
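To make that concrete, here is roughly what I mean by finding the
"important" nodes (get_build_chain() is a stand-in for however we end
up walking the knit index, not real code):

def find_important_nodes(requested_versions, get_build_chain):
    # get_build_chain(version) should return the versions needed to
    # reconstruct `version`: the full-text it bottoms out on, then each
    # delta on the way up.
    use_count = {}
    for version in requested_versions:
        for node in get_build_chain(version):
            use_count[node] = use_count.get(node, 0) + 1
    # Nodes that more than one requested text needs are worth building
    # once and keeping; everything else can be thrown away as soon as
    # the text built on top of it is done.
    return set(node for node in use_count if use_count[node] > 1)

Then the build itself is just a bottom-up walk: construct each node
from its parent, cache it only if it is in the important set (or is one
of the requested texts), and drop everything else as soon as it has
been consumed.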
In the future, I think there is 1 more optimization we should explore:
being able to build up the texts while ignoring the annotations. For
"iter_lines_added_or_present_in_versions()" I found that calling
KnitContent.get_fulltext_content() was quite a bit faster than
'parse_fulltext()', because it didn't have to worry about the
annotations. Specifically, for something like the inventory all of the
annotations are artificial, so not creating artificial data was a
measurable savings.
_get_content_maps() doesn't have that luxury, because it is serving a
dual purpose. One is to get the actual text, and the other is to get the
annotated text.
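To make the annotation overhead a bit more concrete: as I understand
the annotated layout, each stored line carries its origin in front of
the text, so the difference is roughly the following (an illustration
only, not the real factory code):

def parse_annotated_fulltext(stored_lines):
    # Builds the annotated form: one (origin, text) tuple per line.
    return [tuple(line.split(' ', 1)) for line in stored_lines]

def extract_plain_fulltext(stored_lines):
    # Only pulls the text back out; no tuples to build and keep around.
    return [line.split(' ', 1)[1] for line in stored_lines]

For 54,763 inventory lines, not building 54,763 tuples of artificial
annotations is presumably where the savings comes from.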
I've written up a draft of how I think it should work, and I'm working
on the graph algorithm to let us avoid copying so many lines.
I was wondering if anyone had a good idea about how to do patch
composition, so that we don't have to repeatedly modify a large array.
I was thinking about possibly applying-as-we-go, but transforming the
text (slowly) into a linked list whose nodes are short lists. That may
provide the best trade-off between large-list manipulation and
random-access time. (You really don't want to iterate through 50,000
nodes to get to node 50,001 and modify it.)
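Roughly the kind of structure I'm picturing (a pure sketch, with
made-up names and an arbitrary chunk size). Instead of an actual linked
list, even a plain list of short chunks gets most of the benefit: the
outer list only has ~110 entries for a 55k-line inventory, so walking
it is cheap, and a hunk only forces us to rebuild the couple of chunks
it actually touches:

class ChunkedLines(object):

    def __init__(self, lines, chunk_size=500):
        self._chunks = [lines[i:i + chunk_size]
                        for i in range(0, len(lines), chunk_size)] or [[]]

    def replace(self, start, end, new_lines):
        # Replace lines [start:end) with new_lines, rebuilding only the
        # chunks that the range touches.
        pos = 0
        first = first_off = last = last_off = None
        for idx, chunk in enumerate(self._chunks):
            if first is None and start <= pos + len(chunk):
                first, first_off = idx, start - pos
            if end <= pos + len(chunk):
                last, last_off = idx, end - pos
                break
            pos += len(chunk)
        replacement = (self._chunks[first][:first_off] + list(new_lines)
                       + self._chunks[last][last_off:])
        self._chunks[first:last + 1] = [replacement]

    def lines(self):
        # The one full-size list we ever build.
        out = []
        for chunk in self._chunks:
            out.extend(chunk)
        return out

A real version would want to re-split chunks that grow too big, and
could keep cumulative offsets so finding a position doesn't even need
the walk, but that is the basic trade-off I have in mind.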
I was also wondering if we could store some sort of "graph"
representation for the patches, so that you could apply everything in 1
pass.
Thoughts?
John
=:->