[MERGE] Faster code for pushing from packs back into knits

Tue Nov 20 22:18:35 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> As I mentioned to Robert earlier the code we have for packs => knits
>> rebuids the complete history of every text that was modified. Very
>> painful. The attached code just changes it to add them text by text.
>> It isn't 100% optimal, because it extracts full texts one by one. However
>> I'm trying to be more sensitive towards memory consumption. (I think a lot
>> of our fetch() code buffers far too much in memory.)
> 
> bb:comment
> 
> This is very much a tradeoff.  Now that we have an LRU, we ought to be
> able to make text extraction less memory-intensive, at the cost of
> occasionally needing to regenerate old texts.
> 
> Have you looked at the performance impact on non-pack Knits or between
> packs?  It looks like this patch might reduce performance significantly
> there.

The code path isn't used.

This is only used for the path of "we have unannotated data going into an
annotated store."

> 
>>    def _text_by_text_join(self, pb, msg, version_ids,
> ignore_missing=False):
> 
> I don't find this name very intuitive.  I think _copy_texts would be
> much better.

I'm happy enough to change it. The function itself is in "join()" and it is the
alternative to using VersionedFile.join() so leaving it ...join() was meant for
consistency with the functions it is replacing.

> 
>> +        # TODO: jam 20071116 It would be nice to have a streaming interface to
>> +        #       get multiple texts from a source. The source could be smarter
>> +        #       about how it handled intermediate stages.
>> +        # TODO: jam 20071116 Consider using 'get_line_list' instead of lots of
>> +        #       calls to get_lines()
> 
> These TODO lines look contradictory at first.  Could they maybe be unified?
> 
> Also, using mpdiffs (VF.make_mpdiffs, VF.install_mpdiffs) may be more
> efficient than fulltexts, because in the single-parent case, you don't
> have to do sequence matching.
> 
> Aaron

Certainly something to consider. I have to wonder how they solve putting the
right annotated values in. Is that an explicit part of an mpdiff? (Each line
must have the revision_id associated with it?)

Actually, make_mpdiffs has the same problem of extracting all versions of all
texts before it figures out the delta. It specifically does:

        for version_id in version_ids:
            knit_versions.add(version_id)
            knit_versions.update(self.get_parents(version_id))
        lines = dict(zip(knit_versions,
            self._get_lf_split_line_list(knit_versions)))

Which I believe means it starts with "build up a set of all versions and all
parents of all those versions". And follows that up with a call to:

    def _get_lf_split_line_list(self, version_ids):
        return [StringIO(t).readlines() for t in self.get_texts(version_ids)]

Which is "extract the full text of all of these, and then put them into a
StringIO object and split them up again."

Knit.get_texts is:
    def get_texts(self, version_ids):
        return [''.join(l) for l in self.get_line_list(version_ids)]

Also, we need to be careful with code like that. I believe I've seen cases
where StringIO(t).readlines() splits on '\r'. Which is why
osutils.split_lines() is written in such a sub-optimal way.

So while "get_line_list" *might* be smart enough to not duplicate each line
that is in common. By going through "get_texts()" you *force* it to create a
new copy of all of those strings.

I haven't traced through it all, but I'm guessing we end up with several
full-text copies in memory because of the different bits of api friction.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHQ107JdeBCYSNAAMRAinHAKCa1P2qvRy+s5VK9vjptMiCQZRcbgCgpaJS
jUW6Fjp1T1Ox13uj8qyUL/U=
=+dys
-----END PGP SIGNATURE-----