[MERGE] Faster code for pushing from packs back into knits

Aaron Bentley aaron.bentley at utoronto.ca
Thu Nov 22 19:30:19 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Aaron Bentley wrote:
>> John Arbash Meinel wrote:
>
>>> +        # TODO: jam 20071116 It would be nice to have a streaming interface to
>>> +        #       get multiple texts from a source.

>> Actually, we have that: Repository.iter_files_bytes.

> Thanks for the pointer. I still want to avoid anything which ends up
> extracting and holding 100 copies of a file in memory before putting it
> into the target.

That's nothing to do with the interface, just the knit implementation.
The packs implementation ought to be fearsomely efficient.
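
For what it's worth, here is roughly how I would expect the fetch loop to
drive it.  This is only a sketch from memory: I do not remember off-hand
whether it yields the bytes directly or an iterator of chunks, and
get_target_versionedfile(), text_keys and parent_map are made-up names
standing in for whatever your pack code already has:

    from bzrlib import osutils

    # Ask for many texts in one call; the repository is free to return
    # them in whatever order is cheapest for it.
    desired_files = [(file_id, revision_id, (file_id, revision_id))
                     for file_id, revision_id in text_keys]
    # Assuming iter_files_bytes yields (identifier, bytes) pairs.
    for identifier, text in source_repo.iter_files_bytes(desired_files):
        file_id, revision_id = identifier
        # Insert each fulltext as it arrives, instead of accumulating
        # 100 extracted copies in memory first.
        vf = get_target_versionedfile(file_id)
        vf.add_lines(revision_id, parent_map[revision_id],
                     osutils.split_lines(text))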

>> Also, using mpdiffs (VF.make_mpdiffs, VF.install_mpdiffs) may be more
>> efficient than fulltexts, because in the single-parent case, you don't
>> have to do sequence matching.

> Certainly something to consider. I have to wonder how they solve putting the
> right annotated values in. Is that an explicit part of an mpdiff? (Each line
> must have the revision_id associated with it?)

No, the underlying code converts the mpdiff into a fulltext, then adds
it, supplying the precomputed lefthand-parent matching-blocks output.
(the matching-blocks output is implicit in the mpdiff format)
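
Loosely sketched (install_one is just an illustrative name, not the real
method, and I am writing the hunk attributes from memory, so check them
against bzrlib.multiparent before relying on this), the idea is:

    from bzrlib import multiparent

    def install_one(target_vf, version_id, parent_ids, mpdiff, parent_lines):
        # Reconstruct the fulltext from the diff plus the parents' lines.
        lines = mpdiff.to_lines(parent_lines)
        # Hunks copied from parent 0 double as the left-parent matching
        # blocks, so no sequence matching is needed when adding the text.
        left_matching_blocks = [
            (hunk.parent_pos, hunk.child_pos, hunk.num_lines)
            for hunk in mpdiff.hunks
            if isinstance(hunk, multiparent.ParentText) and hunk.parent == 0]
        target_vf.add_lines(version_id, parent_ids, lines,
                            left_matching_blocks=left_matching_blocks)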

> Actually, make_mpdiffs has the same problem of extracting all versions of all
> texts before it figures out the delta. It specifically does:

> Which I believe means it starts with "build up a set of all versions and all
> parents of all those versions". And follows that up with a call to:
> 
>     def _get_lf_split_line_list(self, version_ids):
>         return [StringIO(t).readlines() for t in self.get_texts(version_ids)]
> 
> Which is "extract the full text of all of these, and then put them into a
> StringIO object and split them up again."
> 
> Knit.get_texts is:
>     def get_texts(self, version_ids):
>         return [''.join(l) for l in self.get_line_list(version_ids)]
> 
> Also, we need to be careful with code like that. I believe I've seen cases
> where StringIO(t).readlines() splits on '\r'. Which is why
> osutils.split_lines() is written in such a sub-optimal way.

I tested this over the course of implementing bundles.  I would be
surprised if StringIO.readlines() behaved as you suggest, because I
originally used str.splitlines, which did split on '\r'.  That caused
severe problems, which went away when I switched to StringIO.readlines().
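
To make the difference concrete (Python 2; str.splitlines treats '\r' as a
line break, StringIO.readlines does not):

    from StringIO import StringIO

    text = 'embedded\rcarriage return\n'
    text.splitlines(True)        # -> ['embedded\r', 'carriage return\n']
    StringIO(text).readlines()   # -> ['embedded\rcarriage return\n']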

> So while "get_line_list" *might* be smart enough to not duplicate each line
> that is in common. By going through "get_texts()" you *force* it to create a
> new copy of all of those strings.

Yes.  This is because knits cannot be trusted to split lines properly.
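
If they could be trusted, _get_lf_split_line_list could presumably just
return the line lists directly and skip the join/re-split copies, e.g.
(hypothetical, untested):

    def _get_lf_split_line_list(self, version_ids):
        # Hypothetical: once line splitting can be trusted, return the
        # stored line lists as-is instead of joining them into fulltexts
        # and splitting them again, which copies every string.
        return self.get_line_list(version_ids)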

> I haven't traced through it all, but I'm guessing we end up with several
> full-text copies in memory because of the different bits of api friction.

Right.  And your current implementation is fine to go in.  But we should
work on fixing the implementations of these other things, so that we can
ultimately switch to them.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHRdjL0F+nu1YWqI0RAtU1AJ9L1t5xvqcb4CNdq7c6pGnQXF9kTACeK2k8
j4rWpFZ6BORnaSitjB5EW4k=
=eE12
-----END PGP SIGNATURE-----


