[MERGE] Refactor diffing

Sun Nov 25 20:45:31 GMT 2007

On Sun, 2007-11-25 at 12:53 -0500, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> > On Sat, 2007-11-24 at 12:05 -0500, Aaron Bentley wrote:
> >> Will you work on getting these DirstateRevisionTree improvements merged
> >> or should I?
> > 
> > I think you should merge them with the new Diff code, now that we have a
> > concrete win in the trivial cases.
> 
> Done.  See attached.  I've also changed the extra_ parameter to be a
> list of factories.

I'm really happy with this code. The only remaining thing I think could
be improved would be for the factory signature to take DiffTree rather
than the individual trees. Then DiffKindChange can share the signature
and there's no need to special case it.

If this doesn't make sense to you, it's fine with me to leave it as it
is.

> > It seems
> > to me that conceptually it should only be called on every modified file.
> > I think that DirstaterevisionTree could sanely cache this:
> > get_file_mtime
> >    get_entry
> >    revision = self._get_revision(revid)
> >    ...
> > 
> > and _get_revision builds up a dict which is cleared when the last
> > read_lock is cleared, or perhaps even a LRUCache, though knowing how
> > much is 'reasonable' to cache is tricky.
> 
> That won't help for this workload, because the revision is retrieved
> only once, and that is once too many.  For the 250 modified files case,
> it would probably help some, but I'd have to see the specifics.

I hope that you'll find packs are cheaper at getting a single revision
because we don't parse the entire revision index.

> > I suggest this rather than extending dirstate to have a cached mtime
> > field because few operations will need the mtime (stats doesn't for
> > instance), and it will probably be highly duplicated across all files,
> > so we'll increase the IO we do for many cases. I'm not against it in
> > principle though - it might be worth a quick hack, untested, but
> > profiled, to see how it impacts status etc on e.g. mozilla trees, and
> > what it does for diff. Another possible place to cache it is in the
> > revision index, if we think its common enough.
> 
> Okay.  I'll file it as "future work" for now.  But note that even the
> 250 files case, get_file_mtime is still more expensive than iter_changes
> and unified_diff together.

Ok, so this is something we need to make substantially cheaper :). It
may be on packs FWIW.

> >> For this workload, 60.0% of time is spent retrieving files from knits.
> >> This suggests that we need
> >> 1. to get an accelerated implementation of iter_files_bytes for packs
> >> 2. to use such implementation to retrieve the texts and diff them in the
> >> order they arrive.
> > 
> > In pathological cases we won't be able to do ideal IO without buffering
> > the entire tree in memory (e.g. revtree to revtree, 50% old files 50%
> > new, no overlap), or staging on disk.
> 
> True, but for the common case, we have a working tree and a revision
> tree, and we can do iter_files_bytes on the revision tree, and get_file
> on the working tree.

Yup. As long as we don't break the uncommon case I'm entirely for
optimising the common case.

> > So I think we should balance
> > things here, or even consider hinting to iter_file_bytes that we need
> > pairs of texts (in the two history trees case, or in the merge case too
> > now I think of it). Another possible approach is to ask for batches of
> > files (we know their size so can avoid holding 10 iso's at once) etc.
> 
> These are all things that potentially reduce our IO perfection.  I'm not
> sure, but it might be worth just staging on disk.

Be worth testing I guess. With the next generation or two of packs, as
we experiment with different delta styles, I expect extraction of texts
to get much better, particularly for recent versions of files, because
we should always have a recent fulltext.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071126/a3d29392/attachment.pgp