[MERGE] Refactor diffing

Aaron Bentley aaron.bentley at utoronto.ca
Sun Nov 25 17:53:33 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Sat, 2007-11-24 at 12:05 -0500, Aaron Bentley wrote:
>> Will you work on getting these DirstateRevisionTree improvements merged
>> or should I?
> 
> I think you should merge them with the new Diff code, now that we have a
> concrete win in the trivial cases.

Done.  See attached.  I've also changed the extra_ parameter to be a
list of factories.

>> an even bigger
>> win would be to accelerate get_file_mtime on DirStateRevisionTree (i.e.
>> by storing or caching the mtime in the dirstate so we don't have to read
>> a revision per file). That revision access is 35% of runtime according
>> to lsprof; _iter_changes itself only takes 23%.  unified_diff is 0.69%
> 
> is get_file_mtime called on every modified file, or every file?

Every modified file.  It's used for producing the unified diff.

> It seems
> to me that conceptually it should only be called on every modified file.
> I think that DirstaterevisionTree could sanely cache this:
> get_file_mtime
>    get_entry
>    revision = self._get_revision(revid)
>    ...
> 
> and _get_revision builds up a dict which is cleared when the last
> read_lock is cleared, or perhaps even a LRUCache, though knowing how
> much is 'reasonable' to cache is tricky.

That won't help for this workload, because the revision is retrieved
only once, and that is once too many.  For the 250 modified files case,
it would probably help some, but I'd have to see the specifics.

> I suggest this rather than extending dirstate to have a cached mtime
> field because few operations will need the mtime (stats doesn't for
> instance), and it will probably be highly duplicated across all files,
> so we'll increase the IO we do for many cases. I'm not against it in
> principle though - it might be worth a quick hack, untested, but
> profiled, to see how it impacts status etc on e.g. mozilla trees, and
> what it does for diff. Another possible place to cache it is in the
> revision index, if we think its common enough.

Okay.  I'll file it as "future work" for now.  But note that even the
250 files case, get_file_mtime is still more expensive than iter_changes
and unified_diff together.

>> For this workload, 60.0% of time is spent retrieving files from knits.
>> This suggests that we need
>> 1. to get an accelerated implementation of iter_files_bytes for packs
>> 2. to use such implementation to retrieve the texts and diff them in the
>> order they arrive.
> 
> In pathological cases we won't be able to do ideal IO without buffering
> the entire tree in memory (e.g. revtree to revtree, 50% old files 50%
> new, no overlap), or staging on disk.

True, but for the common case, we have a working tree and a revision
tree, and we can do iter_files_bytes on the revision tree, and get_file
on the working tree.

> So I think we should balance
> things here, or even consider hinting to iter_file_bytes that we need
> pairs of texts (in the two history trees case, or in the merge case too
> now I think of it). Another possible approach is to ask for batches of
> files (we know their size so can avoid holding 10 iso's at once) etc.

These are all things that potentially reduce our IO perfection.  I'm not
sure, but it might be worth just staging on disk.

> Were you profiling with a pack repo or a knit repo?

knit.

>> Also, I noticed that dirstate uses "xxxxxxxx..."

> We never read that field, so '' is possibly better, however yes, I agree
> that current: might be clearer for manual inspection.

We certainly shouldn't read the field, but in case of programming error,
I think "current:" would be clearest.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSbad0F+nu1YWqI0RAqgNAJ9lNB6s7jCUZqiTDPMnSoFhmafIywCeMJ9K
JNmYJDouxNlR5AolICbFkaU=
=NYyh
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: refactor-diff-5.patch
Type: text/x-diff
Size: 53616 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071125/d79e1b91/attachment-0001.bin 


More information about the bazaar mailing list