[MERGE] Refactor diffing

Robert Collins robertc at robertcollins.net
Sat Nov 24 20:31:36 GMT 2007


On Sat, 2007-11-24 at 12:05 -0500, Aaron Bentley wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> > On Fri, 2007-11-23 at 18:19 -0500, Aaron Bentley wrote:
> >> Unfortunately, Dirstate.inventory is also used by DirState.kind and
> >> DirState.get_file_lines.
> > 
> >     def kind(self, file_id,
> >         _kind_map=dirstate.DirState._minikind_to_kind):
> 
> Thanks again.
> 
> Will you work on getting these DirstateRevisionTree improvements merged
> or should I?

I think you should merge them with the new Diff code, now that we have a
concrete win in the trivial cases.

> With these applied, I do show some improvement in the new code.  Best
> times out of 10:
...
> Or 1.7x as fast.  (most times for the new code were more like .6,
> though.  It varied a lot.)
> 
> For this workload (1 modified file, Bazaar-sized tree), an even bigger
> win would be to accelerate get_file_mtime on DirStateRevisionTree (i.e.
> by storing or caching the mtime in the dirstate so we don't have to read
> a revision per file). That revision access is 35% of runtime according
> to lsprof; _iter_changes itself only takes 23%.  unified_diff is 0.69%

is get_file_mtime called on every modified file, or every file? It seems
to me that conceptually it should only be called on every modified file.
I think that DirstaterevisionTree could sanely cache this:
get_file_mtime
   get_entry
   revision = self._get_revision(revid)
   ...

and _get_revision builds up a dict which is cleared when the last
read_lock is cleared, or perhaps even a LRUCache, though knowing how
much is 'reasonable' to cache is tricky.

I suggest this rather than extending dirstate to have a cached mtime
field because few operations will need the mtime (stats doesn't for
instance), and it will probably be highly duplicated across all files,
so we'll increase the IO we do for many cases. I'm not against it in
principle though - it might be worth a quick hack, untested, but
profiled, to see how it impacts status etc on e.g. mozilla trees, and
what it does for diff. Another possible place to cache it is in the
revision index, if we think its common enough.

> I also ran tests on a tree with 249 modified files (best times out of 3):

...

> For this workload, 60.0% of time is spent retrieving files from knits.
> This suggests that we need
> 1. to get an accelerated implementation of iter_files_bytes for packs
> 2. to use such implementation to retrieve the texts and diff them in the
> order they arrive.

In pathological cases we won't be able to do ideal IO without buffering
the entire tree in memory (e.g. revtree to revtree, 50% old files 50%
new, no overlap), or staging on disk. So I think we should balance
things here, or even consider hinting to iter_file_bytes that we need
pairs of texts (in the two history trees case, or in the merge case too
now I think of it). Another possible approach is to ask for batches of
files (we know their size so can avoid holding 10 iso's at once) etc.

Were you profiling with a pack repo or a knit repo?

> For this workload, get_file_mtime cost 12.1%, due to retrieving 250
> revisions.
> 
> _iter_changes is 2.17%.  unified_diff is 4.07%.  Those constitute the
> "real work" that diff does, but they aren't even close to being limiting
> factors.

Yup.

> Also, I noticed that dirstate uses "xxxxxxxx..." for the last-revision
> of the working tree.  Something ending with a colon would probably be
> better, because afaik, "xxxxx..." is actually a valid revision-id.  I
> think a great choice would be "current:", which is a reserved id and
> actually means "the current state of the working tree".

We never read that field, so '' is possibly better, however yes, I agree
that current: might be clearer for manual inspection.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071125/d106c220/attachment.pgp 


More information about the bazaar mailing list