[MERGE] Refactor diffing

Aaron Bentley aaron.bentley at utoronto.ca
Sat Nov 24 17:05:12 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Fri, 2007-11-23 at 18:19 -0500, Aaron Bentley wrote:
>> Unfortunately, Dirstate.inventory is also used by DirState.kind and
>> DirState.get_file_lines.
> 
>     def kind(self, file_id,
>         _kind_map=dirstate.DirState._minikind_to_kind):

Thanks again.

Will you work on getting these DirstateRevisionTree improvements merged
or should I?

With these applied, I do show some improvement in the new code.  Best
times out of 10:

Old:
real    0m0.794s
user    0m0.644s
sys     0m0.116s

New:
real    0m0.477s
user    0m0.368s
sys     0m0.100s

Or 1.7x as fast.  (most times for the new code were more like .6,
though.  It varied a lot.)

For this workload (1 modified file, Bazaar-sized tree), an even bigger
win would be to accelerate get_file_mtime on DirStateRevisionTree (i.e.
by storing or caching the mtime in the dirstate so we don't have to read
a revision per file). That revision access is 35% of runtime according
to lsprof; _iter_changes itself only takes 23%.  unified_diff is 0.69%

I also ran tests on a tree with 249 modified files (best times out of 3):

Old:
real    0m2.804s
user    0m2.552s
sys     0m0.140s

New:
real    0m2.652s
user    0m2.420s
sys     0m0.172s

Or 1.05x as fast.

For this workload, 60.0% of time is spent retrieving files from knits.
This suggests that we need
1. to get an accelerated implementation of iter_files_bytes for packs
2. to use such implementation to retrieve the texts and diff them in the
order they arrive.

For this workload, get_file_mtime cost 12.1%, due to retrieving 250
revisions.

_iter_changes is 2.17%.  unified_diff is 4.07%.  Those constitute the
"real work" that diff does, but they aren't even close to being limiting
factors.

Also, I noticed that dirstate uses "xxxxxxxx..." for the last-revision
of the working tree.  Something ending with a colon would probably be
better, because afaik, "xxxxx..." is actually a valid revision-id.  I
think a great choice would be "current:", which is a reserved id and
actually means "the current state of the working tree".

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSFnI0F+nu1YWqI0RArBxAJ4gA5J1MScOvttNHtfoK1Z5Z36SjACePqc+
+gfcBgA5aYlqFvYPApW6ARQ=
=CzVE
-----END PGP SIGNATURE-----



More information about the bazaar mailing list