Making diff fast (was Re: Some notes on distributed SCM)

Benno benjl at cse.unsw.edu.au
Sun Apr 10 07:22:21 BST 2005


On Sat Apr 09, 2005 at 21:28:22 -0700, Matt Mackall wrote:
>(Martin and Daniel, this is slightly updated)
>  commit                 O(changed files)
>  revert                 O(changed files)
>  diff                   O(changed files)
   status                 O(changed files)

Just for completenesss.

I think performing the above operations quickly is quite important.
Currently this operation is quite slow:

i30pc60:/Users/benno/coding/bzr_linux_test% time ~/coding/bzr/revert_branch/bzr status
~/coding/bzr/revert_branch/bzr status  19.20s user 9.86s system 56% cpu 51.855 total

This is running MacOSX on a G4 1.33MHz Powerbook with 768Mb of ram. Using
cElementTree (although that is largely irrelevant in this case).

Looking at the current source, you end up having to read every file in 
working directory and performing a SHA1 on it.

Some ways to improve this performance come to mind.

1/ Caching the working tree stat data, and then being able
to simply stat each file and compare the stat information.

Pros: Portable, simple to use.
Cons: Still requies a full search of the tree which is slow.

2/ Using dnotify or similar to have a notification of when files
are changed, so that you explicitly know this information without a
directory search.

Pros: Have available an exact set of files which are changed without
doing a directory scan.
Cons: Not so portable.

3/ Create the working tree read only, and provide explicit
commands eg: bzr edit, to modify a file.

Pros: portable, exact list of files changed.
Cons: makes it harder for the user. Requires special tools for e.g: patch.

I don't think any of the above are clear general winners. I thnk I could
handle the restrictions of 3/ if it gives me good performance, however for
small trees where this doesn't matter it could get annoying.

I propose being able to have multiple different workding directory formats
so the user can make the choice, probable with 1/ as the sane default.

This means having some file indicating the working directory. I think
it might make sense to have a .bzr_wd, to store data about the working
direcotry and leave .bzr as things for the repository itself. So .bzr_wd
would store for example a stat cache for 1/ or a list of modified files
in 3/, also you would probably move the "inventory" file from .bzr to
.bzr_wd.

Any comments? I have a 12 hour plane flight coming up where I plan to start
implementing some of this just to see if it would work.


Cheers,

Benno




More information about the bazaar mailing list