Making diff fast (was Re: Some notes on distributed SCM)

Matt Mackall mpm at
Sun Apr 10 08:25:09 BST 2005

On Sun, Apr 10, 2005 at 04:22:21PM +1000, Benno wrote:
> On Sat Apr 09, 2005 at 21:28:22 -0700, Matt Mackall wrote:
> >(Martin and Daniel, this is slightly updated)
> >  commit                 O(changed files)
> >  revert                 O(changed files)
> >  diff                   O(changed files)
>    status                 O(changed files)
> Just for completenesss.

An excellent point.
> I think performing the above operations quickly is quite important.
> Currently this operation is quite slow:
> i30pc60:/Users/benno/coding/bzr_linux_test% time ~/coding/bzr/revert_branch/bzr status
> ~/coding/bzr/revert_branch/bzr status  19.20s user 9.86s system 56% cpu 51.855 total
> This is running MacOSX on a G4 1.33MHz Powerbook with 768Mb of ram. Using
> cElementTree (although that is largely irrelevant in this case).
> Looking at the current source, you end up having to read every file in 
> working directory and performing a SHA1 on it.
> Some ways to improve this performance come to mind.
> 1/ Caching the working tree stat data, and then being able
> to simply stat each file and compare the stat information.
> Pros: Portable, simple to use.
> Cons: Still requies a full search of the tree which is slow.

Unless the working set fits in cache, in which case it's quite fast
again. Even so, for 'time find | wc' on the Linux kernel tree

cold cache: 4.5s
hot cache: .045s.
> 2/ Using dnotify or similar to have a notification of when files
> are changed, so that you explicitly know this information without a
> directory search.
> Pros: Have available an exact set of files which are changed without
> doing a directory scan.
> Cons: Not so portable.

And requires having a daemon.

> 3/ Create the working tree read only, and provide explicit
> commands eg: bzr edit, to modify a file.
> Pros: portable, exact list of files changed.
> Cons: makes it harder for the user. Requires special tools for e.g: patch.
> I don't think any of the above are clear general winners. I thnk I could
> handle the restrictions of 3/ if it gives me good performance, however for
> small trees where this doesn't matter it could get annoying.

I think we should aim for 1 but bear 3 in mind as we go.

Mathematics is the supreme nostalgia of our time.

More information about the bazaar mailing list