Making diff fast (was Re: Some notes on distributed SCM)
phillips at istop.com
Sun Apr 10 09:30:02 BST 2005
On Sunday 10 April 2005 02:22, Benno wrote:
> Currently this operation is quite slow:
> i30pc60:/Users/benno/coding/bzr_linux_test% time ~/coding/bzr/revert_branch/bzr status
> ~/coding/bzr/revert_branch/bzr status  19.20s user 9.86s system 56% cpu 51.855 total
> This is running MacOS X on a 1.33GHz G4 Powerbook with 768MB of RAM. Using
> cElementTree (although that is largely irrelevant in this case).
> Looking at the current source, you end up reading every file in the
> working directory and computing a SHA1 of it.
> Some ways to improve this performance come to mind.
> 1/ Caching the working tree stat data, and then being able
> to simply stat each file and compare the stat information.
> Pros: Portable, simple to use.
> Cons: Still requires a full search of the tree, which is slow.
Actually, it isn't slow at all. First do:
   tar c . >/dev/null
   time du .
So your solution 1, caching the stat, looks like a very good idea to me.
(Later, we will fix Linux to do the tar >/dev/null for us automatically).
Specifically, compare each file's mtime to the (persistently) cached mtime
(don't futz with earlier vs later!) then, if it is different, compute the
sha1 to verify the file really changed. This is robust, though obviously you
can fool it if you really try. Just in case, we should be able to force the
full sha1 tree traversal.
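A minimal sketch of that scheme in Python (the cache file name and record layout are hypothetical, not bzr's actual format):

```python
import hashlib
import os


def sha1_of(path):
    """Hash a file's contents incrementally."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(paths, cache):
    """Yield paths whose content really changed: compare mtime for
    equality only (no earlier-vs-later ordering), and fall back to a
    sha1 check when the mtime differs from the cached value."""
    for path in paths:
        st = os.stat(path)
        entry = cache.get(path)
        if entry is not None and entry["mtime"] == st.st_mtime:
            continue  # mtime matches the cache: assume unchanged
        digest = sha1_of(path)
        if entry is None or entry["sha1"] != digest:
            yield path  # content really changed (or file is new)
        cache[path] = {"mtime": st.st_mtime, "sha1": digest}
```

On a warm page cache the common case is one stat() per file and no reads at all; sha1 work only happens for files whose mtime moved.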
Martin, the sha1 as a means of looking up file versions quickly is a _cache_
entity, not a canonical data structure. The canonical data structure should
be Matt's logs, and almost everything else is throwaway/rebuildable. This
has huge benefits:
- By throwing away all the cached accelerators (sha1's, full file texts, full
directory lists, indexes of all flavors) you can shrink your cache with
one wave of the wand
- In order to _be sure_ we will never lose any data, we only need to audit
the log files
- Append-only for the rev data is fundamentally beautiful. For example,
we can trivially implement a "revert repository to any given point in
time" just by including a change number in every log entry, and keeping
a log of (changenumber:timedate) pairs.
- Probably other nice things...
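To picture the "revert to any point in time" idea: every log entry carries a change number, and a side log of (changenumber:timedate) pairs lets you find the cutoff for any timestamp. A toy sketch (class and method names are mine, not from any real tool):

```python
import bisect
import time


class ChangeLog:
    """Append-only log: entries are never rewritten, each gets a
    monotonically increasing change number, and a side log maps
    change numbers to timestamps."""

    def __init__(self):
        self.entries = []    # (changenumber, payload), append-only
        self.timeline = []   # (changenumber, timestamp) pairs
        self.next_change = 1

    def append(self, payload, now=None):
        n = self.next_change
        self.next_change += 1
        self.entries.append((n, payload))
        self.timeline.append((n, time.time() if now is None else now))
        return n

    def state_at(self, when):
        """Replay only the entries logged at or before `when`."""
        # binary-search the timestamp column of the side log
        i = bisect.bisect_right([t for _, t in self.timeline], when)
        cutoff = self.timeline[i - 1][0] if i else 0
        return [p for n, p in self.entries if n <= cutoff]
```

Reverting the repository is then just "replay entries up to change number N"; nothing is ever destroyed.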
Matt is right. Doing it all with append-only logs is clearly the best
approach. A fileid should just be a sequentially assigned number, not a sha1.
As Matt points out, a fileid is like an inode. You log every attribute that
you'd find in an inode - file data, permissions, mtime, extended attributes -
to the fileid's verlog, and you log the (filename, fileid) pair to the
parent directory's verlog.
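The inode analogy in code form, as I understand it (a rough sketch with invented names, not Matt's actual on-disk format):

```python
class Repository:
    """Sketch of sequential fileids: a fileid is just a counter, and
    every inode-like attribute (data, permissions, mtime, ...) is
    appended to that fileid's per-file log, while (name, fileid)
    pairs go to the containing directory's log."""

    def __init__(self):
        self.next_fileid = 1
        self.verlogs = {}   # fileid -> list of attribute records
        self.dirlogs = {}   # directory fileid -> list of (name, fileid)

    def create_file(self, parent_dir, name, data, mode=0o644):
        fileid = self.next_fileid   # sequentially assigned, not a sha1
        self.next_fileid += 1
        self.verlogs[fileid] = [{"data": data, "mode": mode}]
        self.dirlogs.setdefault(parent_dir, []).append((name, fileid))
        return fileid

    def log_change(self, fileid, **attrs):
        # append-only: a new record is added, old ones never rewritten
        self.verlogs[fileid].append(attrs)
```

Renames fall out naturally: the fileid is stable, so only the directory log changes.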
OK, now I'll jump straight into a nontrivial subject: foreign repositories.
Any time somebody clones a repository, all the fileid's up to the point of the
clone will match exactly. Past the clone point, things get interesting. Two
identical repositories (i.e., just after a clone) might each pull the same
changeset from a third repository. Everything still matches exactly, but we
need fancier bookkeeping to know that. A slightly improved fileid does the
trick:

   fileid = (repository-number:file-number)
where the repository part is also just a counter, which counts all the foreign
repositories we have ever pulled from. These repository numbers are strictly
internal. We map an internal repository number to/from somebody's "public"
repository uuid with a table. This way, we can always establish an exact
taxonomy of all objects that anybody ever imported from each other.
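A small sketch of that mapping table (class and method names are hypothetical):

```python
class FileidNamespace:
    """Sketch of the (repository-number : file-number) fileid scheme.
    Repository numbers are strictly internal counters; a table maps
    them to and from public repository uuids."""

    def __init__(self, own_uuid):
        self.repo_num = {}    # public uuid -> internal number
        self.repo_uuid = {}   # internal number -> public uuid
        self.next_repo = 0
        self.intern_repo(own_uuid)   # repository 0 is ourselves

    def intern_repo(self, public_uuid):
        """Assign the next internal number the first time we ever
        pull from a given repository."""
        if public_uuid not in self.repo_num:
            n = self.next_repo
            self.next_repo += 1
            self.repo_num[public_uuid] = n
            self.repo_uuid[n] = public_uuid
        return self.repo_num[public_uuid]

    def make_fileid(self, public_uuid, file_number):
        return (self.intern_repo(public_uuid), file_number)

    def public_fileid(self, fileid):
        """Translate back for exchange with other repositories."""
        repo, file_number = fileid
        return (self.repo_uuid[repo], file_number)
```

Two repositories that pulled the same changeset can then prove their objects correspond by translating through the public uuids.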
When two sibling repositories each import the same third-party tarball, things
get more interesting. In this case we have to guess a little, but almost all
the time, we ought to still be able to come up with an exact correspondence
between objects in the two sibling repositories. We can leave the details
of that matching for later.
More information about the bazaar mailing list