Making diff fast (was Re: Some notes on distributed SCM)
phillips at istop.com
Sun Apr 10 09:30:02 BST 2005
On Sunday 10 April 2005 02:22, Benno wrote:
> Currently this operation is quite slow:
> i30pc60:/Users/benno/coding/bzr_linux_test% time ~/coding/bzr/revert_branch/bzr status
> ~/coding/bzr/revert_branch/bzr status  19.20s user 9.86s system 56% cpu 51.855 total
> This is running MacOS X on a 1.33GHz G4 Powerbook with 768MB of RAM. Using
> cElementTree (although that is largely irrelevant in this case).
> Looking at the current source, you end up reading every file in the
> working directory and computing a SHA1 of it.
> Some ways to improve this performance come to mind.
> 1/ Caching the working tree stat data, and then being able
> to simply stat each file and compare the stat information.
> Pros: Portable, simple to use.
> Cons: Still requires a full search of the tree, which is slow.
Actually, it isn't slow at all. First do:
   tar c . >/dev/null
   time du .
So your solution 1, caching the stat, looks like a very good idea to me.
(Later, we will fix Linux to do the tar >/dev/null for us automatically).
Specifically, compare each file's mtime to the (persistently) cached mtime
(don't futz with earlier vs later!) then, if it is different, compute the
sha1 to verify the file really changed. This is robust, though obviously you
can fool it if you really try. Just in case, we should be able to force the
full sha1 tree traversal.
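A minimal sketch of that scheme in Python (the cache file name and record layout are hypothetical, not bzr's actual format):

```python
import hashlib
import os


def sha1_of(path):
    """Hash a file's contents incrementally."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(paths, cache):
    """Yield paths whose content really changed: compare mtime for
    equality only (no earlier-vs-later ordering), and fall back to a
    sha1 check when the mtime differs from the cached value."""
    for path in paths:
        st = os.stat(path)
        entry = cache.get(path)
        if entry is not None and entry["mtime"] == st.st_mtime:
            continue  # mtime matches the cache: assume unchanged
        digest = sha1_of(path)
        if entry is None or entry["sha1"] != digest:
            yield path  # content really changed (or file is new)
        cache[path] = {"mtime": st.st_mtime, "sha1": digest}
```

On a warm page cache the common case is one stat() per file and no reads at all; sha1 work only happens for files whose mtime moved.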
Martin, the sha1 as a means of looking up file versions quickly is a _cache_
entity, not a canonical data structure. The canonical data structure should
be Matt's logs, and almost everything else is throwaway/rebuildable. This
has huge benefits:
- By throwing away all the cached accelerators (sha1's, full file texts, full
directory lists, indexes of all flavors) you can shrink your cache with
one wave of the wand
- In order to _be sure_ we will never lose any data, we only need to audit
the log files
- Append-only for the rev data is fundamentally beautiful. For example,
we can trivially implement a "revert repository to any given point in
time" just by including a change number in every log entry, and keeping
a log of (changenumber:timedate) pairs.
- Probably other nice things...
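To picture the "revert to any point in time" idea: every log entry carries a change number, and a side log of (changenumber:timedate) pairs lets you find the cutoff for any timestamp. A toy sketch (class and method names are mine, not from any real tool):

```python
import bisect
import time


class ChangeLog:
    """Append-only log: entries are never rewritten, each gets a
    monotonically increasing change number, and a side log maps
    change numbers to timestamps."""

    def __init__(self):
        self.entries = []    # (changenumber, payload), append-only
        self.timeline = []   # (changenumber, timestamp) pairs
        self.next_change = 1

    def append(self, payload, now=None):
        n = self.next_change
        self.next_change += 1
        self.entries.append((n, payload))
        self.timeline.append((n, time.time() if now is None else now))
        return n

    def state_at(self, when):
        """Replay only the entries logged at or before `when`."""
        # binary-search the timestamp column of the side log
        i = bisect.bisect_right([t for _, t in self.timeline], when)
        cutoff = self.timeline[i - 1][0] if i else 0
        return [p for n, p in self.entries if n <= cutoff]
```

Reverting the repository is then just "replay entries up to change number N"; nothing is ever destroyed.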
Matt is right. Doing it all with append-only logs is clearly the best
approach. A fileid should just be a sequentially assigned number, not a sha1.
As Matt points out, a fileid is like an inode. You log every attribute that
you'd find in an inode - file data, permissions, mtime, extended attributes -
to the fileid's verlog, and you log the (filename, fileid) pair to the
parent directory's verlog.
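The inode analogy in code form, as I understand it (a rough sketch with invented names, not Matt's actual on-disk format):

```python
class Repository:
    """Sketch of sequential fileids: a fileid is just a counter, and
    every inode-like attribute (data, permissions, mtime, ...) is
    appended to that fileid's per-file log, while (name, fileid)
    pairs go to the containing directory's log."""

    def __init__(self):
        self.next_fileid = 1
        self.verlogs = {}   # fileid -> list of attribute records
        self.dirlogs = {}   # directory fileid -> list of (name, fileid)

    def create_file(self, parent_dir, name, data, mode=0o644):
        fileid = self.next_fileid   # sequentially assigned, not a sha1
        self.next_fileid += 1
        self.verlogs[fileid] = [{"data": data, "mode": mode}]
        self.dirlogs.setdefault(parent_dir, []).append((name, fileid))
        return fileid

    def log_change(self, fileid, **attrs):
        # append-only: a new record is added, old ones never rewritten
        self.verlogs[fileid].append(attrs)
```

Renames fall out naturally: the fileid is stable, so only the directory log changes.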
OK, now I'll jump straight into a nontrivial subject: foreign repositories.
Any time somebody clones a repository, all the fileid's up to the point of the
clone will match exactly. Past the clone point, things get interesting. Two
identical repositories (i.e., just after a clone) might each pull the same
changeset from a third repository. Everything still matches exactly, but we
need fancier bookkeeping to know that. A slightly improved fileid does the
trick:

   fileid = (repository-number:file-number)
where the repository part is also just a counter, which counts all the foreign
repositories we have ever pulled from. These repository numbers are strictly
internal. We map an internal repository number to/from somebody's "public"
repository uuid with a table. This way, we can always establish an exact
taxonomy of all objects that anybody ever imported from each other.
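A small sketch of that mapping table (class and method names are hypothetical):

```python
class FileidNamespace:
    """Sketch of the (repository-number : file-number) fileid scheme.
    Repository numbers are strictly internal counters; a table maps
    them to and from public repository uuids."""

    def __init__(self, own_uuid):
        self.repo_num = {}    # public uuid -> internal number
        self.repo_uuid = {}   # internal number -> public uuid
        self.next_repo = 0
        self.intern_repo(own_uuid)   # repository 0 is ourselves

    def intern_repo(self, public_uuid):
        """Assign the next internal number the first time we ever
        pull from a given repository."""
        if public_uuid not in self.repo_num:
            n = self.next_repo
            self.next_repo += 1
            self.repo_num[public_uuid] = n
            self.repo_uuid[n] = public_uuid
        return self.repo_num[public_uuid]

    def make_fileid(self, public_uuid, file_number):
        return (self.intern_repo(public_uuid), file_number)

    def public_fileid(self, fileid):
        """Translate back for exchange with other repositories."""
        repo, file_number = fileid
        return (self.repo_uuid[repo], file_number)
```

Two repositories that pulled the same changeset can then prove their objects correspond by translating through the public uuids.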
When two sibling repositories each import the same third-party tarball, things
get more interesting. In this case we have to guess a little, but almost all
the time, we ought to still be able to come up with an exact correspondence
between objects in the two sibling repositories. We can leave the details
of that matching for later.
More information about the bazaar mailing list