Making diff fast (was Re: Some notes on distributed SCM)

Mon Apr 11 00:41:01 BST 2005

On Sun, 2005-04-10 at 04:30 -0400, Daniel Phillips wrote:

> Martin, the sha1 as a means of looking up file versions quickly is a _cache_ 
> entity, not a canonical data structure.  The canonical data structure should 
> be Matt's logs, and almost everything else is throwaway/rebuildable.  This 
> has huge benefits:
> 
>   - By thowing away all the cached accelerators (sha1's, full file texts, full
>     directory lists, indexes of all flavors) you can shrink your cache with
>     one wave of the wand
> 
>   - In order to _be sure_ we will never lose any data, we only need to audit
>     the log files
> 
>   - Append-only for the rev data is fundamentally beautiful.  For example,
>     we can trivially implement a "revert repository to any given point in
>     time" just by including a change number in every log entry, and keeping
>     a log of (changenumber:timedate) pairs.
> 
>   - Probably other nice things...
> 
> Matt is right.  Doing it all with append-only logs is clearly the best 
> approach.  A fileid should just be a sequentially assigned number, not a sha1  
> As Matt points out, a fileid is like an inode.  You log every attribute that 
> you'd find in an inode - file data, permissions, mtime, extended attributes - 
> to the fileid's verlog, and you log the (filename, fileid) pair to the 
> directory's verlog.

Suppose you want to find out just a summary of the changes between two
trees -- the change in their shape, if you will.  This approach means
reading the log file of every file in both trees.  That just seems
wrong.

> OK, now I'll jump straight into a nontrivial subject: foreign repository 
> object correspondence.
> 
> Any time somebody clones a repository, all the fileid's up to the point of the 
> clone will match exactly.  Past the clone point, things get interesting.  Two 
> identical repositories (i.e., just after a clone) might each pull the same 
> changeset from a third repository.  Everything still matches exactly, but we 
> need fancier bookkeeping to know that.  A slightly improved fileid does the 
> trick:
> 
>    fileid = (repository-number:file-number)
> 
> where the repository part is also just a counter, which counts all the foreign 
> repositories we have ever pulled from.  These repository numbers are strictly 
> internal.  We map an internal repository number to/from somebody's "public" 
> repository uuid with a table.  This way, we can always establish an exact 
> taxonomy of all objects that anybody ever imported from each other.

I don't see a good reason to need to translate file ids when moving them
between repositories, when you can so easily just assign
universally-unique names and be done.  It certainly makes signing much
harder.

-- 
Martin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20050411/d63ebb93/attachment.pgp