Making diff fast (was Re: Some notes on distributed SCM)

Daniel Phillips phillips at
Mon Apr 11 01:18:17 BST 2005

On Sunday 10 April 2005 19:41, Martin Pool wrote:
> On Sun, 2005-04-10 at 04:30 -0400, Daniel Phillips wrote:
> > Martin, the sha1 as a means of looking up file versions quickly is a
> > _cache_ entity, not a canonical data structure.  The canonical data
> > structure should be Matt's logs, and almost everything else is
> > throwaway/rebuildable.  This has huge benefits:
> >
> >   - By throwing away all the cached accelerators (sha1's, full file texts,
> > full directory lists, indexes of all flavors) you can shrink your cache
> > with one wave of the wand
> >
> >   - In order to _be sure_ we will never lose any data, we only need to
> > audit the log files
> >
> >   - Append-only for the rev data is fundamentally beautiful.  For
> > example, we can trivially implement a "revert repository to any given
> > point in time" just by including a change number in every log entry, and
> > keeping a log of (changenumber:timedate) pairs.
> >
> >   - Probably other nice things...
> >
> > Matt is right.  Doing it all with append-only logs is clearly the best
> > approach.  A fileid should just be a sequentially assigned number, not a
> > sha1.  As Matt points out, a fileid is like an inode.  You log every
> > attribute that you'd find in an inode - file data, permissions, mtime,
> > extended attributes - to the fileid's verlog, and you log the (filename,
> > fileid) pair to the directory's verlog.
> Suppose you want to find out just a summary of the changes between two
> trees -- the change in their shape, if you will.  This approach means
> reading the log file of every file in both trees.  That just seems
> wrong.

So you don't do that.  What you do is, you maintain your manifest and indexes 
etc. (sorry if I got the terminology wrong on my first attempt) as _persistent
cache_ objects that accelerate this operation.  Any time you want, you can
throw away the manifest and rebuild it from the verlogs.  You would seldom 
need to do that, but you could if you had to.
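
To make that concrete, here is a rough sketch in Python of append-only verlogs
with a throwaway manifest.  It is only an illustration under my own
assumptions, not how any existing tool lays out its data: the names Repo,
dir_log, file_logs, change_times, rebuild_manifest, tree_at and change_at_time
are all hypothetical, the logs are in-memory lists rather than files, and
deletes and renames are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    # Canonical data: append-only logs (all names here are hypothetical).
    dir_log: list = field(default_factory=list)       # (change, filename, fileid)
    file_logs: dict = field(default_factory=dict)     # fileid -> [(change, data)]
    change_times: list = field(default_factory=list)  # (change, timestamp)
    # Cache: can be discarded and rebuilt from dir_log at any time.
    manifest: dict = field(default_factory=dict)      # filename -> fileid
    next_fileid: int = 0

    def commit(self, timestamp, files):
        """Append one change; files maps filename -> new contents."""
        change = len(self.change_times)
        self.change_times.append((change, timestamp))
        for name, data in files.items():
            fileid = self.manifest.get(name)
            if fileid is None:
                # New file: assign the next sequential fileid and log the
                # (filename, fileid) pair to the directory's log.
                fileid = self.next_fileid
                self.next_fileid += 1
                self.file_logs[fileid] = []
                self.dir_log.append((change, name, fileid))
                self.manifest[name] = fileid
            self.file_logs[fileid].append((change, data))
        return change

    def rebuild_manifest(self, upto=None):
        """Recompute filename -> fileid from the directory log alone."""
        manifest = {}
        for change, name, fileid in self.dir_log:
            if upto is not None and change > upto:
                break  # dir_log is appended in change order
            manifest[name] = fileid
        return manifest

    def tree_at(self, change):
        """Reconstruct filename -> contents as of a change number."""
        tree = {}
        for name, fileid in self.rebuild_manifest(change).items():
            versions = [d for c, d in self.file_logs[fileid] if c <= change]
            if versions:
                tree[name] = versions[-1]
        return tree

    def change_at_time(self, timestamp):
        """Return the last change made at or before timestamp, or None."""
        found = None
        for change, when in self.change_times:
            if when <= timestamp:
                found = change
        return found
```

With something like this, setting repo.manifest = {} and then
repo.manifest = repo.rebuild_manifest() loses nothing, and
repo.tree_at(repo.change_at_time(t)) gives the "repository as of time t" view
from the (changenumber:timedate) log.  A real implementation would want the
manifest to answer tree-shape queries directly instead of replaying logs each
time, which is the whole point of keeping it as a cache.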

> I don't see a good reason to need to translate file ids when moving them
> between repositories, when you can so easily just assign
> universally-unique names and be done.  It certainly makes signing much
> harder.

I jumped _way_ too far ahead.  Let's stay focussed on the simple stuff for 


