Making diff fast (was Re: Some notes on distributed SCM)
mbp at sourcefrog.net
Mon Apr 11 00:41:01 BST 2005
On Sun, 2005-04-10 at 04:30 -0400, Daniel Phillips wrote:
> Martin, the sha1 as a means of looking up file versions quickly is a _cache_
> entity, not a canonical data structure. The canonical data structure should
> be Matt's logs, and almost everything else is throwaway/rebuildable. This
> has huge benefits:
> - By thowing away all the cached accelerators (sha1's, full file texts, full
> directory lists, indexes of all flavors) you can shrink your cache with
> one wave of the wand
> - In order to _be sure_ we will never lose any data, we only need to audit
> the log files
> - Append-only for the rev data is fundamentally beautiful. For example,
> we can trivially implement a "revert repository to any given point in
> time" just by including a change number in every log entry, and keeping
> a log of (changenumber:timedate) pairs.
> - Probably other nice things...
> Matt is right. Doing it all with append-only logs is clearly the best
> approach. A fileid should just be a sequentially assigned number, not a sha1
> As Matt points out, a fileid is like an inode. You log every attribute that
> you'd find in an inode - file data, permissions, mtime, extended attributes -
> to the fileid's verlog, and you log the (filename, fileid) pair to the
> directory's verlog.
Suppose you want to find out just a summary of the changes between two
trees -- the change in their shape, if you will. This approach means
reading the log file of every file in both trees. That just seems
> OK, now I'll jump straight into a nontrivial subject: foreign repository
> object correspondence.
> Any time somebody clones a repository, all the fileid's up to the point of the
> clone will match exactly. Past the clone point, things get interesting. Two
> identical repositories (i.e., just after a clone) might each pull the same
> changeset from a third repository. Everything still matches exactly, but we
> need fancier bookkeeping to know that. A slightly improved fileid does the
> fileid = (repository-number:file-number)
> where the repository part is also just a counter, which counts all the foreign
> repositories we have ever pulled from. These repository numbers are strictly
> internal. We map an internal repository number to/from somebody's "public"
> repository uuid with a table. This way, we can always establish an exact
> taxonomy of all objects that anybody ever imported from each other.
I don't see a good reason to need to translate file ids when moving them
between repositories, when you can so easily just assign
universally-unique names and be done. It certainly makes signing much
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20050411/d63ebb93/attachment.pgp
More information about the bazaar