Making diff fast (was Re: Some notes on distributed SCM)

Mon Apr 11 00:12:38 BST 2005

On Sunday 10 April 2005 18:07, Aaron Bentley wrote:
> > Any time somebody clones a repository, all the fileid's up to the point
> > of the clone will match exactly.  Past the clone point, things get
> > interesting.  Two identical repositories (i.e., just after a clone) might
> > each pull the same changeset from a third repository.  Everything still
> > matches exactly, but we need fancier bookkeeping to know that.  A
> > slightly improved fileid does the trick:
> >
> >    fileid = (repository-number:file-number)
>
> Personally, I prefer the Arch approach, which is essentially to assign a
> uuid to each file.

Why bloat up the canonical log structure unnecessarily?

>  (Of course, Tom had to go reinvent uuids first) 

Uuids have their place, but not in every log entry, imho.  It is easy to have
a (possibly throwaway) table that translates between fileids and uuids.  This
is a "normalized" database model.

> > where the repository part is also just a counter, which counts all the
> > foreign repositories we have ever pulled from.  These repository numbers
> > are strictly internal.  We map an internal repository number to/from
> > somebody's "public" repository uuid with a table.  This way, we can
> > always establish an exact taxonomy of all objects that anybody ever
> > imported from each other.
>
> With uuids, you get this correspondence automatically.  If the file came
> from the same ultimate source, it's treated the same in every tree that
> contains it.

And if files didn't come from the same source but are the same regardless,
then things start to get messy, and the uuid is just a confusing liability.
The same goes for SHA1, which attempts to work around this, imho.  Not to
mention that both are considerably bulkier than a simple sequence number.
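
To put rough numbers on "bulkier", here is a small Python sketch.  The
fixed-width encoding of a fileid as two unsigned 32-bit counters is purely
my assumption for illustration (as is the helper name pack_fileid); a real
log format could just as well use a variable-length encoding.

```python
import hashlib
import struct
import uuid

# Assumed encoding, for illustration only: a fileid is a
# (repository-number, file-number) pair of unsigned 32-bit counters.
FILEID_FORMAT = "<II"


def pack_fileid(repo_number, file_number):
    """Serialize a fileid pair into its fixed-width binary form."""
    return struct.pack(FILEID_FORMAT, repo_number, file_number)


# Bytes needed per log entry to identify one file under each scheme.
sizes = {
    "fileid": len(pack_fileid(3, 1042)),
    "uuid": len(uuid.uuid4().bytes),
    "sha1": hashlib.sha1(b"example").digest_size,
}
print(sizes)
```

Under that assumption a fileid costs 8 bytes per entry, against 16 for a
uuid and 20 for a SHA1 digest, and that is before any text encoding.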

> > When two sibling repositories each import the same third-party tarball,
> > things get more interesting.  In this case we have to guess a little, but
> > almost all the time, we ought to still be able to come up with an exact
> > correspondence between objects in the two sibling repositories.
>
> In this case, the Arch model requires tables similar to the ones you
> described earlier, to map one uuid to another.

So since the tables are required anyway, let's rely on them and thereby 
normalize and shrink the database a little.
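
A minimal sketch of the kind of table I have in mind, in Python.  The class
and method names (RepoTable, number_for, to_internal, to_public) are
hypothetical, and I am assuming a public file identity can be written as
(repository uuid, file number); nothing here is an existing tool's format.

```python
import uuid


class RepoTable:
    """Maps strictly internal repository numbers to/from public uuids."""

    def __init__(self):
        self._uuid_by_number = []   # list index = internal repository number
        self._number_by_uuid = {}   # reverse lookup

    def number_for(self, repo_uuid):
        """Return the internal number for a repository, assigning the
        next counter value the first time we see it."""
        if repo_uuid not in self._number_by_uuid:
            self._number_by_uuid[repo_uuid] = len(self._uuid_by_number)
            self._uuid_by_number.append(repo_uuid)
        return self._number_by_uuid[repo_uuid]

    def to_internal(self, public_id):
        """(repository uuid, file number) -> compact fileid pair."""
        repo_uuid, file_number = public_id
        return (self.number_for(repo_uuid), file_number)

    def to_public(self, fileid):
        """Compact fileid pair -> (repository uuid, file number)."""
        repo_number, file_number = fileid
        return (self._uuid_by_number[repo_number], file_number)


table = RepoTable()
local, upstream = uuid.uuid4(), uuid.uuid4()
table.number_for(local)                      # our own repository gets 0
fileid = table.to_internal((upstream, 17))   # first foreign pull gets 1
print(fileid, table.to_public(fileid) == (upstream, 17))
```

The log entries only ever carry the small pair; the uuids live once in the
table, which can be rebuilt or thrown away without touching the log.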

Note my schizophrenic position with regard to micro-optimizing the canonical
verlog versus other things.  I freely admit to that; the verlog is the one
that matters.

> As has been done so far in Arch :-)  This problem is only likely to
> occur when multiple people import the same well-known project.
> Canonical is fighting this by providing imports of many well-known
> projects in the baz format (which is Arch-derived).

This situation comes up _all the time_ for me; I don't know about you.  The
busier I get, the more it comes up.

Regards,

Daniel