revfile developments

Sun Apr 10 05:12:09 BST 2005

On Sun, Apr 10, 2005 at 09:55:01AM +1000, Martin Pool wrote:
> Hi Matt,
> 
> Thanks for letting me use your revfile code.
> 
> Here are the changes I made:
> 
> http://bazaar-ng.org/bzr/bzr.revfile/bzrlib/mdiff.py
> http://bazaar-ng.org/bzr/bzr.revfile/bzrlib/revfile.py
> 
> At first I tried doing a byte-by-byte diff, but that turns out to be too
> slow, as you probably know.  I fixed a bug in the linesplit()
> function.  

Yeah, found that too. I did a checkout from bkcvs of the ~500 Makefile revs
and checked them in and discovered a few things. Checking a revision into
revfile is about 100 times faster than checking a revision out of CVS.
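
As a rough illustration of why a line-based delta is the practical choice
here, the sketch below splits texts into lines (keeping terminators) and
diffs the resulting lists. It is not the mdiff.py code: the standard
library's difflib stands in for the real matcher, and the names
line_delta and apply_delta, along with the (start, end, lines) delta
shape, are assumptions made for the example.

```python
import difflib

def linesplit(text):
    # Keep line terminators so that joining the pieces restores the input
    # exactly, including a final line with no trailing newline.
    return text.splitlines(keepends=True)

def line_delta(old, new):
    # Hypothetical delta format: (start, end, replacement_lines), meaning
    # "replace old lines [start:end) with replacement_lines".
    a, b = linesplit(old), linesplit(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((i1, i2, b[j1:j2]))
    return ops

def apply_delta(old, ops):
    # Copy unchanged runs from the old text, splice in the replacements.
    a = linesplit(old)
    out, pos = [], 0
    for start, end, lines in ops:
        out.extend(a[pos:start])
        out.extend(lines)
        pos = end
    out.extend(a[pos:])
    return b"".join(out)

old = b"all:\n\tcc -o foo foo.c\nclean:\n\trm -f foo\n"
new = b"all:\n\tcc -O2 -o foo foo.c\nclean:\n\trm -f foo\ninstall:\n"
delta = line_delta(old, new)
restored = apply_delta(old, delta)
```

Matching whole lines keeps the sequences short (a few hundred elements
for a Makefile rather than tens of thousands of bytes), which is where
the speed difference over a byte-by-byte diff comes from.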

> There are two small optimizations to avoid storing a diff or avoid doing
> gzip if they wouldn't win.

I was planning to replace factor with something that basically ensures
that the data needed to reconstruct a rev is never more than say 2x
the length of the original file.
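
The two ideas above (skip the delta or the compression when they do not
pay off, and cap the data needed to rebuild a revision) can be sketched
as follows. This is only a guess at the shape of such checks, not the
revfile code: zlib stands in for gzip, and the names choose_encoding and
should_store_fulltext, the chain_bytes parameter and the default limit
of 2.0 are all assumptions taken from the "say 2x" figure.

```python
import zlib

def choose_encoding(payload):
    # Store the compressed form only if it is actually smaller.
    packed = zlib.compress(payload)
    if len(packed) < len(payload):
        return "z", packed
    return "raw", payload

def should_store_fulltext(text_len, delta_len, chain_bytes, limit=2.0):
    # chain_bytes: bytes that must already be read to rebuild the base
    # revision (its full text plus any deltas on top of it).
    if delta_len >= text_len:
        # The delta is no smaller than the text itself: no win.
        return True
    # Start a fresh full text once the reconstruction cost would pass
    # limit times the size of the text being stored.
    return chain_bytes + delta_len > limit * text_len

kind_big, _ = choose_encoding(b"a" * 1000)
kind_small, _ = choose_encoding(b"x")
```

With this rule a long run of small deltas is periodically interrupted by
a full text, so reading any revision touches a bounded amount of data.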

> I think it's important to be able to have branching within the storage
> of a single file, so I added that.

Ok, I'll look at that. I don't think it's necessary though.

> Although I index by SHA-1, I don't make the mistake of Monotone of
> assuming that two objects with the same content are the same thing.
> There is a higher-level inventory and revision object that just uses
> revfile as a content-addressable store.  I need to use something more
> than just an integer to identify revisions because it's too hard to keep
> simple integers in sync in a distributed system.  A nice side effect is
> that we can easily check we're getting out the text we meant to put in.

Haven't convinced myself of what's needed here. See my notes I just
sent out. My hope is to have everything but branch-ids and
changeset-ids be local.
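
For reference, the layering described in the quoted paragraph (texts
indexed by SHA-1, with a separate higher-level record giving each
revision its own identity) might look roughly like this. The class
ContentStore, the revisions dict and the revision id strings are
hypothetical names for the example, not the bzr or revfile API.

```python
import hashlib

class ContentStore:
    """Texts keyed by the SHA-1 of their content (in-memory sketch)."""

    def __init__(self):
        self._texts = {}

    def add(self, text):
        key = hashlib.sha1(text).hexdigest()
        # Identical content is stored once, whoever adds it.
        self._texts.setdefault(key, text)
        return key

    def get(self, key):
        text = self._texts[key]
        # Side effect of hashing: verify we got back what we put in.
        if hashlib.sha1(text).hexdigest() != key:
            raise ValueError("stored text does not match its SHA-1")
        return text

store = ContentStore()

# Higher-level records: two distinct revisions may share one text hash
# without being treated as the same object.
revisions = {
    "rev-a": {"text_sha1": store.add(b"all:\n\tcc foo.c\n")},
    "rev-b": {"text_sha1": store.add(b"all:\n\tcc foo.c\n")},
}
```

Here "rev-a" and "rev-b" remain separate revisions even though they
resolve to the same stored text.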

Handling renames is a little annoying and suggests needing UUIDs per
file (but not per revision), but that might be dealt with by simply
having each changeset point to (or be) the toplevel directory
revision, which recursively references all the changes.

> Each delta has a "base" pointer saying which previous text it's stored
> relative to.  The base pointer doesn't have any meaning to the revision
> control layer; it's just for delta compression.  This could be
> manipulated to do some kind of skip-deltas to avoid ever needing to
> store the full text, but I don't do that for now.

Yeah, there are a bunch of things that can be done here.
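
To make the "base" pointer concrete, here is a small sketch of rebuilding
a text by walking base pointers back to a full text and then replaying
the deltas forward. The entry layout (base, payload) and the byte-range
delta format are invented for the example and are not the revfile
on-disk format.

```python
def apply_delta(text, delta):
    # Hypothetical delta: list of (start, end, data), meaning
    # "replace text[start:end] with data", sorted by start.
    out, pos = [], 0
    for start, end, data in delta:
        out.append(text[pos:start])
        out.append(data)
        pos = end
    out.append(text[pos:])
    return b"".join(out)

def reconstruct(entries, rev):
    # entries[i] is (base, payload): base is None for a stored full text,
    # otherwise the index of the entry the delta is relative to.
    chain = []
    while entries[rev][0] is not None:
        chain.append(rev)
        rev = entries[rev][0]
    text = entries[rev][1]
    for r in reversed(chain):
        text = apply_delta(text, entries[r][1])
    return text

entries = [
    (None, b"a\nb\n"),          # 0: full text
    (0, [(2, 4, b"B\n")]),      # 1: delta against 0
    (0, [(4, 4, b"c\n")]),      # 2: also against 0 (a branch)
    (1, [(0, 2, b"A\n")]),      # 3: delta against 1
]
texts = [reconstruct(entries, i) for i in range(len(entries))]
```

Because the base is just an index, entries 1 and 2 can both hang off
entry 0, and a skip-delta scheme would amount to choosing a base further
back than the immediate predecessor.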

-- 
Mathematics is the supreme nostalgia of our time.