Overlay repositories

John Arbash Meinel john at arbash-meinel.com
Fri Sep 22 23:53:41 BST 2006


Steve Alexander wrote:
> I want to propose a new feature: Overlay repositories.  That is, a
> repository can point to other repositories, and bzr will use this
> pointer to find the revision data it needs.
> 
> Why do I want such a feature?
> 
> I'm one of the people who develops the Launchpad software.  Our
> development process goes a bit like this.
> 
> We have a PQM that manages a collection of branches called "rocketfuel".
>  I can read all the rocketfuel branches, but not write to them.  One of
> the rocketfuel branches, devel, is the code mainline.
> 
> To work on a feature, I do the following.
> 
> 1. I make my own branch, on my laptop, from rocketfuel's mainline.
> 2. I do some work, merging from mainline from time to time.
> 3. I push my branch onto a server so that PQM and also other
>    developers can see that branch.
> 4. Eventually, I ask PQM to merge my branch into mainline.
> 
> On the server, the PQM branches are in a repository, and all my branches
> are in a repository.  Each developer has their own repository that only
> they can write to.
> 
> This process works well, but there are a couple of problems.
> 
>  - I end up pushing data to the server that already exists on the
> server.  That is, data for revisions I have merged from the PQM managed
> branch, which I then push into my repository.  This makes pushes take
> longer.
> 
>  - There is more data stored on the server than there needs to be.  Say
> we have ten developers.  That means there are ten copies of the PQM
> branch's history on the server when there really need be only one.


Just a couple of brainstorm ideas...

1) HistoryHorizons would give you less total data stored on the
server, because the branches you push to wouldn't have to carry the
complete project history. However, HH currently still tries to fill
in ghosts when it comes across them; it just isn't as greedy as the
current algorithm, which always fills in all ghosts. So you would
still end up pushing more than you need.

2) If 'sftp' supported remote copy, we could copy some bytes from one
remote location to another. Unfortunately, I haven't seen any way to
copy other than downloading one file and writing to another. Does
anyone know of a remote read/write for sftp? You have two file
handles, so it seems like all you need is a command that takes both
handles and tells the server what bytes to read from one file and
write to the other.

This would help you avoid pushing more data than necessary, because
some of it could already be found on the remote machine.
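For illustration, here is roughly the best you can do today (a sketch
using paramiko; the host, login, and paths are all made up): every
byte still has to round-trip through the client machine, which is
exactly what a true remote read/write command would avoid.

import paramiko

def sftp_copy_via_client(sftp, src_path, dst_path, chunk_size=32768):
    # Two remote handles, but the data still flows
    # server -> client -> server, because sftp has no copy command.
    src = sftp.open(src_path, 'rb')
    dst = sftp.open(dst_path, 'wb')
    try:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            dst.write(data)
    finally:
        src.close()
        dst.close()

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('bazaar.example.com', username='steve')  # hypothetical
sftp = client.open_sftp()
sftp_copy_via_client(sftp, 'repos/pqm/revisions.knit',
                     'repos/steve/revisions.knit')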

3) Because of sftp limitations, this may just be best handled by the
smart server. It gets a little tricky, because you have to start
negotiating which revisions need to be copied, and the smart server
needs to know where it might find versions that are not in the
primary storage location. This is far removed from the 0.11 smart
server, but might be something we could eventually support.
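Just to make the negotiation concrete, a tiny sketch (pure
illustration, nothing like the actual 0.11 smart server; the
repository objects and has_revision() calls are assumed) of how a
server that knows about extra locations might plan a fetch:

def plan_fetch(wanted_revision_ids, primary_repo, overlay_repos):
    """Return {revision_id: repository} for everything we can supply."""
    plan = {}
    for rev_id in wanted_revision_ids:
        if primary_repo.has_revision(rev_id):
            plan[rev_id] = primary_repo
            continue
        for overlay in overlay_repos:
            if overlay.has_revision(rev_id):
                plan[rev_id] = overlay
                break
    # Anything missing from the plan the client still has to send.
    return plan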


I do think Aaron has a point: the more history punching we have, the
more likely it becomes that a given repository can't actually
function on its own. Hopefully we would avoid the Arch case, where,
because cachedrevs were a separate concept, it was possible to mirror
an archive and not get any cached revs; your mirror was then
completely useless because it didn't go through enough other archives
to reach one that had the original import.


What about a different concept? More the idea of 'Packed' history,
*kind of* like what darcs and git have.

The thought I have in mind is that you could have a single large file
that contains the information for a bunch of revisions (say 100,
1000, whatever). This file is coupled with an index file, which says
where the contents of each revision can be found. Then we could have
something like:

.bzr/repository/
    packed/
        index
        revs-XXXX
        revs-YYYY
        ...

The index is just a mapping from revision-id to the revs-XXXX file
that contains the info for that revision.
The revs-* files could contain an internal index at the beginning,
and then a bunch of compressed history information.

This lets you split the index information up a bit, so that it scales
reasonably well. You are still O(Nrevisions), but the constant factor
is small. Getting the contents for a specific revision then becomes
reading the global index file, then looking inside the corresponding
revs-XXXX file.
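Here is a rough sketch of that two-step lookup. The on-disk formats
are invented just for illustration: the global index as
'revision-id revs-file' lines, and each revs file starting with a
line giving the size of its internal index, whose entries are
'revision-id offset length' with offsets relative to the end of that
index.

import os

def find_packed_revision(repo_path, revision_id):
    # Step 1: the global index maps revision-id -> revs-XXXX file.
    packed_dir = os.path.join(repo_path, 'packed')
    revs_name = None
    for line in open(os.path.join(packed_dir, 'index')):
        parts = line.split()
        if parts and parts[0] == revision_id:
            revs_name = parts[1]
            break
    if revs_name is None:
        return None              # not packed; fall back to live knits
    # Step 2: the internal index at the start of the revs file says
    # where the compressed bytes for this revision live.
    f = open(os.path.join(packed_dir, revs_name), 'rb')
    try:
        index_size = int(f.readline())
        internal = f.read(index_size).decode('ascii')
        data_start = f.tell()    # offsets are relative to here
        for entry in internal.splitlines():
            rev_id, offset, length = entry.split()
            if rev_id == revision_id:
                f.seek(data_start + int(offset))
                return f.read(int(length))
        return None
    finally:
        f.close()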

The final piece is that 'packed' could be easily replaced by a symlink
to a read-only directory. It becomes a location for long-term archival.
From the outside, it still looks like all of the history is contained
inside the repository.

Then there could be a few commands for handling this data. You need
something to say 'go generate a packed file', and something to say
'for anything in a packed file, remove it from the live knit files'.

One nice thing is that 'go generate' actually maintains our
constraints, since the index can be append-only, and each revs-XXXX
file is self-contained. Just plopping a new file in there is also
transaction safe. The 'remove things from live knit files' step is
not read-only safe, though, so you would have to take the branch
offline to do it. (Possibly as simple as temporarily changing the
.bzr/branch-format file to indicate this branch is offline.)
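And a sketch of the 'go generate a packed file' side, again with
invented names and formats: the new revs file is written completely
before the global index ever mentions it, and the only change to an
existing file is appending lines to the index, which is why the
operation stays safe for concurrent readers.

import os
import zlib

def generate_pack(packed_dir, revisions, pack_name):
    # revisions is an iterable of (revision_id, raw_bytes) pulled
    # from the live knit files.
    entries, blobs, offset = [], [], 0
    for rev_id, data in revisions:
        blob = zlib.compress(data)
        entries.append('%s %d %d\n' % (rev_id, offset, len(blob)))
        blobs.append(blob)
        offset += len(blob)
    internal = ''.join(entries).encode('ascii')
    # Write the self-contained revs file: internal index, then data.
    f = open(os.path.join(packed_dir, pack_name), 'wb')
    try:
        f.write(('%d\n' % len(internal)).encode('ascii'))
        f.write(internal)
        for blob in blobs:
            f.write(blob)
    finally:
        f.close()
    # Only now does the global index learn about the new pack, and
    # only by appending -- existing entries are never rewritten.
    idx = open(os.path.join(packed_dir, 'index'), 'a')
    try:
        for entry in entries:
            idx.write('%s %s\n' % (entry.split()[0], pack_name))
    finally:
        idx.close()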

So Steve, in your case the PQM could maintain a set of old packed
histories that gets updated periodically, and you still get your GC
routine for all the developers' personal branches.

This also has some really good performance effects for getting a new
branch: since the revision information is all packed up into a single
location, there can be a lot fewer round-trips, even over dumb
transports.

Thoughts?

John
=:->
