large files and storage formats

Eric Siegerman lists08-bzr at davor.org
Fri Jul 9 15:40:26 BST 2010


On Thu, 2010-07-08 at 23:19 -0700, Chad Dombrova wrote:
> where i work we have a few users who are responsible for
> placing many large binary files (many greater than 500MB) under
> version control.  then this main repo is shared perhaps
> hundreds of times by other users who need to utilize -- in a
> read-only fashion -- the data therein.  each of these
> hundreds of shared repos could potentially check out a
> different revision into their working copy.  with a normal dvcs
> that means a LOT of data checked out, and a lot of time spent
> checking it out, but for the shared repos, all of the disk
> space and time spent copying from repo to working copy is a
> waste because the owners of these repos only need read-only
> access to the data.  
> 
> so my thought was to create a loose object store as a first
> pass; it's fairly generic and others might find it useful. a
> second, optional layer would add a set of tools to provide fast
> read-only access to objects:  it would remove writability from
> blobs as they entered the store, so that they can be safely
> hard-linked or symbolically linked into working copies.  in
> this way, i can provide access to thousands of enormous files,
> hundreds of times over, with zero redundancy in the stores or
> the working copies.

I can't comment on your bzr-internals questions, but taking a step
back...

You seem to be stating a few requirements:
  1. Avoid downloading *repository data* that is already available
     locally, or that isn't needed locally in the first place

  2. Avoid downloading *working-copy files* that are already
     available locally; likewise, avoid regenerating them from
     repository data

  3. Don't blow up (or, though you didn't say this, page-thrash)
     due to needing too much memory

Is that an accurate summation?


It's important not to conflate (1) and (2); they're quite
distinct.

Requirement (2), avoiding redundant copies or regenerations of
working files, is interesting and would be useful, but it seems
to me to be orthogonal to (1).  That is, I don't see how bzr's
internal storage format would affect working-tree management
either way; I'd expect them to be at different layers entirely.

Actually, I could see implementing this as a fairly
straightforward script (python, shell, etc.); a rough sketch
follows the list:
  - maintain a central, globally readable store of checked-out
    revisions (whether those are created via lightweight
    checkouts (option (a) below), branches within a shared repo
    (option (b) below) -- or, indeed, by something like rsync --
    is irrelevant to the current requirement)

  - when a user says "I need to use rev. X", symlink from that
    rev's directory within your central store into the user's
    own workspace
    
  - optionally, provide some mechanism to garbage-collect revs
    from the central store that are no longer in use (i.e. only
    their working trees; *not* the repository data).
    Alternatively, if you have the disk space, just let the
    revision store grow without bound, and prune it manually once
    in a while.
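To make that concrete, here is a minimal sketch of such a script
in python.  Everything in it -- the store path, the branch URL,
and the use of "bzr checkout --lightweight" to populate the store
-- is an assumption for illustration, not a prescription:

    #!/usr/bin/env python
    import os
    import subprocess

    STORE = "/srv/bzr-rev-store"        # hypothetical central store
    BRANCH = "bzr+ssh://server/main"    # hypothetical main branch

    def ensure_rev(revid):
        """Populate the store with `revid` if it isn't there yet."""
        target = os.path.join(STORE, revid)
        if not os.path.isdir(target):
            # A lightweight checkout keeps no repository data in
            # the store; rsync or `bzr export` would do as well.
            subprocess.check_call(
                ["bzr", "checkout", "--lightweight",
                 "--revision", revid, BRANCH, target])
        return target

    def use_rev(revid, workspace):
        """Symlink `revid` from the store into a user's workspace."""
        link = os.path.join(workspace, revid)
        if not os.path.lexists(link):
            os.symlink(ensure_rev(revid), link)
        return link

Garbage collection then reduces to deleting store directories
that no recorded workspace still points at.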


As for memory issues (requirement 3), replacing or tuning the 2a
storage format won't help unless all of the memory problems are
in the 2a code, and JAM (John Arbash Meinel) seems to say this
isn't the case.


That leaves requirement (1), avoiding redundant downloads of
repository data.

Bazaar already offers a few possible solutions to this.  Have you
considered:
  a. Having your read-only users use *lightweight* checkouts
     instead of branches (or heavyweight checkouts, which imply
     branches).  For a lightweight checkout, bzr only needs to
     locally store a small amount of bookkeeping metadata, not
     the actual (in your case, huge) repository data for each
     revision.  Thus, the repository data is never stored
     locally.  (Commands for (a) and (b) are sketched just after
     this list.)

  b. Setting up a local shared repository, which your users all
     share.  This results in at most one copy of each revision's
     repository data per host; *not* one copy per user.

  c. Perhaps using stacked branches?  Not sure; all I know about
     these is what I've read on this list -- including that
     they've had problems in the past -- but they might be worth
     looking into.
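For reference, (a) and (b) are both plain bzr invocations; driven
from python, a hypothetical setup (URLs and paths invented) might
look like:

    import subprocess

    def bzr(*args):
        subprocess.check_call(("bzr",) + args)

    # (a) lightweight checkout: only the working files plus a
    #     little bookkeeping land on disk, no repository data.
    bzr("checkout", "--lightweight",
        "bzr+ssh://server/main", "/home/alice/work")

    # (b) one shared repository per host; every branch created
    #     under it stores each revision's data at most once.
    bzr("init-repository", "/srv/shared-repo")
    bzr("branch", "bzr+ssh://server/main", "/srv/shared-repo/main")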

(b) offers its biggest win if many of your read-only users share
machines with each other (you don't say this is the case, but
it's implied by requirement (2)).  If everybody's on their own
workstation, though, (b) clearly offers little savings.

Even if people do in fact share machines, there's a tradeoff
between:
  - the cost of the one-time quasi-redundant download of each
    revision's repository data, and
  - the savings if, some time later, another user on the same
    host wants the same revision, whose repository data is by
    then already cached locally

So the details of your usage pattern will determine whether (a)
or (b) is optimal.
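One back-of-the-envelope way to see it is as download cost per
"give me revision X" request; the numbers below are invented and
the model deliberately crude (it ignores compression and local
disk I/O):

    def cost_per_request(rev_size_mb, p_already_on_host):
        """Expected MB downloaded to satisfy one request.

        (a) lightweight checkout: working files come from the
            server every time.
        (b) shared repo: the download is skipped whenever the
            revision is already in the host's shared repository.
        """
        cost_a = rev_size_mb
        cost_b = (1.0 - p_already_on_host) * rev_size_mb
        return cost_a, cost_b

    # e.g. 500MB revisions, 40% chance a colleague got there first:
    print(cost_per_request(500, 0.4))   # (500, 300.0)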

All that said, if you're dead-set on a loose-object store, you
might be able to use git's directly, instead of engineering a
git-like one for bzr.  That is, store the data in a real git
repo, but access that repo via bzr and the appropriate plugin(s)
(bzr-git, presumably).  I don't know what issues that might
present, so you'd have to ask others.  This might be, for you,
the best of both worlds -- the git storage format, but not the
git code.
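For anyone unfamiliar with the format: a git "loose object" is
just the content behind a tiny header, zlib-compressed, stored
under objects/ at a path derived from its SHA-1.  A minimal
illustration in python (omitting git's write-to-temp-and-rename
discipline):

    import hashlib
    import os
    import zlib

    def write_loose_blob(git_dir, data):
        """Store `data` as a git loose blob; return its object id."""
        payload = b"blob %d\x00" % len(data) + data
        oid = hashlib.sha1(payload).hexdigest()
        objdir = os.path.join(git_dir, "objects", oid[:2])
        path = os.path.join(objdir, oid[2:])
        if not os.path.exists(path):
            if not os.path.isdir(objdir):
                os.makedirs(objdir)
            with open(path, "wb") as f:
                f.write(zlib.compress(payload))
            os.chmod(path, 0o444)   # read-only, as in the proposal
        return oid

The read-only chmod mirrors the write-protection step in Chad's
proposal, so blobs can safely be hard-linked into working copies.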


In summary, I can't help thinking that, in terms of the
requirements as I understand them (and again, please correct me
if I've *mis*understood them), digging into storage-format work
seems to be attacking the wrong problem.

  - Eric





