large files and storage formats
Eric Siegerman
lists08-bzr at davor.org
Fri Jul 9 15:40:26 BST 2010
On Thu, 2010-07-08 at 23:19 -0700, Chad Dombrova wrote:
> where i work we have a few users who are responsible for
> placing many large binary files (many greater than 500MB) under
> version control. then this main repo is shared perhaps
> hundreds of times by other users who need to utilize -- in a
> read-only fashion -- the data therein. each of these
> hundreds of shared repos could potentially check out a
> different revision into their working copy. with a normal dvcs
> that means a LOT of data checked out, and a lot of time spent
> checking it out, but for the shared repos, all of the disk
> space and time spent copying from repo to working copy is a
> waste because the owners of these repos only need read-only
> access to the data.
>
> so my thought was to create a loose object store as a first
> pass; it's fairly generic and others might find it useful. a
> second, optional layer would add a set of tools to provide fast
> read-only access to objects: it would remove writability from
> blobs as they entered the store, so that they can be safely
> hard-linked or symbolically linked into working copies. in
> this way, i can provide access to thousands of enormous files,
> hundreds of times over, with zero redundancy in the stores or
> the working copies.
I can't comment on your bzr-internals questions, but taking a step
back...
You seem to be stating a few requirements:
1. Avoid downloading *repository data* that is available locally,
or isn't needed locally in the first place
2. Avoid downloading *working-copy files* that are already
available locally; likewise, avoid regenerating them from
repository data
3. Don't blow up (or, though you didn't say this, page-thrash)
due to needing too much memory
Is that an accurate summation?
It's important not to conflate (1) and (2); they're quite
distinct.
Requirement (2), avoiding redundant copies or regenerations of
working files, is interesting and would be useful, but it seems
to me to be orthogonal to (1). That is, I don't see how bzr's
internal storage format would affect working-tree management
either way; I'd expect them to be at different layers entirely.
Actually, I could see implementing this as a fairly
straightforward script (python, shell, etc.; a rough sketch
follows the list):
- maintain a central, globally readable store of checked-out
  revisions (whether those are created via lightweight
  checkouts (option (a) below), branches within a shared repo
  (option (b) below) -- or, indeed, by something like rsync --
  is irrelevant to the current requirement)
- when a user says "I need to use rev. X", symlink from that
rev's directory within your central store into the user's
own workspace
- optionally, provide some mechanism to garbage-collect revs
from the central store that are no longer in use (i.e. only
their working trees; *not* the repository data).
Alternatively, if you have the disk space, just let the
revision store grow without bound, and prune it manually once
in a while.
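For instance, a minimal sketch in python (the store location,
server URL, and helper names here are all invented; the bzr
invocation itself is a stock "checkout --lightweight"):

    #!/usr/bin/env python
    # Sketch of the central revision store: one read-only tree
    # per revision under STORE, symlinked into users' workspaces.
    import os
    import subprocess
    import sys

    STORE = "/srv/rev-store"        # central, globally readable
    MAIN = "bzr+ssh://server/main"  # hypothetical master branch

    def ensure_rev(revid):
        """Populate the store for revid, if it isn't there yet."""
        dest = os.path.join(STORE, revid)
        if not os.path.isdir(dest):
            # A lightweight checkout stores no repository data
            # locally, only a little bookkeeping metadata.
            subprocess.check_call(
                ["bzr", "checkout", "--lightweight",
                 "-r", "revid:" + revid, MAIN, dest])
        return dest

    def use_rev(revid, workspace):
        """Symlink revid's working files into a user's workspace."""
        src = ensure_rev(revid)
        for name in os.listdir(src):
            if name == ".bzr":
                continue            # don't expose checkout metadata
            os.symlink(os.path.join(src, name),
                       os.path.join(workspace, name))

    if __name__ == "__main__":
        use_rev(sys.argv[1], sys.argv[2])

Garbage collection is the one piece that sketch leaves out;
symlinks don't bump a file's link count, so the store would need
its own record of which revisions are still referenced.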
As for memory issues (requirement 3), replacing or tuning the
2a repository format won't help unless all of the memory
problems are in the 2a code, and JAM seems to say this isn't
the case.
That leaves requirement (1), avoiding redundant downloads of
repository data.
Bazaar already offers a few possible solutions to this. Have you
considered:
a. Having your read-only users use *lightweight* checkouts
instead of branches (or heavy checkouts, which imply
branches). For a lightweight checkout, bzr only needs to
locally store a small amount of bookkeeping metadata, not
the actual (in your case, huge) repository data for each
revision. Thus, the repository data is never stored
locally.
b. Setting up a local shared repository, which your users all
share. This results in at most one copy of each revision's
repository data per host; *not* one copy per user.
c. Perhaps using stacked branches? Not sure; all I know about
these is what I've read on this list -- including that
they've had problems in the past -- but they might be worth
looking into.
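Concretely, something like this (the server URL and paths are
invented; the commands themselves are stock bzr):

    # (a) lightweight checkout: no repository data stored locally
    $ bzr checkout --lightweight bzr+ssh://server/main/trunk trunk

    # (b) one shared repository per host; branches created inside
    #     it share their revision storage
    $ bzr init-repo /srv/bzr-shared
    $ bzr branch bzr+ssh://server/main/trunk /srv/bzr-shared/trunk

    # (c) a stacked branch stores only revisions not already
    #     present in the branch it's stacked on
    $ bzr branch --stacked bzr+ssh://server/main/trunk trunk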
(b) offers its biggest win if many of your read-only users share
machines with each other (you don't say this is the case, but
it's implied by requirement (2)). If everybody's on their own
workstation, though, (b) clearly offers little savings.
Even if people do in fact share machines, there's a tradeoff
between:
- the cost of the one-time quasi-redundant download of each
revision's repository data, and
- a savings if, some time later, a user wants the same
  revision, in which case that revision's repository data is
  already cached locally
So the details of your usage pattern will determine whether (a)
or (b) is optimal.
All that said, if you're dead-set on a loose-object store, you
might be able to use git's directly, instead of engineering a
git-like one for bzr. That is, store the data in a real git
repo, but access that repo via bzr and the appropriate plugin(s).
I don't know what issues that might present, so you'd have to ask
others. This might be, for you, the best of both worlds -- the
git storage format, but not the git code.
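For what it's worth, git's loose-object format is simple enough
to sketch in a few lines of python (illustration only; in
practice you'd let git itself, or a library like dulwich, do
this). Note the chmod at the end -- the same remove-writability
trick you describe, which is what makes the objects safe to
hard-link:

    # How git stores a loose blob: the content is prefixed with
    # a "blob <size>\0" header, SHA-1'd to get its name, then
    # zlib-compressed and written under .git/objects/.
    import hashlib
    import os
    import zlib

    def store_blob(git_dir, content):
        blob = b"blob " + str(len(content)).encode() + b"\0" + content
        sha = hashlib.sha1(blob).hexdigest()
        path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
        if not os.path.exists(path):
            subdir = os.path.dirname(path)
            if not os.path.isdir(subdir):
                os.makedirs(subdir)
            with open(path, "wb") as f:
                f.write(zlib.compress(blob))
            os.chmod(path, 0o444)  # read-only; safe to hard-link
        return sha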
In summary, I can't help thinking that, in terms of the
requirements as I understand them (and again, please correct me
if I've *mis*understood them), digging into storage-format work
seems to be attacking the wrong problem.
- Eric