Recommended backup procedure and preserving my data...

John Arbash Meinel john at arbash-meinel.com
Fri Oct 16 15:27:55 BST 2009


John Szakmeister wrote:
> On Thu, Oct 15, 2009 at 2:22 PM, John Arbash Meinel
> <john at arbash-meinel.com> wrote:
> [snip]
>> There has been some work to make this possible. The current main problem
>> is that bzr can represent "ghosts" (revisions whose identifier we know,
>> but where we do not have the actual content for the revision.)
> 
> For my own edification, where do those situations come up?

99.9% of the time they come from conversions from other systems. The
other 0.1% come from using a version of bzr from before 'merge' fetched
the merged revisions. (bzr's own codebase is probably the only place
where that case still shows up.)

bzr-svn, for example, can produce ghosts, namely by doing:

bzr co $UPSTREAM upstream
bzr branch upstream local
cd local
bzr commit   # (repeat a few times, creating revisions that exist only locally)
cd ../upstream
bzr merge ../local
bzr commit -m "Merge my changes back to svn."

At this point, *SVN* records that this revision was a merge, but svn does
not have the "local" commits. It has a pointer to a revision, but none of
the data for that revision.

If someone else independently then does:

bzr co $UPSTREAM upstream

They have ghosts until they somehow get direct access to your "local"
branch.

So in *your* case, this probably never happens.


...

> [snip]
>>> It doesn't need to be absolutely minimal churn... I can cope with the
>>> autopacking.  We don't have much (in terms of size), but we have 50 or
>>> more Subversion repositories at the moment.  And it seems to grow
>>> every week. :-)
>> As for Bazaar repos, you can have as many or as few as works for you.
>> You can share multiple projects in one repo, or have one repo per
>> project, or one repo per branch... The actual layout tends to be
>> dictated by access control (balanced against disk storage).
> 
> Is there some limit on throughput?  Seems like at some point that would
> have to become a factor, since the actual revisions would be in
> the shared repo, correct?
> 
> I'd more than likely go with the shared-repo-per-project approach.

'Limit on throughput'? Two people cannot update the same branch at the
same time (we take a lock while you are pushing your changes).
However, the repository is designed so that new pack data is written to
'upload', renamed into place, and then we only need to hold a physical
lock for the brief moment it takes to update the 'pack-names' file
(take the lock, read the file, compare it with our known info, update
accordingly, write it out, unlock).

So by design, you can have a lot of concurrent writers to the same
repository. Potentially they can write redundant data, but we will
filter that out during autopack, etc.
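As a concrete sketch of what that means in practice (the host and branch
paths below are made up), two people pushing to different branches in the
same shared repository at the same time is fine:

bzr push bzr+ssh://host/srv/bzr/project/feature-a   # developer A
bzr push bzr+ssh://host/srv/bzr/project/feature-b   # developer B, concurrently

Each push writes its new pack data independently, and any duplication
between them gets cleaned up by a later autopack.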

For *readers*, most of the repository data is static (until an autopack),
and the reader code has been taught: "if something goes missing,
checkpoint, reread pack-names, and try again."
If you had *tons* of concurrent writing that was triggering lots of
autopacking, you might get a couple of restarts. However, the data you
already read is still good, and because pack sizes grow exponentially,
you need a lot of new data to affect the old data more than once.

Let's say you start with a repository that has 1001 commits. 1000 of them
are in a big pack file, and 1 of them is in a little pack file.

Say you start fetching the data for those 1001 commits, and somebody
pushes up 10 new commits and triggers an autopack. That will *only*
affect the 1-revision pack file. But let's say that happens, so you
restart the read. They then have to push up ~90 more changes to get that
10-revision pack file repacked into a 100-revision pack file.
(During that time, they will create 9 packs, then autopack those into a
10-revision pack, but all of that data is newer than the view when you
started fetching, so you won't try to access any of the new data.)

If by some chance they still manage to generate 100 new revisions and
trigger another autopack before you get the data for that 1 revision,
you will start over again. Only this time it takes ~900 new revisions to
repack the 100-revision pack into a 1000-revision pack.

So *if* you are getting 1000 new revisions in the time it takes you to
fetch the whole data, you might restart 3 times.

All of this changes if you have someone / something run 'bzr pack'
manually, as that says 'move all data into a single new pack file',
which touches everything. However, that should be quite rare; autopack
generally takes care of what you really care about. And if you *really*
need it to be run, I would cron it to run at whatever time the repo
has the least activity.
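For example (the path and schedule here are just placeholders, pick
whatever suits your setup), a weekly repack during the quiet hours could
be as simple as:

# crontab entry: full repack of the shared repository, Sundays at 3am
0 3 * * 0  bzr pack /srv/bzr/myproject

'bzr pack' takes a branch or repository location, so pointing it at the
shared repository repacks all the data stored there.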


So in summary, we've designed it to be fairly resilient to concurrent
readers and writers. It may not be perfect, but it is pretty darn good.

Oh, and I should note that if you are using smart requests, the
'restart' is done server-side. I've had someone run "bzr pack" on my
repository while I was fetching, and I never noticed a hiccup in the
data stream.
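Roughly speaking, whether you get smart requests depends on the URL
scheme (the host and paths below are only examples):

bzr branch bzr+ssh://host/srv/bzr/project/trunk   # smart server: restart handled server-side
bzr branch sftp://host/srv/bzr/project/trunk      # dumb transport: the client rereads pack-names itself

bzr:// and bzr+ssh:// always talk to a smart server, while plain sftp://
or static http:// access reads the repository files directly from the
client.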


...
> 
> I guess I still have the issue mentioned here though:
>     http://doc.bazaar-vcs.org/latest/en/user-guide/http_smart_server.html#pushing-over-bzr-http
> 
> That is, I need to do something so that I can segregate write access
> and read access.

Do you want to allow anonymous access? Or are you just saying that you
want some people to have read-only access to a given branch, and others
to have write access?

I don't know if that is easy to do with Apache's proxying. (Though you
could have a read-only smart server and a read-write one both running,
and Apache's ACLs would route each request to the server that gives
the appropriate response.)

You might also want to look at "contrib/bzr_access", which uses a single
ssh user and controls ACLs based on the ssh key presented. It also gives
a restricted shell, so users can't log in directly and do anything; they
can only spawn 'bzr' and talk to it.
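The general shape of that setup, if you wanted to roll it by hand (the
paths, options, and key below are illustrative, not bzr_access's exact
invocation), is one account whose authorized_keys forces every connection
into a restricted smart server:

command="bzr serve --inet --directory=/srv/bzr --allow-writes",no-port-forwarding,no-pty ssh-rsa AAAA... alice@example.com

The key that connects picks the forced command (and so the access level),
and because the command is forced, the user never gets a real shell.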

There is also some work that has been done for ACLTransport, but I don't
think that is complete.

John
=:->
