Some notes on distributed SCM
Matt Mackall
mpm at selenic.com
Sun Apr 10 05:28:22 BST 2005
(Martin and Daniel, this is slightly updated)
General thoughts on repository scalability
Some numbers for the kernel repo:
242M of sources unpacked, ~350M compiled
17000 files (~12 changes per file)
65000 changesets
200000+ total file revisions and growing rapidly
bk repo is ~250M compressed, 500M uncompressed
Fitting used parts of repo + source + object files + rest of OS in
about 1G of RAM is a good target.
Best case asymptotic performance:

 get single revision   O(1) seeks        O(filesize) CPU
 get single delta      O(1) seeks
 annotate              O(1) seeks if we store parallel annotations
 append                O(1) seeks
 truncate              O(1) seeks
 commit                O(changed files)
 revert                O(changed files)
 diff                  O(changed files)
 merge                 O(changed files) seeks
                       O(equivalent patch) time
                       O(equivalent patch) network I/O
 local clone           O(files) hardlinks
 remote clone          O(revisions) bytes transferred
I think with something like the revfile format and some careful
attention to the merge algorithm, the above is entirely doable.
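
As a very rough illustration of the kind of revfile layout that could give
the O(1)-seek behavior above: a fixed-size index entry per revision,
pointing into a separate data file of delta chains. This is a minimal
sketch under assumed field names and an assumed on-disk format (offset,
length, delta-chain base, linked changeset, checksum); apply_delta stands
in for whatever delta encoding is actually chosen.

# Hypothetical revfile index: one fixed-size entry per revision, so
# locating revision N in the index is a single seek.  All field choices
# below are illustrative assumptions, not a specification.
import struct
from collections import namedtuple

INDEX_FORMAT = ">QIII20s"     # offset, length, base, linkrev, sha1
INDEX_SIZE = struct.calcsize(INDEX_FORMAT)

IndexEntry = namedtuple("IndexEntry", "offset length base linkrev sha1")

def read_entry(index_file, rev):
    # Fixed-size entries: revision number -> byte position, one seek.
    index_file.seek(rev * INDEX_SIZE)
    return IndexEntry(*struct.unpack(INDEX_FORMAT,
                                     index_file.read(INDEX_SIZE)))

def read_revision(index_file, data_file, rev, apply_delta):
    # Walk the delta chain from its base revision (a full text) up to
    # rev.  The chain is stored contiguously in the data file, so one
    # seek plus one sequential read fetches everything needed.
    entries = [read_entry(index_file, r)
               for r in range(read_entry(index_file, rev).base, rev + 1)]
    start = entries[0].offset
    data_file.seek(start)
    blob = data_file.read(entries[-1].offset + entries[-1].length - start)
    text = None
    for e in entries:
        chunk = blob[e.offset - start:e.offset - start + e.length]
        text = chunk if text is None else apply_delta(text, chunk)
    return text

With this shape, fetching any single revision is one seek into the index
plus one seek and one sequential read in the data file, which is where
the O(1) seeks / O(filesize) CPU line above comes from.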
Thoughts on required repo data structures
 everything is either an append-only log or an append-only index
 an index may or may not have an associated logfile
 some logs are revision logs, containing incremental deltas -> revfile
 some logs are nonincremental -> logfile
 each log entry has a checksum of the uncompressed data

 branch-id index
  each repo has a unique branch-id uuid generated at branch time
  current id on tip of index

 merge logfile per remote uuid
  tip contains highest numbered known common changeset on local and
  remote (this speeds up merge)

 changeset logfile
  contains changeset uuid:local changeset number and list of local
  (file #, rev #) pairs updated in that changeset (an illustrative
  encoding follows this list)

 file revfile
  contains revisions of text (possibly with mapping of line#:rev#)

 directory revfile
  contains mapping of filename -> file#:rev#
  treated very similarly to file revfile by merge
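
Following up on the changeset logfile item above, here is one possible
encoding of a single entry: the changeset uuid, its local changeset
number, and the (file #, rev #) pairs it updates. The length-prefixed
binary layout is purely an assumption for the example; any serialization
carrying the same fields would do.

# Illustrative changeset logfile entry: uuid + local changeset number
# plus the (file #, rev #) pairs updated in that changeset.
import struct
import uuid

def pack_changeset(cs_uuid, local_num, file_revs):
    # file_revs is a list of (file_number, rev_number) pairs.
    header = struct.pack(">16sII", cs_uuid.bytes, local_num, len(file_revs))
    body = b"".join(struct.pack(">II", f, r) for f, r in file_revs)
    return header + body

def unpack_changeset(data):
    cs_bytes, local_num, count = struct.unpack(">16sII", data[:24])
    pairs = [struct.unpack(">II", data[24 + 8 * i:32 + 8 * i])
             for i in range(count)]
    return uuid.UUID(bytes=cs_bytes), local_num, pairs

# Example: local changeset 42 touches file 3 at rev 17 and file 9 at rev 4.
entry = pack_changeset(uuid.uuid4(), 42, [(3, 17), (9, 4)])
assert unpack_changeset(entry)[1] == 42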
Rough outline of distributed merge algorithm
(assumes http range requests or some similar mechanism; a sketch of
such a span fetch follows the outline below)
no CPU overhead on server
can be pipelined nicely
def merge(remote):
    grab remote branchid
    grab tail of remote changeset log index
    fill hash with tail of local changeset log index
    for each remote changeset not in local log:
        queue changeset

    grab spans of remote changeset log data
    for changeset in queued remote changesets:
        allocate local changeset number
        for file, revision in changeset:
            changedfiles[file].append((changeset, revision))

    while files remain in changedfiles:
        for file in changedfiles:
            remember rollback point
            grab update span
            for changeset, revision in changedfiles[file]:
                find most recent changeset touching file in remote
                  changeset log
                if not automatic_merge:
                    queue merge for manual merge
                    queue rest of merges for this file for later
                      automatic merge
                    next file
        process queue of manual merges
        requeue postponed automatic merges in changedfiles

    commit local changesets to changeset index or roll back everything
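
Since the outline assumes http range requests, "grab spans of remote
changeset log data" and "grab update span" reduce to plain byte-range
fetches against static files. A minimal sketch, assuming a hypothetical
<repo>/changeset.log URL layout (only the Range header mechanics are
standard HTTP):

# Fetch a byte span of a remote log with a single HTTP range request.
import urllib.request

def grab_span(repo_url, filename, start, end):
    # Fetch bytes [start, end] of a remote file in one request.
    req = urllib.request.Request(
        repo_url.rstrip("/") + "/" + filename,
        headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        # A server honoring the range replies 206 Partial Content.
        return resp.read()

Because the server only hands out static files, many such requests can
be issued back to back or in parallel, which is where the "no CPU
overhead on server" and pipelining points above come from.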
--
Mathematics is the supreme nostalgia of our time.