branch locking mk2.

Robert Collins robertc at robertcollins.net
Thu Feb 16 06:12:54 GMT 2006


On Wed, 2006-02-15 at 23:33 -0600, John A Meinel wrote:
> Robert Collins wrote:
> > On Wed, 2006-02-15 at 22:09 -0600, John A Meinel wrote:
> >> There will probably be some control files that are still atomically
> >> replaced, like revision-history (as long as that exists).
> >> Also, I don't know how we are trying to keep knit index files intact.
> >> I'm guessing we just expect that adding 1 line will be an atomic
> >> action,
> >> as long as we add less than X number of bytes.
> >> I can think of some things we could do (like adding a checksum +
> >> line-length at the beginning of each line, along with a line delimiter
> >> before the line (a simple newline could do)).
> >> Then we can throw out index lines that don't match their checksum. And
> >> the delimiter makes sure that we know where the entries start & stop. 
> > 
> > Right. The index has to be self delimiting and self verifiable as well
> > as not requiring a full download to retrieve revision references. I
> > haven't checked if this is the case yet. 
> > 
> > I.e. in the index, if we have
> > ------ 
> > record A
> > ------
> > partial record
> > ------
> > record B
> > parent: revision in record A
> > ------
> > 
> > 
> > firstly the partial record must not be able to confuse the parser.
> > secondly if I have the revision in record A, I must be able to tell what
> > it was without downloading its index entry. (Otherwise we scale
> > O(revisions) rather than O(delta size).)
> 
> I don't understand what you mean here about needing to 'tell what it was
> without downloading its index'.

When an index entry X references revision content in the knit, we need to
resolve the reference to tell whether we have that content. There are two
ways of doing this: download some part of the referenced thing, or have
its unique id in the index.
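
(A minimal sketch of the second option, with made-up names, not bzrlib
code: if every index entry carries the full revision id, checking a parent
reference is a purely local lookup.)

  def have_parent(parent_revid, local_revids):
      # local_revids: set of revision ids whose content we already hold
      return parent_revid in local_revids

  # If instead the entry said "parent = record #42", we would have to fetch
  # index entry #42 (or the content itself) just to learn which revision it
  # points at.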

> Do you mean without downloading the knit context, or you mean being able
> to skip over downloading that part of the index?

The latter.

> > 
> > hg uses exact-sized index records to achieve this AIUI, using file
> > size mod record size to determine if there are partial records present.
> > That scales [with a low constant factor] linearly with revisions though,
> > which we want to avoid.
> > 
> > Rob
> > 
> 
> The problem is my index isn't going to be the same as your index. So I
> don't know what I need to skip. My understanding is that we always need
> to download the complete index. So we need to keep it small. (hg fixes
> it to something like 48-60 bytes).

Say we have 60K revisions and 3000 files, each modified once per revision
(OK, this is an extreme example):
60K revisions * 60 bytes * 3K files = 10.8 * 10^9 bytes.
So a single commit that touches each file (say, a copyright-header update)
in that repository will require uploading 60 * 3K bytes = 180 Kbytes,
downloading ~10 GB to determine that the basis entries are there, and
downloading ~10 GB to update from that repository.
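
(A quick back-of-envelope check of those figures; the numbers are the
hypothetical ones from the example above, nothing measured:)

  revisions = 60000
  files = 3000
  record_bytes = 60                          # hg-style fixed-size index record

  print(revisions * record_bytes * files)    # 10800000000, i.e. ~10.8 GB of index
  print(files * record_bytes)                # 180000 bytes uploaded for one commit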

> I don't know how we can prevent scaling indexes to O(revisions). My goal
> is just to make the constant factor small enough that the important
> scaling is the O(delta size) for the actual knit content.

The answer is simple: don't depend on any data in the knit when you
upload. I don't mean 'don't upload deltas', I mean 'don't replace data
like the revision id with references into the knit index.' With a
self-delimiting format and no external references you upload marginally
more data, but you can pull in O(revisions-needed) rather than
O(total-history) in the best case, and I think O(log(total-history)) in
the average case, rather than the pessimistic approach hg has.

A rough estimate; say we have:
4 bytes of MAGIC for the header
a length-prefixed string for the revid (16-bit prefix)
a length-prefixed array of parent ids in the knit [each a length-prefixed
string], with the array itself 16-bit length prefixed
a 128-bit file offset for the start of the content
a 32-bit content length
a 32-bit integrity checksum
====
If each commit has a 40-byte revision id and 2 parents, the above example
gives:
each header:
4 + (2 + 40) + (2 + 2*(2+40)) + 16 + 4 + 4 = 156 bytes.
total metadata in the repo:
60K * 156 * 3K = 28.08 * 10^9 bytes.
Now, if we assume that on average the revision we are committing against
is in the last 5 average-sized records, we need to download
5 * 156 * 3K = 2.34M to prepare our upload data when committing, and we
need to upload 156 * 3K = 468K to do the commit.
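
(To make the layout concrete, here's a rough Python sketch of packing one
such entry. The MAGIC value and function name are made up; the field
layout is just the one described above, and the 156-byte per-header figure
falls out of it.)

  import struct, zlib

  MAGIC = b"KIDX"   # hypothetical 4-byte magic, purely for illustration

  def pack_entry(revid, parents, offset, length):
      # 16-bit length-prefixed revid
      body = struct.pack(">H", len(revid)) + revid
      # 16-bit parent count, then each parent as a 16-bit length-prefixed id
      body += struct.pack(">H", len(parents))
      for parent in parents:
          body += struct.pack(">H", len(parent)) + parent
      body += offset.to_bytes(16, "big")        # 128-bit start-of-content offset
      body += struct.pack(">I", length)         # 32-bit content length
      entry = MAGIC + body
      entry += struct.pack(">I", zlib.crc32(entry) & 0xffffffff)  # 32-bit checksum
      return entry

  entry = pack_entry(b"x" * 40, [b"y" * 40, b"z" * 40], offset=0, length=1000)
  print(len(entry))    # 156, matching the per-header figure above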

That's why scaling as O(change) rather than O(history size) is -so- important.

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.