branch locking mk2.

John A Meinel john at arbash-meinel.com
Thu Feb 16 06:37:14 GMT 2006


Robert Collins wrote:
> On Wed, 2006-02-15 at 23:33 -0600, John A Meinel wrote:
...

> 
> When an index entry X references revision content in the knit, we need to
> resolve the reference to tell whether we have that content. There are two
> ways of doing this: download some part of the referenced thing, or have
> the unique id in the index.
> 
>> Do you mean without downloading the knit content, or do you mean being
>> able to skip over downloading that part of the index?
> 
> The latter.

But how do you know what to skip and what to read? Are you going to
guarantee an order?
I might have extra revisions in my local knit (say I uncommitted stuff).
So I don't know ahead of time how far I can jump in the remote knit to
get stuff that I need.
Maybe you can work backwards, and grab stuff off the back until you
figure out that you aren't missing anything.
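
The best I can come up with is something like this (a sketch only;
read_range() and parse_records() are made-up helpers, and I'm ignoring
records that straddle a chunk boundary):

    def fetch_tail_until_connected(read_range, size, have, chunk=64 * 1024):
        """Walk a remote index backwards until every referenced parent is
        either something we already have or something we just fetched."""
        found = {}                       # revision_id -> parsed record
        offset = size
        while offset > 0:
            start = max(0, offset - chunk)
            records = parse_records(read_range(start, offset))  # made-up helpers
            for rec in records:
                found[rec.revision_id] = rec
            wanted = set()
            for rec in found.values():
                wanted.update(rec.parents)
            wanted.difference_update(found)
            wanted.difference_update(have)
            if not wanted:               # nothing referenced is still unknown
                break
            offset = start
        return found

Whether that terminates early enough depends on what ordering you are
willing to guarantee, which is what I was asking about.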

> 
>>> hg uses exact-sized index records to achieve this AIUI, using file
>>> size mod record size to determine if there are partial records present.
>>> That scales [with a low constant factor] linearly with revisions though,
>>> which we want to avoid.
>>>
>>> Rob
>>>
>> The problem is my index isn't going to be the same as your index. So I
>> don't know what I need to skip. My understanding is that we always need
>> to download the complete index. So we need to keep it small. (hg fixes
>> it to something like 48-60 bytes).
> 
> say we have 60K revisions, and 3000 files modified once per revision
> (OK, this is an extreme example):
> 60K revisions * 60 bytes * 3K files = 10.8 * 10^9 bytes.
> So a single commit that touches each file (say a (c) update) in that
> repository will require uploading 60 * 3K bytes = 180Kbytes. And
> downloading ~10GB to determine that the basis entries are there.
> And downloading ~10GB to update from that repository.

Honestly, I think that if we have 60K revisions that each modify 3K
files, we are going to have *MUCH* bigger problems elsewhere than the
size of the indexes.
I don't think bzr is designed to scale to 60K * 3K = 180M changes. If you
are getting that big, I think we need to work out a way to expire old
history, perhaps with some kind of tiered access.

And since deltas are almost guaranteed to be larger than index entries,
you end up with something like 60K * 3K * ~1K, or ~180GB, of actual
content changes.
Which means you have to download ~180GB to 'bzr get' the project.
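
Just to make the back-of-the-envelope arithmetic explicit (the ~1K
average delta is my assumption, the other numbers are yours):

    revisions = 60 * 1000
    files = 3 * 1000
    index_record = 60      # bytes per hg-style index entry
    avg_delta = 1024       # assumed ~1K of delta text per file per revision

    index_total = revisions * files * index_record  # 10.8e9 bytes, your ~10GB
    delta_total = revisions * files * avg_delta     # ~1.8e11 bytes, ~180GB
    commit_index = files * index_record             # 180K of new index per commit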

I really think you are optimizing the wrong thing. Having to read an
entire index file scales badly when you have 60K * 3K changes, yes, I
completely agree. But I haven't seen a reasonable solution for how I
know what I can and can't download. If you come up with one, more power
to you, and I would be happy to see it.

> 
>> I don't know how we can prevent scaling indexes to O(revisions). My goal
>> is just to make the constant factor small enough that the important
>> scaling is the O(delta size) for the actual knit content.
> 
> The answer is simple: don't depend on any data in the knit when you
> upload. I don't mean 'don't upload deltas', I mean 'don't substitute data
> like the revision id for data like knit indices.' With a self-delimiting
> format and no external references, you upload marginally more data, but
> you can pull in O(revisions-needed) rather than O(total history) in the
> best case, and I think in the average case it's O(log(total history)),
> rather than the pessimistic approach hg has.

I don't know how we are fleshing out the knit data itself. I was
thinking at one point we would use per-line annotation. But I suppose
you could translate that locally for each delta.

*However*, you are trading off scaling an index (which is by nature
fairly small) for scaling deltas, which are already big.

If you can give me an accurate way to know ahead of time what portions
of the index I'm going to care about, then I'll concede that we don't
want knit-wide revision-id=>short-id mappings. (We can do something per
delta, or something like that).

Now, doing it per delta also has the advantage that we don't have to
translate when we pull. But I think we should be checking the contents
anyway, so the translation is not a big expense. (The problem with
weaves is that we have to re-merge the new texts. With knits we probably
should expand the text to verify the sha1 sum, but that is much cheaper
than creating a new diff).
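
By "expand the text to verify the sha1 sum" I mean something like this
(a sketch only; get_text() stands in for however the knit layer
reconstructs a full text from its deltas):

    import sha

    def check_pulled_text(knit, version_id, expected_sha1):
        """Expand the stored deltas back to a full text and compare checksums.

        This is much cheaper than the re-merge that weaves force on us.
        """
        text = knit.get_text(version_id)        # hypothetical accessor
        actual = sha.new(text).hexdigest()
        if actual != expected_sha1:
            raise AssertionError('sha1 mismatch for %s: %s != %s'
                                 % (version_id, actual, expected_sha1))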

> 
> rough estimate, say we have:
> 4 bytes of MAGIC for the header.
> length-prefixed string for revid (16-bit prefix)
> length-prefixed array of parent ids in the knit [each is a length-prefixed
> string], and the array is 16-bit length-prefixed
> 128-bit file index for start of content
> 32-bit content length
> 32-bit integrity checksum
> ====
> if each commit has a 40-byte revision id and 2 parents, the above example
> gets:
> each header:
> 4 + (2 + 40) + (2 + 2*(2+40)) + 16 + 4 + 4 = 156 bytes.
> total metadata in the repo:
> 60K * 156 * 3K = 28.08 * 10^9 bytes
> now, if we assume that on average the revision we are going to be
> committing against is in the last 5 average-sized records, we need to
> download 5 * 156 * 3K = ~2.3M to prepare our upload data when committing,
> and we need to upload 468K to do the commit.
> 
> That's why scaling O(change), not O(history size), is -so- important.
> 
> Rob
> 

This sounds like you are just downloading the last X bytes from the
remote knit. Is that how you are guessing what data you need?
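
For reference, here is how I read your proposed record layout (a sketch
only; the magic value and the big-endian packing are my guesses, the
field sizes are from your list):

    import struct

    MAGIC = 'KNIT'  # placeholder; you only said "4 bytes of MAGIC"

    def parse_record(data, offset=0):
        """Parse one self-delimiting index record from a chunk of the index."""
        assert data[offset:offset + 4] == MAGIC
        pos = offset + 4
        (rev_len,) = struct.unpack('>H', data[pos:pos + 2]); pos += 2
        revision_id = data[pos:pos + rev_len]; pos += rev_len
        (num_parents,) = struct.unpack('>H', data[pos:pos + 2]); pos += 2
        parents = []
        for _ in range(num_parents):
            (plen,) = struct.unpack('>H', data[pos:pos + 2]); pos += 2
            parents.append(data[pos:pos + plen]); pos += plen
        # 128-bit content offset, 32-bit content length, 32-bit checksum
        hi, lo, content_len, checksum = struct.unpack('>QQLL', data[pos:pos + 24])
        pos += 24
        content_start = (hi << 64) | lo
        return revision_id, parents, content_start, content_len, checksum, pos

I assume the MAGIC is there partly so you can resync to a record
boundary when you grab an arbitrary tail of the file?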

John
=:->

