Autopack over dumb transports causes excessive bandwidth use.

Gary van der Merwe garyvdm at gmail.com
Wed Sep 9 19:57:10 BST 2009


Thanks for the reply, John. I was quite frustrated when I wrote this
mail; I'm much calmer now...


On Wed, Sep 9, 2009 at 4:43 PM, John Arbash Meinel
<john at arbash-meinel.com> wrote:
> This seems wrong. Once a pack file has been 'autopacked' it should then
> contain 10 revs, and not be scheduled for repacking until you have 100 revs.

Ah - I just checked, and I see that this repo just passed 100 revs...
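
If I understand the policy correctly, the number of pack files allowed
is the sum of the decimal digits of the revision count, so crossing 100
revs is exactly when a big repack gets triggered. A rough sketch of
that heuristic - illustrative Python, not the actual bzrlib code:

def max_pack_count(total_revisions):
    # Packs allowed = sum of the decimal digits of the revision count,
    # e.g. 99 revs allow 9 + 9 = 18 packs, but 100 revs allow only 1,
    # so passing 100 forces everything into a single pack.
    if total_revisions == 0:
        return 1
    return sum(int(digit) for digit in str(total_revisions))

def needs_autopack(pack_count, total_revisions):
    return pack_count > max_pack_count(total_revisions)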

> This should also not really be happening, but it would depend on some
> specifics about how the groups are laid out, etc. I'm guessing you
> aren't positive about the numbers other than knowing that they are
> "bigger than I would like".

Yes - I was guessing. I need to get a better understanding of what's
happening, and then do some proper measuring.

> 1) Don't use dumb transports. This may not be an option, but the smart
> server already knows how to take a group and break it apart when only
> some of the content is requested. (eg, fetching (f4, r10) and (f4, r9)
> will create a new group 'on-the-fly' and send it over the wire, rather
> than transmitting the entire 2MB group.)
>
> The smart server *also* knows how to do autopacking locally. So when the
> target repository needs to be repacked, your local client doesn't have
> to download anything.

Not an option :-(
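
(For my own understanding, the on-the-fly splitting described above is
conceptually something like the following - illustrative only, with
zlib standing in for the real groupcompress format and made-up names:)

import zlib

def split_group(group_bytes, wanted_keys, offsets):
    # group_bytes: one large compressed group (e.g. the 2MB group above)
    # offsets: key -> (start, length) into the uncompressed group data
    raw = zlib.decompress(group_bytes)
    pieces = [raw[s:s + l] for s, l in (offsets[k] for k in wanted_keys)]
    # Only the requested texts get recompressed and sent over the wire,
    # not the whole group.
    return zlib.compress(b''.join(pieces))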

> 3) Look into teaching autopack to use local content if possible.
>
> This is a harder sell, and more work, but a bigger potential win for
> you. Basically, at 'push needs to autopack time' we wouldn't have to
> re-download all that 20MB that we already have locally. It is orthogonal
> to (2), in that it wouldn't change the final content on the remote site.
> It also doesn't change how much you actually upload, but it would reduce
> your download a bit.

That would be nice.
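
Something like this, I imagine (entirely hypothetical names, just to
pin down the idea):

def fetch_for_repack(key, local_repo, remote_repo):
    # At 'push needs to autopack' time, prefer bytes we already have
    # locally over re-downloading them from the remote.
    if local_repo.has_key(key):
        return local_repo.get_bytes(key)     # already on disk, free
    return remote_repo.download_bytes(key)   # only fetch what we lack

The upload and the final remote content stay the same; only the
download shrinks.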

> 4) Teach autopack to deal in content size, rather than 'number of
> revisions'. At the moment, autopack considers every commit to be
> approximately the same size. Which is generally true of 'steady state',
> but ignores the fact that initial commit is often an import which is >>>
> every other commit. (In the case of MySQL, I believe this is actually
> commit 3 or so, because of how "bk init" worked back when they started.)
>
> This is also a bit harder than 1 or 2. The main problem is that
> number-of-keys is cheap to determine (it is in the header of every btree
> index), but bytes-in-the-pack is not. We only have that information by
> reading all of the indexes for a given pack file and finding the largest
> reference. (a is at 100, b is at 200, c is at 300, pack must be >300 bytes.)
>
> Most likely, that object will be a text key, because of the standard
> order of insertions (revs, inventories, chks, texts). Though there is
> nothing that requires it. And conversions actually fetch as (texts,
> inventories, chks, revs) because that was the order required by 'knit'
> repositories.
>
> I suppose we could look at modifying the btree header and have "last
> reference" sort of thing, but there are some layering violations there.
> (btree's don't really know what the 'value' field means, so you'd have
> to pass in a callable to evaluate it, etc.)

It makes sense to me to try implementing this in a plugin. I'll
investigate this.
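
My first thought for the size estimate (hypothetical index API, just
to capture the "largest reference" idea above):

def estimate_pack_size(indexes):
    # Each index entry references (start, length) in the pack file, so
    # the largest end offset is a lower bound on the pack's size, e.g.
    # entries at 100, 200 and 300 mean the pack is >300 bytes.
    largest = 0
    for index in indexes:                    # revs, invs, chks, texts
        for key, (start, length) in index.iter_entries():
            largest = max(largest, start + length)
    return largest

We'd have to walk all four indexes, since as you say nothing guarantees
the last object is a text key.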

Thanks again for the reply, John. It's given me a much better
understanding of what's going on.

Gary


