BTree + CHK Inefficiencies

Maritza Mendez martitzam at gmail.com
Fri Aug 6 19:19:48 BST 2010


Related question for you guys...

Given that memory consumption associated with and/or triggered by a given
file has (at least) three drivers -- actual size of the file; number of
revisions of the file; and dynamic range of sizes of deltas-over-time
committed for the file -- is there any advantage to splitting large
versioned binaries into their own repo?

Suppose we split our BigDaddyRepo into two repos: BigBinariesRepo (still
pretty big) and MostlyTextRepo (perhaps only 20% as big).

Obviously, the small repo gets a huge advantage in size and speed.  But
suppose that BigBinariesRepo really isn't static and receives commits just as
frequently as MostlyTextRepo.  (Imagine a product which is database-driven
with frequent versioned updates to the databases.)  So there is just as much
activity as before.  Do you expect the aggregate of the two repos to perform
better, worse, or the same as one BigDaddyRepo?  In other words, is there
any advantage to combined storage, and to what extent is that benefit
negated by mixing binary and text data in a single repo?  Or are the two
sets of data+history essentially non-interacting?
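
To make the scenario concrete, a quick way to guess at such a split is to
bucket the working tree by extension, e.g. with a little stdlib-only script
like the one below.  The extension list is invented for illustration (it is
nothing bzr knows about), and it only measures one revision's worth of
content, not history:

    import os
    import sys
    from collections import defaultdict

    # Extensions we happen to treat as "big binaries"; adjust to taste.
    BINARY_EXTS = set(['.db', '.bin', '.zip', '.png', '.jpg', '.dat'])

    def tally(root):
        # bucket -> [total bytes, file count]
        totals = defaultdict(lambda: [0, 0])
        for dirpath, dirnames, filenames in os.walk(root):
            if '.bzr' in dirnames:
                dirnames.remove('.bzr')   # skip repository internals
            for name in filenames:
                ext = os.path.splitext(name)[1].lower()
                bucket = 'binary' if ext in BINARY_EXTS else 'text'
                size = os.path.getsize(os.path.join(dirpath, name))
                totals[bucket][0] += size
                totals[bucket][1] += 1
        return totals

    if __name__ == '__main__':
        for bucket, (nbytes, nfiles) in sorted(tally(sys.argv[1]).items()):
            print('%-6s %6d files  %8.1f MB' % (bucket, nfiles, nbytes / 1e6))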

I guess what I'm really asking for is a primer on the big-O({N},{n}) scaling
of bzr, where {N} is the ordered set of revision counts for each type of data
(binary and text) and {n} is the ordered set of average file sizes for the
corresponding types.  I realize the question will have different answers for
different operations.  The most interesting to me are branch, commit and
stat.  I don't expect anyone to produce a treatise on this, but general
advice on whether and when it makes sense to break up repos based on size
and content type would be super useful.  Because no matter how fast you guys
make bzr, there will always be a user pushing the limits.  Knowing how to
keep ourselves out of trouble could be just as valuable as knowing how to
get out of trouble.
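
To pin down that notation, I mean something like the toy skeleton below,
where the f_op scaling functions are exactly the unknowns I'm asking about
(all numbers are invented):

    # Per-type revision counts {N} and average file sizes {n} (made up):
    N = {'binary': 5000, 'text': 20000}   # revisions per content type
    n = {'binary': 50e6, 'text': 20e3}    # average file size in bytes

    def estimated_cost(f_op, N, n):
        # Assumes per-type costs simply add; whether binary and text
        # history really are non-interacting is part of the question.
        return sum(f_op(N[t], n[t]) for t in N)

    # e.g. estimated_cost(f_branch, N, n), estimated_cost(f_commit, N, n)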

Thanks
~M



On Fri, Aug 6, 2010 at 1:31 AM, John Szakmeister <john at szakmeister.net> wrote:

> On Thu, Aug 5, 2010 at 10:07 PM, John Arbash Meinel
> <john at arbash-meinel.com> wrote:
> [snip]
> > Well, if you are interested in helping out, you can do the checkout with
> > extensions, and get a memory dump.
> > You'll need 'Meliae' which is my memory debugging python library. 'bzr
> > branch lp:meliae'.
>
> I'd love to, but I can't give you a memory dump. :-(  But...
>
> [snip]
> > Now, it is possible that the memory consumed is actually because of
> > individual file content, and not what I'm doing here (which is more
> > about lots of inventory data, aka lots of small files).
>
> I think this is my problem anyways.  They checked in some rather large
> files at one point, and it appears to be consuming several times the
> file size during checkout.  It seems to also be related to how many
> revisions of the file were made... but it's been a while since I looked
> at this, and once we found another way to do what we need, we moved
> on, so I didn't spend much more time looking at it. :-(
>
> -John
>
>
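
P.S. For anyone else who wants to try the Meliae route John describes above,
my reading of the lp:meliae documentation is that the workflow is roughly
the following.  I have not run it against our tree yet, and the dump path is
just an example:

    # Step 1: inside the running bzr process, at (or near) the point where
    # memory peaks (e.g. from a plugin or a debugger breakpoint), dump all
    # live Python objects to a file:
    from meliae import scanner
    scanner.dump_all_objects('/tmp/bzr-checkout.json')

    # Step 2: later, in a separate Python session, load and summarize it:
    from meliae import loader
    om = loader.load('/tmp/bzr-checkout.json')
    print(om.summarize())   # memory totals broken down by object type

The summary should at least show whether the memory is dominated by
inventory-ish objects (lots of small files) or by a few huge file texts,
which seems to be the distinction John Arbash Meinel was drawing.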