BTree + CHK Inefficiencies

Maritza Mendez martitzam at gmail.com
Sat Aug 7 00:01:14 BST 2010


On Fri, Aug 6, 2010 at 2:46 PM, John Arbash Meinel
<john at arbash-meinel.com> wrote:

> Maritza Mendez wrote:
> > Related question for you guys...
> >
> > Given that memory consumption associated with and/or triggered by a
> > given file has (at least) three drivers -- actual size of the file;
> > number of revisions of the file; and dynamic range of sizes of
> > deltas-over-time committed for the file -- is there any advantage to
> > splitting large versioned binaries into their own repo?
> >
> > Suppose we split our BigDaddyRepo into two repos: BigBinariesRepo (still
> > pretty big) and MostlyTextRepo (perhaps only 20% as big).
> >
> > Obviously, the small repo gets a huge advantage in size and speed.  But
> > suppose that BigBinariesRepo really isn't static and has commits just as
> > frequently as MostlyTextRepo.  (Imagine a product which is
> > database-driven with frequent versioned updates to the databases.)  So
> > there is just as much activity as before.  Do you expect the aggregate
> > of the two repos to perform better, worse, or the same as one
> > BigDaddyRepo?  In other words, is there any advantage to combined
> > storage, and to what extent is that benefit negated by mixing binary and
> > text data in a single repo?  Or are the two sets of data+history
> > essentially non-interacting?
>
> I would expect that with a mixed repository, the big content will be
> autopacked slightly more often than if the repositories were separate.
> Any given piece of content is generally repacked in a 'log' fashion
> (every 10th commit, then every 100th, then every 1000th, etc.), so it
> would probably be repacked more often, but only on the order of 10%
> more often, not 10x more often.
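
[For anyone trying to picture that 'log' schedule, here is a tiny,
purely illustrative Python sketch; the function name is invented and
this is not bzr's actual autopack code.]

    def repack_level(commit_number):
        """How deep a repack this commit would trigger under a log-style
        schedule: 0 = none, 1 = fold roughly the last 10 packs, 2 = the
        last 100, and so on.  Any given text therefore gets rewritten
        only about log10(total_commits) times."""
        level = 0
        n = commit_number
        while n > 0 and n % 10 == 0:
            level += 1
            n //= 10
        return level

    # e.g. repack_level(10) == 1, repack_level(100) == 2, repack_level(250) == 1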
>
> We don't (yet) pay attention to the actual size of the pack files,
> though there is an open bug requesting this. (The biggest effect is in
> initial-import cases, where you'll have a 1-revision pack file that has
> a huge amount of content, and you'd rather just leave it alone, even
> after 10 and 100 commits to other content.)
>
> Aside from that, I'd have to say "it depends". The big-daddy content
> will get put into the same pack files as the small content, which means
> that extracting small content has to skip over those big blocks (you
> get better disk coherency if you split them up). We might even try to
> intermix the big content with the small content in a compression group,
> depending on how everything lays out.
>
> However, regardless of how it would work, I would expect the big
> content to dominate the time. I don't have specifics, but I would
> imagine the big-data repo will perform very similarly whether you
> split it out or not, while the small one would get a bigger boost.
>
> >
> > I guess what I'm really asking for is a primer on the big-O({N},{n})
> > scaling of bzr, where {N} is the ordered set of integers representing
> > the number of revisions of types of data (binary and text) and {n} is
> > the ordered set of average sizes of files of the corresponding type.  I
> > realize the question will have different answers for different
> > operations.  The most interesting to me are branch, commit and stat.  I
> > don't expect anyone to produce a treatise on this, but general advice
> > on whether and when it makes sense to break up repos based on size and
> > content types would be super useful.  Because no matter how fast you
> > guys make bzr, there will always be a user pushing the limits.  Knowing
> > how to keep ourselves out of trouble could be just as valuable as
> > knowing how to get out of trouble.
>
>
> bzr status
>  - Generally scales O(num_files) regardless of the size of those files.
>   (since 99% of the time we compare the os.stat value to the cached
>   value, and don't do further processing)
>   In the ~1% of cases where a file misses the stat check, we have to
>   read the file and compute its sha1 hash to see whether it has
>   changed.
>   This can happen after an update: we write the new timestamp, but it
>   may be 'too fresh' to trust that the filesystem would notice a
>   subsequent change in time, so we avoid caching it to work around
>   race conditions. (Someone can update/commit and then immediately
>   change the content of a file, say in a script, and we want to detect
>   that.)
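
[Roughly, the fast path described above looks something like the
following; the cache layout and names are invented for illustration,
and bzr's real dirstate code is considerably more involved.]

    import hashlib
    import os

    def file_changed(path, cached_fingerprint, cached_sha1):
        """Return True if the file appears changed.  ~99% of the time the
        stat fingerprint matches and we never read the file's content."""
        st = os.stat(path)
        fingerprint = (st.st_size, st.st_mtime, st.st_ctime)
        if fingerprint == cached_fingerprint:
            return False                    # fast path: stat matches cache
        with open(path, 'rb') as f:         # slow path: hash the content
            sha1 = hashlib.sha1(f.read()).hexdigest()
        return sha1 != cached_sha1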
>
> bzr branch
>  - Should scale ~ linearly with the size of the changes. Fresh commits
>   of large content may not have been repacked yet, which ends up
>   scaling with size(content) rather than size(delta).
>   This is probably one case where splitting up the repositories would
>   have a decent effect, mostly because you'd know when you were going
>   to be fetching a fair amount of data vs. when you would expect it to
>   be fast.
>   (Note that binary data is also deltaed; it may or may not delta well
>   depending on the content.)
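
[A back-of-the-envelope way to think about that fetch cost; all numbers
and names below are invented, and this is only a model of the scaling,
not bzr code.]

    def estimated_fetch_bytes(texts):
        """texts: iterable of (fulltext_size, delta_size, repacked) tuples.
        Repacked content transfers at roughly delta size; fresh large
        commits that haven't been repacked yet pay the full content size."""
        return sum(delta if repacked else full
                   for full, delta, repacked in texts)

    print(estimated_fetch_bytes([
        (50000000, 200000, True),   # big binary, already repacked
        (50000000, 200000, False),  # fresh big commit, not repacked yet
        (20000, 1000, True),        # small text file
    ]))
    # -> 50201000: the one un-repacked big commit dominates the transfer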
>
> bzr commit
>  - Should scale similarly to status at determining what to commit.
>  - Peak memory is generally 1 fulltext + 2 zlib compressed texts.
>  - Unless we trigger an autopack, in which case it can be ~4x the largest
>   fulltext. I don't have hard numbers here, though. I *do* know that we
>   can do better than we currently do, but a pretty hard limit will be
>   ~2x the size of the largest fulltext.
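
[Plugging John's multipliers into a trivial helper, just to make the
figures concrete; the zlib ratio is a made-up assumption and the
function is not part of bzr.]

    def commit_peak_estimate(largest_fulltext_mb, zlib_ratio=0.2, autopack=False):
        """Peak memory in MB: ~1 fulltext + 2 zlib-compressed texts normally,
        or ~4x the largest fulltext if an autopack is triggered."""
        if autopack:
            return 4 * largest_fulltext_mb
        return largest_fulltext_mb * (1 + 2 * zlib_ratio)

    # A 100 MB file: ~140 MB for a normal commit, ~400 MB if it autopacks.
    print(commit_peak_estimate(100), commit_peak_estimate(100, autopack=True))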
>
> John
> =:->


Thanks, John, for that concise overview of a complex topic.  (We all know
it's harder to be concise than to just dd a mountain of data, so special
thanks.)  It sounds like we would expect the time and memory differences
between P(Big) + P(small) and P(Big+small) to be relatively minor, depending
on the degree of mixing in individual packs and how often repacking is
triggered.  And this mixing may or may not depend sensitively on the
detailed distribution of texts.  So what I really take away is that there is
no penalty for breaking up the repo, because the cost (time or memory) of
P(Big)+P(small) should always provide a lower bound on the cost of
P(Big+small).  In other words, there's no break-up penalty (how could there
be, really?) and if we take care to avoid serialization at bottlenecks (e.g.
vfs or network) we might even win by doing some operations in parallel.
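
To put toy numbers on that last point (timings invented purely for
illustration):

    # If P(Big) takes ~60s and P(small) ~5s (made-up figures), the combined
    # repo shouldn't beat their sum, while split repos run serially cost
    # that same sum and, run in parallel, only the larger of the two.
    t_big, t_small = 60.0, 5.0
    print('serial split:  ', t_big + t_small)      # 65.0
    print('parallel split:', max(t_big, t_small))  # 60.0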

Thanks for sharing your insight!
~M