BTree + CHK Inefficiencies

John Arbash Meinel john at arbash-meinel.com
Fri Aug 6 22:46:29 BST 2010


Maritza Mendez wrote:
> Related question for you guys...
> 
> Given that memory consumption associated with and/or triggered by a
> given file has (at least) three drivers -- actual size of the file;
> number of revisions of the file; and dynamic range of sizes of
> deltas-over-time committed for the file -- is there any advantage to
> splitting large versioned binaries into their own repo? 
> 
> Suppose we split our BigDaddyRepo into two repos: BigBinariesRepo (still
> pretty big) and MostlyTextRepo (perhaps only 20% as big).
> 
> Obviously, the small repo gets a huge advantage in size and speed.  But
> suppose that BigBinariesRepo really isn't static and has commits just as
> frequently as MostlyTextRepo.  (Imagine a product which is
> database-driven with frequent versioned updates to the databases.)  So
> there is just as much activity as before.  Do you expect the aggregate
> of the two repos to perform better, worse, or the same as one
> BigDaddyRepo?  In other words, is there any advantage to combined
> storage, and to what extent is that benefit negated by mixing binary and
> text data in a single repo?  Or are the two sets of data+history
> essentially non-interacting?

I would expect that with a mixed repository, the big content will be
autopacked slightly more often than if the repositories were separate.
Any given content is generally repacked in a 'log' fashion (every 10th
commit, then every 100th, then every 1000th, and so on). So it probably
would be repacked more often, but only on the order of 10% more often,
not 10x more often.
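
To make that 'log' schedule concrete, here is a rough Python sketch of
the distribution it aims for. This is only an illustration, not
bzrlib's actual autopack code:

def ideal_pack_sizes(revision_count):
    """Illustrative log-style pack distribution (not bzrlib's code).

    For 2314 revisions this returns
    [1000, 1000, 100, 100, 100, 10, 1, 1, 1, 1]: two packs of ~1000
    revisions, three of ~100, one of 10, and four single-revision
    packs. Older content only gets rewritten when a 'digit' rolls over
    (every 10th, 100th, 1000th commit, ...), so any given text is
    repacked roughly O(log(num_revisions)) times over the life of the
    repository.
    """
    sizes = []
    magnitude = 1
    while revision_count:
        digit = revision_count % 10
        sizes = [magnitude] * digit + sizes
        revision_count //= 10
        magnitude *= 10
    return sizes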

We don't (yet) pay attention to the actual size of the pack files,
though there is an open bug requesting this. (The biggest effect is in
initial-import cases, where you'll have a 1-revision pack file that has
a huge amount of content, and you'd rather just leave it alone, even
after 10 or 100 more commits to other content.)

Aside from that, I'd have to say "it depends". The big-daddy content
will get put into the same pack-files as the small content, which means
that to extract the small content you have to skip over those big
blocks (you'd get better disk coherency if you split them up). We might
even end up intermixing the big content with the small content in a
single compression group, depending on how everything lays out.

However, regardless of how it works out, I would expect the big content
to dominate the time. I don't have specifics, but I would imagine that
the big-data repo will perform very similarly whether you split it out
or not, while the small one would get a bigger boost.

> 
> I guess what I'm really asking for is a primer on the big-O({N},{n})
> scaling of bzr, where {N} is the ordered set of integers representing
> the number of revisions of types of data (binary and text) and {n} is
> the ordered set of average sizes of files of the corresponding type.  I
> realize the question will have different answers for different
> operations.  The most interesting to me are branch, commit and stat.  I
> don't expect anyone to produce a treatise on this, but  general advice
> on whether and when it makes sense to break up repos based on size and
> content types would be super useful.  Because no matter how fast you
> guys make bzr, there will always be a user pushing the limits.  Knowing
> how to keep ourselves out of trouble could be just as valuable as
> knowing how to get out of trouble.


bzr status
 - Generally scales O(num_files) regardless of the size of those files,
   since ~99% of the time we just compare the os.stat value to the
   cached value and don't do any further processing.
   In the remaining ~1% of cases a file misses the stat check, and we
   have to read the file and compute its sha1 hash to see whether it
   has changed (see the sketch below).
   This can happen after an update: we would write the new timestamp,
   but it may be 'too fresh' to trust that the filesystem would notice
   a subsequent update in time, so we avoid caching it to work around
   race conditions. (Someone can update/commit and then immediately
   change the content of a file, say in a script, and we want to detect
   that.)
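
   For illustration, a minimal sketch of that fast path / slow path
   split (the names and cache layout here are made up, not the actual
   dirstate API):

import os
import hashlib

def file_changed(path, cached_fingerprint, cached_sha1):
    # cached_fingerprint is assumed to be (size, mtime, ctime) recorded
    # at the last commit/update; cached_sha1 is the hash of that content.
    st = os.stat(path)
    if (st.st_size, st.st_mtime, st.st_ctime) == cached_fingerprint:
        # Fast path (~99% of files): no file read at all.
        return False
    # Slow path: read the file and hash it to see if it really changed.
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest() != cached_sha1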

bzr branch
 - Should scale ~ linearly with the size of the changes. Fresh commits
   of large content may not have been repacked yet, which ends up
   scaling with size(content) rather than size(delta).
   This is probably one case where splitting up the repositories would
   have a decent effect, mostly because you'd know when you were going
   to be fetching a fair amount of data vs. when you would expect it to
   be fast.
   (Note that binary data is also delta-compressed; it may or may not
   delta well depending on the content.)

bzr commit
 - Should scale similarly to status when determining what to commit.
 - Peak memory is generally 1 fulltext + 2 zlib-compressed texts.
 - Unless we trigger an autopack, where it can be ~4x the largest
   fulltext. I don't have hard numbers here, though. I *do* know that
   we can do better than we currently do, but a pretty hard floor is
   ~2x the size of the largest fulltext. (A rough estimate is sketched
   below.)
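
As a back-of-the-envelope check on those numbers (the compression ratio
here is just a guess; binary content often compresses much worse than
text):

def commit_peak_estimate(largest_fulltext_bytes, compression_ratio=0.3):
    # '1 fulltext + 2 zlib-compressed texts' from the figures above.
    return largest_fulltext_bytes * (1 + 2 * compression_ratio)

# A 500 MB file at ~30% compression gives ~800 MB during a normal
# commit, with a worst case of ~4x the fulltext (~2 GB) if that commit
# also triggers an autopack.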

John
=:->