large files and storage formats

Parth Malwankar parth.malwankar at gmail.com
Fri Jul 9 06:38:11 BST 2010


On Fri, Jul 9, 2010 at 9:07 AM, Chad Dombrova <chadrik at gmail.com> wrote:
> hi all,
> i've got some questions regarding bzr and large binary files.
> first of all, i've read about bzr's long-standing issues with large files
> (https://bugs.launchpad.net/bzr/+bug/109114).  while fixing this issue would
> be a worthy and noble cause, i have a fairly specific use case, and based on
> a lot of recent experience i know there's a *very* high probability that
> once this issue is fixed i'll run into other roadblocks with the current
> storage format.
> what interests me about bazaar is what the docs tout as its flexible
> architecture: that it "is cleanly layered to support multiple file formats".
>  that got me thinking: could i implement a more git-like loose object
> storage format into bazaar?
> for those who aren't familiar with git's loose object model, it works
> something like this:  blobs represent data, trees represent the location of
> data, a commit represents a change, and every object, regardless of type, is
> stored as a separate loose file in the store.
> this is great for working with large files for 2 reasons:
> 1) files can be moved/renamed without generating duplicate data in the
> object store: it's just a new tree object
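
[Editorial note: the loose-object layout described above can be sketched in a
few lines of Python. This is a simplified illustration of the scheme, not
git's actual implementation; the function name write_loose_blob is made up
for this example.]

```python
import hashlib
import os
import zlib


def write_loose_blob(objects_dir, content):
    # A git blob is stored as "blob <size>\0<content>", addressed by
    # the SHA-1 of that byte string, and zlib-compressed on disk at
    # objects/<first two hex digits>/<remaining 38 digits>.
    store = b"blob %d\x00" % len(content) + content
    sha = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return sha
```

Because the blob is addressed purely by its content, renaming a file only
requires writing a new (tiny) tree object that points at the same blob id;
the blob itself is never rewritten.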

Hi Chad,

The bzr 2a format seems to be quite efficient at handling moves and updates.
I did a small experiment moving and updating a 100MB file. The file holds
random data so it doesn't compress (which is what I wanted). The observation
is that while the disk usage of .bzr goes up after each commit, a pack brings
it back down to the file size. So, there is no lasting duplication of data
for moves. Note that bzr packs automatically; I just forced it for this
experiment.

[bigfile]% dd if=/dev/urandom of=urandom.data bs=$(( 1024 * 1024 )) count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 30.1808 s, 3.5 MB/s
[bigfile]% bzr init
Created a standalone tree (format: 2a)
[bigfile]% du -sh .bzr
84K     .bzr
[bigfile]% ll
total 100M
-rw-r--r-- 1 parthm parthm 100M 2010-07-09 10:13 urandom.data
[bigfile]% bzr add
adding urandom.data
[bigfile]% bzr ci -m urandom
Committing to: /home/parthm/tmp/bigfile/
added urandom.data
Committed revision 1.
[bigfile]% du -sh .bzr
101M    .bzr
[bigfile]% bzr mv urandom.data urandom1.data
urandom.data => urandom1.data
[bigfile]% bzr ci -m urandom1
Committing to: /home/parthm/tmp/bigfile/
renamed urandom.data => urandom1.data
Committed revision 2.
[bigfile]% du -sh .bzr
201M    .bzr
[bigfile]% bzr pack --clean-obsolete-packs
[bigfile]% du -sh .bzr
101M    .bzr
[bigfile]% echo "hello world" >> urandom1.data
[bigfile]% bzr st
modified:
  urandom1.data
[bigfile]% bzr ci -m "updated file"
Committing to: /home/parthm/tmp/bigfile/
modified urandom1.data
Committed revision 3.
[bigfile]% du -sh .bzr
201M    .bzr
[bigfile]% bzr pack --clean-obsolete-packs
[bigfile]% du -sh .bzr
101M    .bzr
[bigfile]%

I think one constraint in large file handling is memory usage (which is
bug #109114 you pointed to above). IIRC bzr currently needs 2-3x the file
size in memory. If this can be reduced, I would guess that the current 2a
format would work fine.
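
[Editorial note: the usual fix for that kind of whole-file memory blowup is
to process content in fixed-size chunks, so memory stays bounded regardless
of file size. A generic sketch of the idea follows; this is not bzr's actual
code, and the function name and chunk size are invented for the example.]

```python
import hashlib


def sha1_of_file(path, chunk_size=4 * 1024 * 1024):
    # Hash a file of arbitrary size with bounded memory: only one
    # chunk (here 4MB) is held in memory at a time, rather than a
    # 2-3x copy of the whole file.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```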

Regards,
Parth

> 2) it does not use delta compression, which is not time or size efficient on
> large binary files. blobs are compressed using zlib and the compression
> strength is configurable
> why don't i just use git?  i abhor the way that it is designed. i need a vcs
> that is user friendly, doubles as its own API, is easily extended, and is
> preferably written in python (with support for pure-python hooks).  so far
> bazaar seems to fit these requirements quite well.
> so, i'd like some honest opinions:
> - is bazaar really so well layered that new storage formats can be added
> without the need to rewrite higher level code?
> - how difficult a task is this, approximately, in man hours?  (keep in
> mind, git's object model has already been implemented in python
> (http://samba.org/~jelmer/dulwich/), so i'm mostly concerned with the time
> it would take to interface this with bazaar in all the right places.)
> i've looked through the docs and i can't find any information on how to get
> started on writing a new storage format (which i take as a sign that it is
> probably very difficult).  assuming that this goal is not laughably lofty,
> and that there are not other better alternatives, i'd love some guidance on
> how this might be pulled off.
> thanks,
> chad


