large files and storage formats

John Arbash Meinel john at arbash-meinel.com
Fri Jul 9 14:54:36 BST 2010


Chad Dombrova wrote:
>>
>> I think one constraint in large file handling is the memory usage (which
>> is the bug #109114 you pointed to above). IIRC bzr currently needs 2-3x
>> the file size. If this can be reduced I would guess that the current 2a
>> format would work fine.
> 
> i would like to stick with a native format, but ultimately i have other
> needs that might push me away from it.  where i work we have a few users
> who are responsible for placing many large binary files (many greater
> than 500MB) under version control.  then this main repo is shared
> perhaps hundreds of times by other users who need to utilize -- in a
> read-only fashion -- the data therein.  each of these hundreds of
> shared repos could potentially check out a different revision into their
> working copy.  with a normal dvcs that means a LOT of data checked out,
> and a lot of time spent checking it out; but for the shared repos, all
> of the disk space and time spent copying from repo to working copy is a
> waste, because the owners of these repos only need read-only access to
> the data.

Have you looked into lightweight checkouts?

Also, 'bzr co/branch --hardlink' will hardlink the working tree files.
You would still end up with one copy in a repository store somewhere, and
one copy in the working tree, but all of the working trees can share that
copy.
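
For instance (paths made up), the second branch below gets its file
content as hardlinks of the first, provided both trees live on the same
filesystem:

    $ bzr branch /srv/repo/main wt1
    $ bzr branch --hardlink wt1 wt2   # wt2's files are hardlinks of wt1's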

...

>> I would guess there will be other places where our memory will be larger
>> than you might like. But at least for the 'compressing 2 large blobs
>> together takes too much memory' case, it would side step it.
> 
> John, thanks for this input.  so, normally if i commit new revisions of
> 3 files, i'd get a single packfile containing 3 deltas compressed
> together (delta'd against previous commits stored in other
> packfiles)?  with this modification would i end up with 3 full files in
> a single packfile or 3 separate packfiles?  (sorry for all the newb
> questions, as i mentioned in my first response i still haven't found a
> good explanation of the 2a format and i haven't had time to inspect it
> extensively).

A single pack file only contains deltas against files inside that pack
file. When we autopack or you manually issue 'bzr pack', we combine the
existing pack files into a larger one, and create deltas inside of that.

With this modification, you would likely end up with 3 'blocks' inside 1
pack file.
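
You can trigger that consolidation by hand:

    $ bzr pack    # combine the packs under .bzr/repository/packs into
                  # one, recomputing the deltas inside the new pack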

Also note that pack files have a fair amount of metadata overhead (start
of record, record length, etc.), and they also zlib-compress the content.
As such, you certainly can't get the bytes out directly (for something
like the hardlinking you mentioned).
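
As a rough illustration of why (plain Python, not bzr's actual record
format):

    import zlib

    raw = b'some large file content' * 100000
    stored = zlib.compress(raw)             # roughly what a pack record holds
    assert stored != raw                    # on-disk bytes != working tree
                                            # bytes, so nothing to hardlink to
    assert zlib.decompress(stored) == raw   # you must decompress to read it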



> 
>> 'large' in this case is >4MB.
>>
>> You could probably even do a little bit better, by checking the length
>> of the content before calling 'self._compressor.compress()', and
>> choosing to start a new block right away.
>>
>> We don't currently abstract that logic as much as would be nice. So if
>> you want to play with that code a bit, and potentially make it more
>> flexible, we'd probably be interested in patches.
>>
>> You could, for example, create a configuration variable that would
>> indicate various thresholds to use in the compression algorithm. This
>> would allow people to set it to whatever they wanted in a given
>> repository/branch/etc.
>>
>> The best part is that it stays 2a compatible, so you don't have to worry
>> about whether copies of bzr without your changes can read your disk format.
> 
> this is definitely a very appealing approach.  i would much rather adapt
> and contribute than start a new format from scratch, but ultimately i'd
> like to end up with something that is compatible with (or superior to)
> the idea outlined at the top, at least for the very large files.  do you
> think that it is feasible to wrangle 2a in this direction? 
> 
> 
> thanks for all the great replies!
> 
> -chad
> 
> 
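
To make my earlier suggestion concrete, the length check could look
roughly like this (a sketch only; the names and structure below are
illustrative, not bzr's actual groupcompress API):

    LARGE_CONTENT = 4 * 1024 * 1024   # the >4MB threshold mentioned above

    class BlockingCompressor(object):
        """Group small texts into shared blocks; isolate large texts."""

        def __init__(self, threshold=LARGE_CONTENT):
            # threshold could be read from the per-repository
            # configuration variable suggested above
            self.threshold = threshold
            self.blocks = []          # finished groups of (key, content)
            self.current_block = []   # the group still being filled

        def _start_new_block(self):
            if self.current_block:
                self.blocks.append(self.current_block)
                self.current_block = []

        def compress(self, key, content):
            if len(content) >= self.threshold:
                # Check the length *before* compressing: a large text
                # gets a block of its own, so two big texts are never
                # delta'd against each other in memory.
                self._start_new_block()
                self.blocks.append([(key, content)])
            else:
                self.current_block.append((key, content))

That stays 2a compatible because it only changes how content is grouped
into blocks, not the bytes-on-disk format itself.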

I don't think you'll be able to make it such that the file in a 2a
repository is a hardlink to the file in the working tree. Ultimately
there is quite a bit of risk in that: accidentally modify the working
tree copy, and suddenly your archive is corrupted for everyone, with no
other copy to easily restore from. And on those rare occasions when you
really do need to modify the content, precisely because it is rare,
someone will forget to break the hardlink first.
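
Breaking the link safely means copying the content aside and renaming the
copy back over the original (a sketch; 'big.bin' is a stand-in path):

    import os, shutil

    shutil.copy2('big.bin', 'big.bin.tmp')  # new inode, same content
    os.rename('big.bin.tmp', 'big.bin')     # atomic on POSIX; any other
                                            # hardlinked path keeps the
                                            # old inode untouched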

With 'bzr co --lightweight' and '--hardlink', you can easily get to the
point where there is one copy in the centralized repository, and all of
the working trees share hardlinked data.
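
Concretely, something along these lines (paths made up; --files-from
names an existing tree to take file content from):

    $ bzr checkout --lightweight /srv/repo/trunk tree1
    $ bzr checkout --lightweight --hardlink --files-from=tree1 \
          /srv/repo/trunk tree2   # tree2 hardlinks tree1's file content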

'bzr co --lightweight' isn't as fast as it could be, but it is decent,
and IMO would be much more worth your time than a new storage format.

John
=:->

