large files and storage formats

Chad Dombrova chadrik at gmail.com
Fri Jul 9 07:19:23 BST 2010


> 
> I think one constraint in large file handling is the memory usage (which
> is the bug #109114 you pointed to above). IIRC bzr currently needs 2-3x
> the file size. If this can be reduced I would guess that the current 2a format
> would work fine.

i would like to stick with a native format, but ultimately i have other needs that might push me away from it.  where i work, we have a few users who are responsible for placing many large binary files (many greater than 500MB) under version control.  this main repo is then shared perhaps hundreds of times by other users who need to utilize -- in a read-only fashion -- the data therein.  each of these hundreds of shared repos could potentially check out a different revision into its working copy.  with a normal dvcs that means a LOT of data checked out, and a lot of time spent checking it out; for the shared repos, all of the disk space and time spent copying from repo to working copy is wasted, because their owners only need read-only access to the data.

so my thought was to create a loose object store as a first pass; it's fairly generic and others might find it useful.  a second, optional layer would add a set of tools to provide fast read-only access to objects: it would remove writability from blobs as they enter the store, so that they can be safely hard-linked or symbolically linked into working copies.  in this way, i can provide access to thousands of enormous files, hundreds of times over, with zero redundancy in the stores or the working copies.
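
as a rough illustration of what i have in mind, here is a minimal sketch (everything here -- the class name, the on-disk layout, the chunk size -- is hypothetical, not an existing tool):

import hashlib
import os
import shutil
import stat


class LooseObjectStore:
    """A minimal content-addressed store for large blobs.

    Each blob is stored exactly once, keyed by its SHA-1, and made
    read-only so it can be safely hard-linked into many working copies.
    """

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, oid):
        return os.path.join(self.root, oid[:2], oid[2:])

    def add(self, path):
        """Copy a file into the store and return its object id."""
        sha = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                sha.update(chunk)
        oid = sha.hexdigest()
        dest = self._path(oid)
        if not os.path.exists(dest):
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copyfile(path, dest)
            # strip write permission so hard-linked working-copy files
            # can't be modified out from under the store
            os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        return oid

    def link_into(self, oid, target):
        """Hard-link a stored blob into a working copy (no data copied)."""
        os.link(self._path(oid), target)

a checkout of any revision would then cost only directory entries and links, no matter how many shared repos point at the same store.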

> I'll note that if all you want is for content objects that are greater
> than some threshold to not be delta-compressed, you can do:
> 
> ....
> 

> That will leave you with repositories that are considered valid 2a
> format repositories, just not as 'packed' as we would normally make them.
> 
> I would guess there will be other places where our memory will be larger
> than you might like. But at least for the 'compressing 2 large blobs
> together takes too much memory' case, it would side step it.

John, thanks for this input.  so, normally if i commit new revisions of 3 files, they would be stored as 3 deltas compressed together into a single packfile (delta'd against previous commits stored in other packfiles)?  with this modification, would i end up with 3 full files in a single packfile, or 3 separate packfiles?  (sorry for all the newb questions; as i mentioned in my first response, i still haven't found a good explanation of the 2a format and i haven't had time to inspect it extensively.)

> 'large' in this case is >4MB.
> 
> You could probably even do a little bit better, by checking the length
> of the content before calling 'self._compressor.compress()', and
> choosing to start a new block right away.
> 
> We don't currently abstract that logic as much as would be nice. So if
> you want to play with that code a bit, and potentially make it more
> flexible, we'd probably be interested in patches.
> 
> You could, for example, create a configuration variable that would
> indicate various thresholds to use in the compression algorithm. This
> would allow people to set it to whatever they wanted in a given
> repository/branch/etc.
> 
> The best part is that it stays 2a compatible, so you don't have to worry
> about whether copies of bzr without your changes can read your disk format.


this is definitely a very appealing approach.  i would much rather adapt and contribute than start a new format from scratch, but ultimately i'd like to end up with something that is compatible with (or superior to) the idea outlined at the top, at least for the very large files.  do you think that it is feasible to wrangle 2a in this direction? 


thanks for all the great replies!

-chad

