2GB limit

Maritza Mendez martitzam at gmail.com
Mon Oct 4 22:36:14 BST 2010


Ah.  Thanks for the correction.  I think I understand now.

On 10/3/10, Martin Geisler <mg at aragost.com> wrote:
> Maritza Mendez <martitzam at gmail.com> writes:
>
>> On Sat, Oct 2, 2010 at 5:24 PM, Martin Geisler <mg at aragost.com> wrote:
>>
>>> Maritza Mendez <martitzam at gmail.com> writes:
>>>
>>> > 2. Does anyone know if any other DVCS has solved the VM
>>> > problem? If so, we might put our "big file" projects in git or
>>> > Mercurial until bzr can handle them.
>>>
>>> Mercurial is also not designed for working with very large files
>>> since it loads them into memory when merging and when computing
>>> diffs. People in #mercurial tell me that Git has the same limitation.
>>>
>>> However, we have several extensions that people use to tackle this
>>> problem. The extensions all use the same basic idea: let Mercurial
>>> track a small file that has a reference to the big file.
>>>
>>> When you check out a particular revision with 'hg update', the
>>> extension will notice that you are checking out a certain version of the
>>> small file. It then follows the reference to the big file and writes
>>> that into your working copy instead of the small file.
>>>
>>> The big files are stored on an HTTP server or a shared network drive
>>> or similar -- the idea being that you will set up a central server
>>> that has enough disk space to keep all versions of the big files
>>> around. The clients only download the one version of the big files
>>> they need.
>>>
>>> Here are links to two such extensions which are used in production:
>>>
>>>  http://mercurial.selenic.com/wiki/BfilesExtension
>>>  http://mercurial.selenic.com/wiki/SnapExtension
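
To make the mechanism described above concrete, here is a minimal sketch
of the standin idea in Python. The store location, helper names, and use
of SHA-1 are assumptions for illustration only; the actual bfiles and
snap extensions have their own formats and transfer protocols.

    import hashlib
    import os
    import shutil

    # Illustrative only: the store path, helper names, and hashing
    # scheme are assumptions, not the real bfiles/snap implementation.
    STORE = "/mnt/bigfile-store"   # shared drive; an HTTP store works too

    def write_standin(bigfile, standin):
        """Record the big file's hash in a small tracked standin file."""
        h = hashlib.sha1()
        with open(bigfile, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)    # hash 1 MB at a time, never the whole file
        digest = h.hexdigest()
        with open(standin, "w") as f:
            f.write(digest + "\n")
        # push a full copy of the big file to the store under its hash
        shutil.copy(bigfile, os.path.join(STORE, digest))

    def checkout_bigfile(standin, target):
        """What an update hook would do: follow the standin's reference."""
        with open(standin) as f:
            digest = f.read().strip()
        shutil.copy(os.path.join(STORE, digest), target)
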
>>
>> Thanks. I skimmed the documentation at the links you sent. My
>> philosophy is that committing binaries should be a rare use-case and
>> merging binaries should be a non-existent use-case. So something like
>> bfiles could work for me. It sounds like the bfile server starts with
>> an initial copy of the bfile and maintains a labeled sequence of
>> deltas and maybe occasionally stores a full copy to trade storage for
>> speed.
>
> No, the server holds full versions of the files. What you describe there
> is essentially how our normal "revlog" format works.
>
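
As a contrast, the revlog scheme mentioned above can be pictured as a
chain of deltas applied on top of an occasional full snapshot. The toy
sketch below only illustrates that principle; it is not the real revlog
format or delta algorithm.

    # Toy illustration of the delta-chain idea: occasional full
    # snapshots plus deltas; a revision is rebuilt by replaying its chain.
    def reconstruct(snapshot, deltas):
        """Apply (offset, old_length, new_bytes) patches in order."""
        data = snapshot
        for offset, old_len, new_bytes in deltas:
            data = data[:offset] + new_bytes + data[offset + old_len:]
        return data

    base = b"first version of the file\n"
    chain = [(0, 5, b"second"),    # delta producing revision 1
             (0, 6, b"third")]     # delta producing revision 2
    assert reconstruct(base, chain) == b"third version of the file\n"
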
>> It sounds like the deltas are being computed client-side and passed to
>> the server. Is that right? If so, then there must already be a
>> bfile-diff-engine on the client. And since the client may have a
>> 32-bit VM space, I'm guessing that the diff works in segments.
>
> That would be the ideal way to do it: make Mercurial compute diffs in a
> streaming fashion where it only ever loads a small segment of the file
> into memory.
>
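
A segment-wise approach along those lines could look like the sketch
below: process both files in fixed-size pieces so that memory use stays
bounded no matter how large the files are. This only illustrates the
streaming idea; it is not Mercurial's actual diff code.

    # Decide whether two large files differ while holding only one
    # small buffer per file in memory at any time.
    def files_differ(path_a, path_b, segment=1 << 20):
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a = fa.read(segment)   # at most ~1 MB per file in memory
                b = fb.read(segment)
                if a != b:
                    return True
                if not a:              # both files exhausted together
                    return False
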
>> So it seems like the local problem was solved already. The real
>> benefit of bfiles seems to be that the bulky history of binary files
>> is confined to the server and does not gum up the network and all the
>> clients. Do I have that right?
>
> That is the key advantage of both extensions: they give you a hybrid
> between centralized and distributed revision control. Centralized
> revision control is good at keeping track of huge files since it's only
> the central server that must carry the burden of storing all revisions.
>
> --
> Martin Geisler
>
> aragost Trifork
> Professional Mercurial support
> http://aragost.com/mercurial/
>


