Out of memory on decompress

Karl Bielefeldt 7mq3cbbd9q at snkmail.com
Tue Sep 28 18:18:44 BST 2010


First, to introduce myself: I'm an embedded software developer by trade.  I have made various small contributions to free and open source software over the years, most notably making Cinelerra's undo stack much more memory efficient, implementing its shapewipe transition, and updating some of the macro documentation for the maptool project.  I consider those small contributions my way of "paying" for the software I most enjoy, even though I don't have the time for long-term commitments.

I would like my next contribution to help address some of the out-of-memory problems in bazaar.  I see these problems as falling into two broad categories.  The first is captured in bug 109114, which addresses putting large files into version control in the first place.  The second is more insidious: under certain circumstances, such as when large files are highly compressible, they can currently be put under version control but not gotten back out again.  This is captured in bug 602614, which hit me recently when I accidentally added my 1.2GB tags file and broke my repository.  It's this second case I would like to tackle.

Before I get too far into it, I want to make sure I'm not stepping on anyone's toes, and since I'm new here, I'd like to reach a general consensus on my approach to the problem.

My proposal is to create an iterator to use in lieu of get_bytes_as() for the cases when a text is too large to fit into memory all at once.  That iterator would return the text in manageable chunks, which the caller would process in a "for text in myiterator" loop.  I would like a fixed, but potentially user-configurable, cutoff point for the maximum fulltext size, such as 128 MB.  If the decompressed text is larger than that, you must use the iterator to process it in 128 MB chunks.  I think having a fixed cutoff will make debugging and support easier than only falling back after catching a memory exception: maximum application memory usage becomes more predictable, and you know from the file size alone which algorithms will be used to process it.  It also simplifies the implementation of the calling routines.
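
To make that concrete, here is a rough, self-contained sketch of the kind of iterator I have in mind.  It uses plain zlib as a stand-in for whatever decompressor the repository format actually uses, and none of these names exist in bzrlib today:

    import zlib

    MAX_FULLTEXT = 128 * 1024 * 1024   # example cutoff, potentially user-configurable

    def iter_decompressed(compressed, max_chunk=MAX_FULLTEXT):
        """Yield a decompressed text in chunks of at most max_chunk bytes,
        so the caller never has to hold the whole fulltext in memory."""
        d = zlib.decompressobj()
        data = compressed
        while data:
            chunk = d.decompress(data, max_chunk)  # cap the output size
            data = d.unconsumed_tail               # input not yet decompressed
            if chunk:
                yield chunk
        tail = d.flush()
        if tail:
            yield tail

Texts whose decompressed size is under the cutoff would keep going through get_bytes_as() unchanged; only oversized texts would be forced through the iterator.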

I'm still working on understanding the current "chunked" storage type implementation, but as far as I can tell, the essential difference between that and my proposal is that the iterator architecture ensures the memory of the previous chunk can be freed up before decompressing the next chunk.
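
In other words (invented helpers here, not bzrlib code), a chunked return value keeps every chunk alive at the same time, while a generator lets each chunk be collected as soon as the caller moves on:

    def chunks_as_list():
        # all ten ~10 MB strings exist together: ~100 MB resident
        return ['a' * (10 * 1024 * 1024) for _ in range(10)]

    def chunks_as_iterator():
        # only the current ~10 MB string needs to be resident
        for _ in range(10):
            yield 'a' * (10 * 1024 * 1024)

    for chunk in chunks_as_list():
        pass   # the list keeps all ten chunks alive for the whole loop
    for chunk in chunks_as_iterator():
        pass   # each chunk becomes collectable once the loop moves on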

If there is agreement on my general approach, my next step is to implement the iterator class, update the check --repo command to use it, and submit that for review.  Based on the feedback, the other commands can then be updated to use it as well.
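
As a rough idea of how such a consumer might look (again just a sketch with made-up plumbing, assuming check verifies text sha1s), the point is that validation only ever needs one chunk at a time:

    import hashlib

    def sha1_of_chunks(chunk_iter):
        """Hash a text delivered in chunks without materializing it."""
        s = hashlib.sha1()
        for chunk in chunk_iter:
            s.update(chunk)
        return s.hexdigest()

    # throwaway example in place of the real chunk iterator
    print(sha1_of_chunks(['large ', 'text ', 'in ', 'pieces']))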

I hope I'm not being too presumptuous here.  Thanks for your feedback, and I look forward to working with you all for a while.

--Karl Bielefeldt


