MemoryError on commit with large file

Joel Hardi joel at hardi.org
Fri Oct 5 20:41:36 BST 2007


Thanks a lot for the feedback on my question. I don't know enough
about diff algorithms to add anything to what John and Aaron have said
about making the default implementation use less memory, but it
certainly sounds good.

I guess I disagree somewhat with making the default implementation
"Do everything in memory and fail spectacularly" rather than "Slow but
reliable diff", with a command-line switch or config variable like
--always-use-memory for people who want the fast path. But I'm in no
position to complain, or to say anything more constructive than that.

As far as handling the crash goes, I like Rob's suggestion -- catch
MemoryError and then fail over to some alternate workflow, like
hashing the files to determine whether changes have taken place.
Although, when you're working with a lot of large files, it seems
inefficient to try to load them into memory in the first place, not to
mention the possible side effects on system performance: using up all
available memory can force the kernel to flush cache/buffer pages or
dip into swap. As a hack, you could just look at the amount of
physical memory and send anything larger than, say, 25% of system RAM
to the "alternate workflow" by default.

Anyway, my Python is OK, so I can take a crack at writing a patch to
catch the MemoryError and do something else. I'm totally new to
Bazaar, though, so if Rob or somebody could send me a few pointers to
the right places in the code, I'd appreciate it.

joel


On Oct 5, 2007, at 7:20 a.m., John Arbash Meinel wrote:

> Aaron Bentley wrote:
>> Robert Collins wrote:
>>> This will work for most cases, and will address the number of copies
>>> problem substantially, but we may still fall down on merge, which is
>>> somewhat trickier to reduce memory usage on.
>>
>> The main issue with merge is the sequence matching, and this also
>> affects diff.
>>
>> If we can substitute shorter values for lines (e.g. hashes), we can
>> potentially reduce the memory footprint of sequence matching by an
>> order of magnitude.
>>
>> Aaron
>>
>
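
(Just to check that I follow Aaron's idea: something like the toy
difflib example below, where the matcher only ever sees fixed-size
digests instead of the full line contents? I realize bzrlib has its
own matcher; difflib here is purely for illustration.)

import difflib
import hashlib

def opcodes_via_line_hashes(old_lines, new_lines):
    """Sequence-match over per-line SHA-1 digests rather than the lines
    themselves; the returned opcodes still index into the original
    lists. Lines are assumed to be byte strings as read from the file."""
    old_keys = [hashlib.sha1(line).digest() for line in old_lines]
    new_keys = [hashlib.sha1(line).digest() for line in new_lines]
    return difflib.SequenceMatcher(None, old_keys, new_keys).get_opcodes()
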
> Just a quick comment. Depending on what we are doing, xdelta is also
> designed to handle incremental updates. (As is zlib, etc.) The
> internals are a streaming interface. We would have to expose them
> (pyxdelta only exposes the 'compress/extract all at once' interface),
> but it would be possible to do.
>
> That doesn't help us for merging, or showing diffs to the user (which
> xdelta is not suitable for). But it would help with commit, and text
> extraction.
>
> But in general, we would need to have a lot of our codebase updated
> to handle the streaming concepts. Though working around the concept
> of a "text iterator" might do well enough. (Which could be a list of
> lines, or a file object, or a custom object that reads chunks at a
> time.)
>
> John
> =:->
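
The "text iterator" idea sounds right to me. I picture something like
the little generator below (again just a sketch with made-up names,
not bzrlib code): callers that currently walk a list of lines or read
a whole file object could iterate over this instead and never see more
than one chunk at a time.

def iter_file_chunks(path, chunk_size=64 * 1024):
    """Yield the file's contents one chunk at a time."""
    f = open(path, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
    finally:
        f.close()

# e.g. hashing a file without holding it all in memory:
#     h = hashlib.sha1()
#     for chunk in iter_file_chunks(path):
#         h.update(chunk)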



