MemoryError on commit with large file
Joel Hardi
joel at hardi.org
Fri Oct 5 20:41:36 BST 2007
Thanks a lot for the feedback on my question. I don't know enough
about diff algorithms to add anything to what John/Aaron have said
about making the default implementation use less memory, but it
certainly sounds good.
I guess I disagree somewhat with the default implementation being "do
everything in memory and fail spectacularly" rather than "slow but
reliable diff", with a command-line switch or config variable such as
--always-use-memory for the in-memory path. But I'm in no position to
complain, or to say anything more constructive than that.
As far as handling the crash goes, I like Rob's suggestion: catch
MemoryError and then fail over to some alternate workflow, like
hashing the files to determine whether changes have taken place.
Although in a case where you're working with a lot of large files, it
seems inefficient to try to load them into memory in the first place,
not to mention the possible side effects on system performance: by
using up all available memory, you may force the kernel to flush
cache/buffer pages or start swapping. As a hack, you could just look
at the amount of physical memory and send anything larger than, say,
25% of system RAM to the "alternate workflow" by default.
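The hashing fallback is cheap to sketch. Here's a minimal illustration
of the idea (function names are mine, not Bazaar's; chunked reading
keeps memory use bounded no matter how large the file is):

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_since(path, known_digest):
    """Cheap change check that never loads the whole file into memory."""
    return file_digest(path) != known_digest
```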
Anyway, my Python is OK, so I can take a crack at writing a patch to
catch the MemoryError and do something else. I'm totally new to
Bazaar, so if Rob or somebody could send me a few pointers to the
right places in the code, I'd appreciate it.
joel
On Oct 5, 2007, at 7:20 a.m., John Arbash Meinel wrote:
>
> Aaron Bentley wrote:
>> Robert Collins wrote:
>>> This will work for most cases, and will address the number of copies
>>> problem substantially, but we may still fall down on merge, which is
>>> somewhat trickier to reduce memory usage on.
>>
>> The main issue with merge is the sequence matching, and this also
>> affects diff.
>>
>> If we can substitute shorter values for lines (e.g. hashes), we can
>> potentially reduce the memory footprint of sequence matching by an
>> order of magnitude.
>>
>> Aaron
>>
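(As an aside: Aaron's substitution idea can be sketched with difflib.
This version interns each distinct line to a small integer token,
which is collision-free, whereas real hashes would need collision
handling. Names are illustrative, not Bazaar code:)

```python
import difflib

def hashed_opcodes(a_lines, b_lines):
    # Map each distinct line to a small int token, so the sequence
    # matcher compares ints instead of keeping full line strings live.
    table = {}
    def tokenize(lines):
        return [table.setdefault(line, len(table)) for line in lines]
    a_tok, b_tok = tokenize(a_lines), tokenize(b_lines)
    return difflib.SequenceMatcher(None, a_tok, b_tok).get_opcodes()
```

Because tokenization preserves line equality exactly, the opcodes come
out identical to matching the raw lines.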
>
> Just a quick comment. Depending on what we are doing, xdelta is also
> designed to handle incremental updates (as is zlib, etc.). The
> internals are a streaming interface. We would have to expose them
> (pyxdelta only exposes the 'compress/extract all at once' interface),
> but it would be possible to do.
>
> That doesn't help us for merging, or showing diffs to the user (which
> xdelta is not suitable for). But it would help with commit and text
> extraction.
>
> But in general, we would need to have a lot of our codebase updated
> to handle the streaming concepts. Though working around the concept
> of a "text iterator" might do well enough. (Which could be a list of
> lines, a file object, or a custom object that reads chunks at a
> time.)
>
> John
> =:->
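John's "text iterator" concept might look something like this minimal
sketch (purely illustrative, not Bazaar code): accept a list of lines,
a file object, or any iterable, and always yield bounded chunks.

```python
def iter_chunks(source, chunk_size=65536):
    """Yield text from a file-like object in bounded chunks, or pass
    through an existing iterable of lines/chunks unchanged."""
    if hasattr(source, "read"):
        # File-like object: read a fixed amount at a time.
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            yield chunk
    else:
        # Already an iterable (e.g. a list of lines): pass through.
        for piece in source:
            yield piece
```

Code written against such an iterator never assumes the whole text is
in memory at once, which is the point of the streaming concept.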
More information about the bazaar mailing list