Reducing peak memory for commit

John Arbash Meinel john at arbash-meinel.com
Mon Apr 19 22:34:05 BST 2010


While working on refactoring PackCollection, I came across some
inefficiencies in how we handle bytes.

A while back, I worked on decreasing the peak memory consumption for
commit, and I think I managed to actually get it down to a single copy
of the uncompressed content of a file.

However, it appears that we currently (as of bzr.dev 5165) hold one
uncompressed copy and three compressed copies. For content that doesn't
compress well this is significant (for incompressible content it is a 4x
peak memory requirement).

See also bug #109114

I can reproduce this with:

 dd if=/dev/urandom of=1mb bs=1024 count=1024
 echo "start of text" > 80mb
 for i in `seq 80`; do cat 1mb >> 80mb; done

 bzr init .
 bzr add 80mb
 bzr commit -Dmemory -m "80mb file"

Note that the repeated 1MB block is random data, and 1MB is far bigger
than zlib's 32KB window, so zlib can't do anything with it and the
compressed size is still ~80MB.
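
The first half of that (random bytes simply don't compress) is easy to
check outside of bzr entirely; something like:

  import os
  import zlib

  raw = os.urandom(1024 * 1024)       # 1MB of random bytes
  compressed = zlib.compress(raw)
  print(len(raw), len(compressed))    # the 'compressed' form is slightly *larger*

zlib falls back to stored blocks for incompressible input, so the output
is essentially the input plus a few bytes of framing per block.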

Peak memory for the commit is ~360MB.

Dumping the memory during the 'write to pack file' step shows four 80MB
strings:

1: 'B83911719\n\ngcb1z\n83911695\n83886099\nx\x9c\x00\x05@\xfa\xbff...
2: 'gcb1z\n83911695\n83886099\nx\x9c\x00\x05@\xfa\xbff\x8e\x80\x80(...
3: 'x\x9c\x00\x05@\xfa\xbff\x8e\x80\x80(start of text\n\xc9\x92\xe1...
4: 'start of text\n\xc9\x92\xe1\xa6\xac[\x85\xf6_9\xad\xbb\x8c\xc7...

4: appears to be the raw content of the file, stored in
 GroupCompressBlock._content_chunks = ['f', '\x8e\x80\x80', <4>]


3: appears to be
 GroupCompressBlock._z_content = <3>

 Which should be the zlib compressed content.

So it seems that GroupCompressBlock holds on to both _content_chunks and
_z_content simultaneously. Also note what
GroupCompressBlock._create_z_content_from_chunks does:

    def _create_z_content_from_chunks(self):
        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
        compressed_chunks = map(compressor.compress,
                                self._content_chunks)
        compressed_chunks.append(compressor.flush())
        self._z_content = ''.join(compressed_chunks)
        self._z_content_length = len(self._z_content)


The second-to-last line (the ''.join()) potentially allocates 2x
compressed-size memory, since the joined string exists alongside the
chunk list. I think it is reasonable to 'consume' the chunks as we do
the compression. Perhaps something like:

  chunks = self._content_chunks
  self._content_chunks = None
  chunks.reverse()
  compressed_chunks = []
  while chunks:
    # consume the raw chunks as we go, so their memory can be freed early
    next = chunks.pop()
    compressed_chunks.append(compressor.compress(next))
  next = None
  compressed_chunks.append(compressor.flush())
  self._z_content = ''.join(compressed_chunks)

I wonder whether collapsing the compressed content is worthwhile at all,
or whether we should be using self._z_content_chunks as the interface
instead. Note that 95% of the time when reading content we already have
it as a single string, and that the code is written so that we *can*
decompress a freshly compressed block (in case someone writes it and
then asks to read it again).
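
Purely as a sketch of that second option (the names here are
hypothetical, not existing bzrlib code), a chunk-based interface might
look something like:

    def _create_z_content_chunks(self):
        # Consume self._content_chunks while compressing, and keep the
        # compressed data as a list, so writers that accept chunks never
        # pay for a joined copy.
        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
        chunks = self._content_chunks
        self._content_chunks = None
        chunks.reverse()
        compressed_chunks = []
        while chunks:
            compressed_chunks.append(compressor.compress(chunks.pop()))
        compressed_chunks.append(compressor.flush())
        self._z_content_chunks = compressed_chunks
        self._z_content_length = sum(len(c) for c in compressed_chunks)

Readers that really need a single string could still ''.join() the
chunks lazily, while to_bytes()/write paths could pass the list straight
through.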


2: appears to be a group-compress byte-stream header, followed by the
   _z_content. This is generated in GroupCompressBlock.to_bytes():

    def to_bytes(self):
        """Encode the information into a byte stream."""
        self._create_z_content()
        if _USE_LZMA:
            header = self.GCB_LZ_HEADER
        else:
            header = self.GCB_HEADER
        chunks = [header,
                  '%d\n%d\n' % (self._z_content_length,
                                self._content_length),
                  self._z_content,
                 ]
        return ''.join(chunks)

I think it is part of the design that to_bytes() does not cause
GroupCompressBlock to stop referencing the data (i.e. it does not set
_z_content = None).

We have another ''.join(chunks) which is a 2x memory requirement.
'chunks' goes away pretty quickly, though.
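
A chunk-returning sibling of to_bytes() would avoid that copy entirely;
roughly (a sketch only, to_chunks() is not an existing method here):

    def to_chunks(self):
        """Sketch: like to_bytes(), but without the final ''.join()."""
        self._create_z_content()
        if _USE_LZMA:
            header = self.GCB_LZ_HEADER
        else:
            header = self.GCB_HEADER
        return [header,
                '%d\n%d\n' % (self._z_content_length,
                              self._content_length),
                self._z_content,
               ]

The caller would then hand the list to a chunk-aware writer instead of
joining it.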

1: appears to be the group-compress byte-stream, preceded by the pack
record header. This is generated in:

 bzrlib.pack.ContainerWriter.add_bytes_record
which calls:
 bzrlib.pack.ContainerSerialiser.bytes_record(bytes, names)
which writes a size-of-record field, then appends the 'name' fields, and
contains this big XXX:

>         # XXX: This causes a memory copy of bytes in size, but is usually
>         # faster than two write calls (12 vs 13 seconds to output a gig of
>         # 1k records.) - results may differ on significantly larger records
>         # like .iso's but as they should be rare in any case and thus not
>         # likely to be the common case. The biggest issue is causing extreme
>         # memory pressure in that case. One possibly improvement here is to
>         # check the size of the content before deciding to join here vs call
>         # write twice.
>         return ''.join(byte_sections)
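
The check that comment suggests could be as simple as something along
these lines (a sketch; the helper and threshold are made up, not bzrlib
code):

    _BIG_RECORD = 1024 * 1024   # hypothetical cut-off

    def write_record(write, byte_sections):
        if sum(len(s) for s in byte_sections) < _BIG_RECORD:
            # Small records: one write() call is faster, and the extra
            # copy is tiny.
            write(''.join(byte_sections))
        else:
            # Large records: skip the full-size joined copy entirely.
            for section in byte_sections:
                write(section)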


Note that ContainerSerialiser.write_func is actually a closure in
NewPack/GCPack that looks like:

        def _write_data(bytes, flush=False, _buffer=self._buffer,
            _write=self.write_stream.write, _update=self._hash.update):
            _buffer[0].append(bytes)
            _buffer[1] += len(bytes)
            # buffer cap
            if _buffer[1] > self._cache_limit or flush:
                bytes = ''.join(_buffer[0])
                _write(bytes)
                _update(bytes)
                _buffer[:] = [[], 0]

So it can, once again, cause a buffer to get joined. When I was doing
commit, though, self._cache_limit was always 0, so it never actually
built up a buffer to join.

My feeling is that a good way to handle this would be to change the
Transport.open_write_stream() API, such that the returned object must
have a 'writelines()' method (rather than just .write()).

And then to go through all these stacks and change them to pass 'chunks'
rather than 'bytes'.

It should mean that in a lot of places where we currently do:

buffer = []
buffer.append('my new header')
buffer.append(passed_in_content)
return ''.join(buffer)

we can just do:

buffer = []
buffer.append('my new header')
buffer.extend(passed_in_chunks)
return buffer

And then the final call is to file.writelines(), which is effectively a
'chunked' interface: it does not add '\n' itself. (The original idea was
that x.writelines(y.readlines()) would work, but it suits our purpose
just as well.)
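
With such a stream available, the _write_data closure quoted earlier
could stay chunk-based all the way down; roughly (a sketch under that
assumption, not existing code):

        def _write_data(chunks, flush=False, _buffer=self._buffer,
            _writelines=self.write_stream.writelines,
            _update=self._hash.update):
            _buffer[0].extend(chunks)
            _buffer[1] += sum(len(c) for c in chunks)
            # buffer cap
            if _buffer[1] > self._cache_limit or flush:
                _writelines(_buffer[0])
                for chunk in _buffer[0]:
                    _update(chunk)
                _buffer[:] = [[], 0]

Feeding the hash one chunk at a time gives the same digest as hashing
the joined string, so nothing here ever needs the full record in one
piece.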


Note that there are a couple more *possible* memory duplications that we
are 'lucky' not to be triggering right now.

_DirectPackAccess.add_raw_records:
          for key, size in key_sizes:
            p_offset, p_length = self._container_writer.add_bytes_record(
                raw_data[offset:offset+size], [])
            offset += size
            result.append((self._write_index, p_offset, p_length))

This takes the raw data buffer, and slices it back up into little
buffers to get written to the output. It doesn't affect us today because:

 a) When we commit we create a single GCB per text being committed, so
    we always only add 1 record at a time.
 b) (raw_data[x:y] is raw_data) when x == 0 and y >= len(raw_data)
    (PyString doesn't chop itself up if you slice it to full size; see
    the quick check below)
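
A quick interactive check of b) (a CPython implementation detail, but
one we can rely on here):

  >>> raw_data = 'abcdef'
  >>> raw_data[0:len(raw_data)] is raw_data
  True
  >>> raw_data[0:3] is raw_data
  False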

GCPack/NewPack _write_data
Buffers if self._cache_limit is >0, and uses ''.join() before writing
the buffered data out. Doesn't affect us today because:

 a) _cache_limit is always 0 during commit.

If/when we get commit working via an 'insert_record_stream' sort of API,
both of these will become an issue.

There is also an extra issue during 'bzr pack' (and autopack, etc.),
which is that the GroupCompress code uses a data structure that is
pretty 'loose' with memory, assuming that it won't exist for very long.
It is especially bad if the content being compressed is repetitive,
because hash groups are always given N free spaces: repetitive data
causes one group to overflow more often, while the other groups stay
empty.

I don't have any sort of ETA on when we could fix this, or a great idea
of how much effort it would take. One problem is that it affects a lot
of layers, which causes API friction/churn.

I do have one quick patch which should bring it down to 3x compressed
size, versus the current 1x raw + 3x compressed.

John
=:->