[MERGE] get_record_stream().get_bytes_as('chunked')

Thu Dec 11 03:33:20 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert's get_record_stream() API is very nice in allowing multiple ways
to represent data. However, it adds a bit of friction for certain
callers. Specifically, you either use the data in 'raw' form, or you
cast it up to 'fulltext'. But often the source already has the text as
'lines' and the target wants 'lines' as well, so the text gets joined
only to be split again.

Further, when extracting many texts from a single history (like say lots
of inventories), the extraction code is very good about sharing the
various strings, but once you cast up to a fulltext, all those bytes get
reallocated.
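
A minimal sketch of the reallocation described above, using only plain
Python 3 bytes objects. Nothing here is bzrlib API: 'source_lines' is a
hypothetical stand-in for line data the source already holds, and bytes
are used where the code under discussion would use Python 2 str.

```python
# Hypothetical stand-in for line data that the source already holds.
source_lines = [b"file-id-%d revision-%d\n" % (i, i) for i in range(3)]

# Route 1: cast up to a fulltext, then split again on the target side.
fulltext = b"".join(source_lines)         # allocates one new bytes object
target_lines = fulltext.splitlines(True)  # allocates a new object per line

# Route 2: hand the existing list across unchanged.
chunks = source_lines

# Both routes carry the same content...
same_content = target_lines == source_lines
# ...but only route 2 keeps pointing at the original string objects.
shared_after_cast = any(a is b for a, b in zip(target_lines, source_lines))
shared_as_chunks = all(a is b for a, b in zip(chunks, source_lines))
```

The content is identical either way; the difference is only in whether
the line objects are the ones the source already had in memory.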

In past discussions, we felt that "chunked" may be the best form. The
idea is that "chunked" is any sequence of strings, such that if you
want a fulltext, you can just do ''.join(chunks).

The reason to use this is that if you have a fulltext, you can
trivially convert it to chunks by wrapping it in a tuple. If you have
lines, then you also have chunks already. And finally, if you want to
read a file 50kB at a time, without regard for line endings, you again
have a valid list of chunks.
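
The three cases can be sketched as follows. This is an illustration
only: 'fulltext_to_chunks' and 'read_in_blocks' are hypothetical helper
names, not part of the patch, and Python 3 bytes stand in for Python 2
str.

```python
import io


def fulltext_to_chunks(text):
    """Wrap a fulltext in a tuple; no bytes are copied."""
    return (text,)


def read_in_blocks(fileobj, block_size=50 * 1024):
    """Read a file in fixed-size blocks, ignoring line endings."""
    chunks = []
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        chunks.append(block)
    return chunks


fulltext = b"first line\nsecond line\nthird line\n"

as_tuple = fulltext_to_chunks(fulltext)
as_lines = fulltext.splitlines(True)  # lines are already valid chunks
as_blocks = read_in_blocks(io.BytesIO(fulltext), block_size=8)

# Every form satisfies the one requirement: joining gives the fulltext.
rebuilt = [b"".join(c) for c in (as_tuple, as_lines, as_blocks)]
```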

As one further step, I wrote a "chunks_to_lines()" adapter. I wanted to
handle the case where source and target both talk in terms of 'lines',
but we are labeling them 'chunks'.

At first, I thought I might be able to get away with just a Python
implementation, but the best I could do was 2.5ms for NEWS, while a
compiled version took 0.45ms.
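
For illustration, one way a pure-Python adapter could look. This is a
sketch under assumptions, not the code in the patch: it assumes lines
end in '\n' only, that the input is a list or tuple, and it uses a
hypothetical 'split_lines' helper and Python 3 bytes. The point is that
when the chunks are already lines, the input is returned untouched.

```python
def split_lines(text):
    """Split on b'\\n' only, keeping the line endings (hypothetical helper)."""
    parts = text.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # final line without a trailing newline
    return lines


def chunks_to_lines(chunks):
    """Return chunks as lines, reusing the input if it is already lines."""
    seen_unterminated = False
    for chunk in chunks:
        # Not lines if: a chunk follows an unterminated one, a chunk is
        # empty, or a newline appears anywhere before the last byte.
        if seen_unterminated or not chunk or b"\n" in chunk[:-1]:
            break
        if not chunk.endswith(b"\n"):
            seen_unterminated = True  # only allowed on the final chunk
    else:
        return chunks  # fast path: nothing is copied
    # Slow path: build the fulltext and split it properly.
    return split_lines(b"".join(chunks))


already_lines = [b"one\n", b"two\n", b"three"]
fast = chunks_to_lines(already_lines)
slow = chunks_to_lines([b"one\ntw", b"o\nthree"])
```

In the fast path the cost is one scan over the chunks, which is the part
a compiled version would be expected to speed up.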

I decided to work on this right now, because I found that
"repo.revision_trees(revs[:100])" was consuming 300MB of RAM just for
the byte strings of the inventory texts. With this patch, the peak
memory consumption is just 59MB.

The only part I didn't work on was adding a new group of adapters that
can convert from whatever into "chunked" rather than "fulltext".

The other question is whether it should be "get_bytes_as('chunked')" or
"get_bytes_as('chunks')".

I called the function "chunks_to_lines()", so 'chunks' would read a
little bit nicer:  chunks_to_lines(get_bytes_as('chunks'))
Thoughts?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAklAigAACgkQJdeBCYSNAAP7lwCeOmdgMomhCpA2DXpcseefdi1W
r3kAoISC9qeei3WluyaVFu2ueBBVwtX2
=khrh
-----END PGP SIGNATURE-----