[MERGE] get_record_stream().get_bytes_as('chunked')

John Arbash Meinel john at arbash-meinel.com
Thu Dec 11 18:44:46 GMT 2008



Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
> 
> <snip/>
> 
>     jam> +        newline = <char *>memchr(c_str, c'\n', the_len)
>     jam> +        if newline != c_last and not (newline == NULL and cur == chunks_len):
>     >> 
>     >> Why don't you keep that construct in python ?
> 
>     jam> What construct is that?
> 
> The single iteration with a single test instead of special casing
> the last item (you may need to use enumerate in the list
> comprehension to get 'cur').
> 
>     jam> I basically kept iterating the python one until I could
>     jam> find the fastest ops I could. Then I did a quick test
>     jam> pyrex implementation, and realized it was 5x faster than
>     jam> the best python I could write, so I spent more time on
>     jam> it.
> 
> Not a big deal, I'd prefer a more readable python version anyway.
> 
>         Vincent
> 

I went with the attached. The python version is marginally slower, but I
imagine it is easier to read.
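For reference, the readable approach can be sketched roughly like this (modern Python, my own illustration rather than the attached patch): accept the chunks unchanged when each one is already a single line, and only re-split otherwise.

```python
def chunks_to_lines(chunks):
    """Ensure each chunk is exactly one line (a sketch, not the real patch).

    If every chunk already holds exactly one newline at its very end
    (the final chunk may lack the trailing newline), the input is
    returned unchanged; otherwise everything is re-split.
    """
    def _is_single_line(chunk, is_last):
        # A chunk counts as "a line" if its only '\n' is the last byte;
        # the final chunk is also allowed to have no newline at all.
        pos = chunk.find('\n')
        if pos == -1:
            return is_last
        return pos == len(chunk) - 1

    last = len(chunks) - 1
    if all(_is_single_line(c, i == last) for i, c in enumerate(chunks)):
        return chunks
    # Fall back: join and split into true lines, keeping line endings.
    return ''.join(chunks).splitlines(True)
```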

I also found a way to shave another 10% off the pyrex implementation.

It turns out that PyString_AsStringAndSize() can actually convert the object type, which means it does more checking than we need here. Using "PyString_CheckExact()" and then the macros "PyString_AS_STRING" and "PyString_GET_SIZE" is significantly faster (140us => 100us for doing 7.5k lines).

I also encountered an interesting gcc optimization: if you don't make use of the result of memchr(), gcc can actually optimize the call away entirely. For example, doing:

for XXX:
    newline = <char *>memchr(c_str, c'\n', the_len)
    if not newline:
        continue

gcc is smart enough to notice that the code takes the same path whether 'newline' is evaluated or not, so it doesn't need the result, and it skips the whole end of the function.

I also ended up implementing the split code in Pyrex. It takes about 1.2ms to split all of NEWS, versus 3ms for a split_lines(''.join()) or 2ms for just split_lines(text).

It also has the advantage of re-using any chunks that are already individual lines, even when not all of them are.
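That re-use idea looks roughly like this (plain Python, not the Pyrex code in the patch; the function name is my own): keep any chunk that is already exactly one line, split only the ones that aren't, and carry unterminated partial lines forward.

```python
def chunks_to_lines_reuse(chunks):
    # Sketch: reuse chunks that are already single lines, split the rest.
    result = []
    tail = ''  # carries an unterminated partial line across chunks
    for chunk in chunks:
        if tail:
            chunk = tail + chunk
            tail = ''
        if chunk.endswith('\n') and chunk.find('\n') == len(chunk) - 1:
            result.append(chunk)  # already exactly one line; reuse as-is
            continue
        lines = chunk.splitlines(True)
        if lines and not lines[-1].endswith('\n'):
            tail = lines.pop()  # incomplete last line, save for next chunk
        result.extend(lines)
    if tail:
        result.append(tail)  # final line with no trailing newline
    return result
```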

At the moment, it uses 2 loops, which has the disadvantage that you can't use an iterator. That reminds me of something I was originally hoping for in a "chunked" interface: streaming, i.e. being able to process the current chunk while waiting for the next chunk.

For example, imagine updating the working tree using "iter_files_chunks"
rather than "iter_files_bytes". We could stream the bytes for each file
out, rather than having to hold the entire content in memory.

It wouldn't be hard to change the code to do so. In Pyrex you need to
write an iterable class, but you basically just store the state on the
class, and move the while loop to being polled rather than driven directly.
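As an illustration of that shape (ordinary Python rather than Pyrex, and the class name is my own invention): the state lives on the instance, and each __next__ call polls for more chunks instead of a single driving loop.

```python
class IterChunksToLines:
    """Sketch of a streaming variant: yield lines as chunks arrive,
    instead of building the full list up front."""

    def __init__(self, chunk_iter):
        self._chunks = iter(chunk_iter)
        self._tail = ''      # unterminated partial line carried forward
        self._pending = []   # lines split out but not yet yielded

    def __iter__(self):
        return self

    def __next__(self):
        # Poll for more chunks until we have at least one full line.
        while not self._pending:
            try:
                chunk = self._tail + next(self._chunks)
            except StopIteration:
                if self._tail:
                    tail, self._tail = self._tail, ''
                    return tail  # final line without a trailing newline
                raise
            self._tail = ''
            # splitlines(True) keeps the line endings attached.
            lines = chunk.splitlines(True)
            if lines and not lines[-1].endswith('\n'):
                self._tail = lines.pop()
            self._pending = lines
        return self._pending.pop(0)
```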

If I change the code to be a single loop, it costs 50% more in the case where everything is already fine (you build up a list and then throw it away). That isn't great, but it is also only 0.2ms. If we really cared, we could use an isinstance check: do the double loop when we have a list or tuple, and skip the exactness check otherwise.

I suppose if we were using a streaming interface, you may not want to return the exact object. (Though you could keep the optimization that if the input is a list and already exactly lines, you return the list rather than the custom iterator.)

This is actually fast enough that I would consider re-implementing osutils.split_lines() in terms of chunks_to_lines. (Timing shows it to be about 2x faster.)
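That re-implementation is essentially a one-liner: treat the whole text as a single chunk. A minimal sketch (using a stand-in chunks_to_lines, not bzrlib's real one):

```python
def chunks_to_lines(chunks):
    # Minimal stand-in for the real implementation discussed above:
    # join and re-split, keeping line endings.
    return ''.join(chunks).splitlines(True)

def split_lines(text):
    # osutils.split_lines expressed in terms of chunks_to_lines:
    # the whole text is just a single chunk.
    return chunks_to_lines([text])
```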

Anyway, I think this is getting us moving in a direction we want to go.

John
=:->
Attachment: get_record_stream_chunked.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20081211/3cee6493/attachment-0001.diff

